├── Chapter_1_IPython.md ├── Chapter_2_NumPy.md ├── Chapter_3_pandas.md ├── Chapter_4_Matplotlib.md ├── Chapter_5_Machine_Learning.md ├── LICENSE-TEXT └── README.md /Chapter_1_IPython.md: -------------------------------------------------------------------------------- 1 | # Chapter 1: IPython 2 | 3 | ## General help 4 | 5 | `thing?` shows docstring of `thing`. 6 | 7 | `thing??` shows source of `thing` (unless in some compiled language, 8 | e.g. C extensions). 9 | 10 | ## Tab completion 11 | 12 | ### Of object attributes and methods 13 | 14 | `obj.` or `obj._` for internal attributes/methods. 15 | 16 | ## Wildcard matching 17 | 18 | `*Warning?` lists objects in the namespace that end with Warning. 19 | 20 | `*` matches any string, including ''. 21 | 22 | `str.*find*?` 23 | 24 | ## Shortcuts 25 | 26 | ### Navigation 27 | 28 | Ctrl-a: Move cursor to beginning of line. 29 | Ctrl-e: Move cursor to end of line. 30 | Ctrl-b (or left arrow): Move cursor back one character. 31 | Ctrl-f (or right arrow): Move cursor forward one character. 32 | 33 | ### Text entry 34 | 35 | Ctrl-d: Delete next character in line. 36 | Ctrl-k: Cut text from cursor to end of line. 37 | Ctrl-u: Cut text from beginning of line to cursor. 38 | Ctrl-y: Yank (paste) text that was cut. 39 | Ctrl-t: Transpose previous two characters. 40 | 41 | ### Command history 42 | 43 | Ctrl-p (or up arrow): Access previous command in history. 44 | Ctrl-n (or down arrow): Access next command in history. 45 | Ctrl-r: Reverse search through history. 46 | 47 | ### Miscellaneous 48 | 49 | Ctrl-l: Clear terminal screen. 50 | Ctrl-c: Interrupt Python command. 51 | Ctrl-d: Exit IPython. 52 | 53 | ## Magic functions 54 | 55 | Enhancements over standard Python shell. 56 | 57 | Prefixed by `%`. 58 | 59 | Line magics: single `%`, operate on a line of input. Function gets this 60 | line as an argument. 61 | 62 | Cell magics: double `%%`, operate on multiple input lines. Function gets 63 | the first line as an argument, and the lines below as a separate 64 | argument. 65 | 66 | ### Help for magic functions 67 | 68 | See `%magic` and `%lsmagic`. 69 | 70 | ### Pasting code blocks 71 | 72 | `%paste`: pasting preformatted code block from elsewhere, e.g. with 73 | interpreter markers. 74 | 75 | `%cpaste`: similar, can paste one or more chunks 76 | of code to run in batch. 77 | 78 | ### Running external code 79 | 80 | `%run`: Runs Python file as a program. Anything defined in there, is 81 | then available in IPython subsequently (unless you run the code with the 82 | profiler via -p). 83 | 84 | ### Timing code execution 85 | 86 | `%timeit` times how long a line of code takes to execute. 87 | 88 | `%%timeit` is the cell magic version, can specify multiple lines of code 89 | to execute. 90 | 91 | ## Input and output history 92 | 93 | ### In and Out 94 | 95 | `In` is a list of commands. 96 | 97 | `Out` is a dictionary that maps input numbers to any outputs. 98 | 99 | Can use these to reuse commands or outputs, useful if an output takes a 100 | long time to compute. 101 | 102 | **NB: not everything gives an Out value (e.g. a function call that 103 | returns None).** 104 | 105 | ### Underscore shortcuts 106 | 107 | `_` access last output (also works in standard Python shell). 108 | `__`access penultimate output. 109 | `___` access propenultimate output. 110 | 111 | `_X` is a shortcut for `Out[X]`. 112 | 113 | ### Suppress output 114 | 115 | Use a semicolon at end of command. Either, e.g. useful for plotting 116 | commands, or to allow the output to be deallocated. 
117 | 118 | Doesn't display and doesn't get added to `Out`. 119 | 120 | ### Related magic commands 121 | 122 | `%history -n 1-4`: print first four inputs. 123 | 124 | `%rerun`: rerun some portion of command history. 125 | 126 | `%save`: save some portion of command history to file. 127 | 128 | ## IPython and the system command shell 129 | 130 | ### Running commands 131 | 132 | Anything following `!` on a line executed by the system command shell, 133 | not IPython. 134 | 135 | ### Passing data from system shell to IPython 136 | 137 | Can use this to interact with IPython, e.g. `contents = !ls`. Such a 138 | "list" isn't a Python list, but a special IPython shell type; these have 139 | `grep`, `fields` methods, and `s`, `n` and `p` properties to search, 140 | filter and display results. 141 | 142 | ### Passing data from IPython to system shell 143 | 144 | Using `{varname}` substitutes that Python variable's value into the 145 | command. 146 | 147 | `message = "hello from Python"` 148 | 149 | and then you can: 150 | 151 | `!echo {message}` 152 | 153 | Can't use `!cd` to navigate, because commands are in a subshell. Need to 154 | use `%cd` or can even just do `cd` which is an `automagic` function. Can 155 | toggle this behaviour with `%automagic`. 156 | 157 | Other shell-like magic functions: `%cat`, `%cp`, `%env`, `%ls`, `%man`, 158 | `%mkdir`, `%more`, `%mv`, `%pwd`, `%rm`, `%rmdir`. These can all be used 159 | without `%` if `automagic` is enabled. 160 | 161 | ## Errors and debugging 162 | 163 | ### Controlling exceptions: `%xmode` 164 | 165 | x is for *exception*. Changes reporting: 166 | 167 | `%xmode Plain` (less information) 168 | `%xmode Context` (default) 169 | `%xmode Verbose` (more information, displays arguments to functions) 170 | 171 | ### Debugging: `ipdb` 172 | 173 | `ipdb` is the IPython version of `pdb`, the Python debugger. 174 | 175 | Using `%debug` immediately following an exception opens a debugging 176 | prompt at the point of the exception. 177 | 178 | In the `ipdb>` prompt, can do `up` or `down` to move through the stack 179 | and explore variables there. 180 | 181 | `%pdb on` enables the debugger by default when an exception is raised. 182 | 183 | #### Debugger commands 184 | 185 | `list` Show current location in file. 186 | `h(elp)` Show list of commands, or get help on current command. 187 | `q(uit)` Quit debugger and program. 188 | `c(ontinue)` Quit debugger, continue program. 189 | `n(ext)` Continue until next line in current function is reached 190 | or it returns; called functions are executed without 191 | stopping. 192 | `` Repeat previous command. 193 | `p(rint)` Print variables. 194 | `s(tep)` Continue, but stop as soon as possible (whether in a 195 | called function or in current function). 196 | `r(eturn)` Continue until function returns. 197 | 198 | ### Stepping through code 199 | 200 | `%run -d` your script, and then use `next` to move through lines of 201 | code. 202 | 203 | ## Profiling and timing code 204 | 205 | ### Timing lines of code 206 | 207 | `%timeit` (line) and `%%timeit` (cell). 208 | 209 | By default, these repeat the code. 210 | 211 | Repeating sometimes is misleading, e.g. sorting a sorted list. 212 | 213 | Instead, can use `%time` to run once. 214 | 215 | Also: `%timeit` prevents system calls from interfering with timing. 216 | 217 | ### Profiling a statement: `%prun` 218 | 219 | e.g. `%prun ` 220 | 221 | Tells you how long programs spends in particular function calls. 
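For a concrete, made-up illustration (the function name `sum_of_lists` and its body are arbitrary choices for this sketch, not from the original), profiling a toy workload might look like this, with the magics typed at the IPython prompt:

```python
def sum_of_lists(n):
    """Toy workload: build several throwaway lists and sum them."""
    total = 0
    for i in range(5):
        total += sum([j ^ (j >> i) for j in range(n)])
    return total

# At the IPython prompt (magics don't run in a plain Python script):
# %timeit sum_of_lists(10000)      # repeated timing of the whole call
# %prun sum_of_lists(1000000)      # per-function breakdown of where time goes
```

The `%prun` report lists each function called, how many times it was called, and the total and cumulative time spent in it.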
222 | 223 | ### Line-by-line profiling: `%lprun` 224 | 225 | This requires the `line_profiler` extension: 226 | 227 | `pip install line_profiler` 228 | 229 | and then load this extension: 230 | 231 | `%load_ext line_profiler` 232 | 233 | then: 234 | 235 | `%lprun -f ` 236 | 237 | to profile the code. Shows you specific lines and how long the program 238 | spends completing them. 239 | 240 | ### Memory usage: `%memit` and `%mprun` 241 | 242 | This requires the `memory_profiler` extension: 243 | 244 | `pip install memory_profiler` 245 | 246 | and then load this extension: 247 | 248 | `%load_ext memory_profiler` 249 | 250 | then: 251 | 252 | `%memit ` (like `%timeit`) 253 | 254 | or: 255 | 256 | `%mprun ` (like `%prun`, line-by-line) 257 | 258 | to profile the code. 259 | 260 | `%mprun` requires a function in a file, but can use `%%file ` 261 | to create a new file with content. 262 | -------------------------------------------------------------------------------- /Chapter_2_NumPy.md: -------------------------------------------------------------------------------- 1 | # Chapter 2: NumPy 2 | 3 | ## Why NumPy? 4 | 5 | Provides efficient storage and operations on homogeneous arrays. 6 | 7 | NumPy arrays are homogeneous multidimensional arrays; the basic values (i.e. 8 | values that aren't themselves arrays) inside arrays all have to be the same 9 | type, and arrays can contain arrays. 10 | 11 | Standard Python lists are flexible: allow hetereogeneous data to be 12 | stored, but at a cost of more information needed and more indirection 13 | required. 14 | 15 | Python 3 has the array module for dense, homogeneous arrays to be stored 16 | efficiently. What NumPy's `ndarray` type gives you over these array 17 | objects is efficient operations too. 18 | 19 | ## Importing 20 | 21 | Normally as: 22 | 23 | ```python 24 | import numpy as np 25 | ``` 26 | 27 | ## Creating arrays 28 | 29 | ### From Python lists 30 | 31 | ```python 32 | >>> np.array([1, 4, 2, 5, 3]) 33 | array([1, 4, 2, 5, 3]) 34 | ``` 35 | 36 | **NumPy arrays contain values of the same type: therefore may get implicit 37 | conversion if possible.** 38 | 39 | For example, for a mixed integer, floating point array: 40 | 41 | ```python 42 | >>> np.array([3.14, 4, 2, 3]) 43 | array([ 3.14, 4. , 2. , 3. ]) 44 | ``` 45 | 46 | Can explicitly set type with `dtype`: 47 | 48 | ```python 49 | >>> np.array([1, 2, 3, 4], dtype='float32') 50 | array([ 1., 2., 3., 4.], dtype=float32) 51 | ``` 52 | 53 | Can create multidimensional arrays, e.g. with a list of lists: 54 | 55 | ```python 56 | >>> np.array([range(i, i + 3) for i in [2, 4, 6]]) 57 | array([[2, 3, 4], 58 | [4, 5, 6], 59 | [6, 7, 8]]) 60 | ``` 61 | 62 | Inner lists are rows of the 2D array. 
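As a quick check of that claim (a minimal sketch, not from the original):

```python
import numpy as np

grid = np.array([[2, 3, 4], [4, 5, 6], [6, 7, 8]])
print(grid.shape)  # (3, 3): three rows, three columns
print(grid[0])     # [2 3 4] -- the first inner list became the first row
```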
63 | 64 | ### From NumPy directly 65 | 66 | 1D array of zeroes: 67 | 68 | ```python 69 | >>> np.zeros(10, dtype=int) 70 | array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) 71 | ``` 72 | 73 | 2D array, that's 3x5: 74 | 75 | ```python 76 | >>> np.ones((3, 5), dtype=float) 77 | array([[ 1., 1., 1., 1., 1.], 78 | [ 1., 1., 1., 1., 1.], 79 | [ 1., 1., 1., 1., 1.]]) 80 | ``` 81 | 82 | Same again, but with a specified value filling the array: 83 | 84 | ```python 85 | >>> np.full((3, 5), 3.14) 86 | array([[ 3.14, 3.14, 3.14, 3.14, 3.14], 87 | [ 3.14, 3.14, 3.14, 3.14, 3.14], 88 | [ 3.14, 3.14, 3.14, 3.14, 3.14]]) 89 | ``` 90 | 91 | Sequence starts at 0, ends at 20 (non-inclusive), steps by 2 (like `range()`): 92 | 93 | ```python 94 | >>> np.arange(0, 20, 2) 95 | array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18]) 96 | ``` 97 | 98 | Five values spaced between 0 and 1 evenly: 99 | 100 | ```python 101 | >>> np.linspace(0, 1, 5) 102 | array([ 0. , 0.25, 0.5 , 0.75, 1. ]) 103 | ``` 104 | 105 | Random values in 3x3 array: 106 | 107 | ```python 108 | >>> np.random.random((3, 3)) 109 | array([[ 0.99844933, 0.52183819, 0.22421193], 110 | [ 0.08007488, 0.45429293, 0.20941444], 111 | [ 0.14360941, 0.96910973, 0.946117 ]]) 112 | ``` 113 | 114 | Can also use `np.random.normal()`, `np.random.randint()`. 115 | 116 | Identity matrix: 117 | 118 | ```python 119 | >>> np.eye(3) 120 | array([[ 1., 0., 0.], 121 | [ 0., 1., 0.], 122 | [ 0., 0., 1.]]) 123 | ``` 124 | 125 | Uninitialized array, just contains whatever is currently in memory: 126 | 127 | ```python 128 | >>> np.empty(3) 129 | array([ 1., 1., 1.]) 130 | ``` 131 | 132 | ## Data types 133 | 134 | Can specify these as `dtypes` as strings, e.g. `int16` or as NumPy 135 | objects, e.g. `np.int16`. 136 | 137 | Boolean type, various (un)signed integers, floats and complex numbers. 138 | 139 | Also compound data types (see later). 140 | 141 | ## Working with arrays 142 | 143 | ### Array attributes 144 | 145 | `.ndim`: number of dimensions. 146 | `.shape`: size of dimensions. 147 | `.size`: total size of array. 148 | `.dtype`: data type of array. 149 | `.itemsize`: size in bytes of each array element. 150 | `.nbytes`: size in bytes of the array. 151 | 152 | ### Array indexing 153 | 154 | Can use Python-like indexing, square brackets, including negative 155 | indices. Counting starts at zero, as usual with Python indexing. 156 | 157 | For multidimensional arrays, can use comma-separated tuples of indices, 158 | 159 | e.g. `a[0, 0]`. 160 | 161 | ### Value assignment in arrays 162 | 163 | Can assign using e.g. `a[0, 0] = 3`. But, ensure you use the correct 164 | type, e.g. putting a float into a NumPy array with integer type, will 165 | convert it. 166 | 167 | ### Array slicing 168 | 169 | Can slice as in Python, `x[start:stop:step]`. 170 | 171 | Omitted values default to: 172 | 173 | * `start`, 0; 174 | * `stop`, size of dimension; 175 | * `step`, 1. 176 | 177 | Although, if `step` is negative, defaults for start and stop get 178 | reversed. Can use `a[::-1]` to reverse an array. 179 | 180 | #### Multidimensional array slicing 181 | 182 | Separate slices for each dimension with commas, e.g. `a[:2, :3]` gives the 183 | first two rows, and first three columns for a 2D array. 184 | 185 | ### Accessing rows or columns of 2D array 186 | 187 | Combine indexing and slicing, e.g. `a[:, 0]` gives the first column of 188 | `a` while `a[0, :]` is the first row. Can omit empty slice for a row, 189 | e.g. `a[0]`. 190 | 191 | ### Slices are views, not copies! 
192 | 193 | Changing a value of a NumPy array slice changes the original array. 194 | 195 | #### Copying arrays 196 | 197 | If you really want a copy, can use `.copy()` on the slice. 198 | 199 | ### Array reshaping 200 | 201 | Change a 1D array to a 3x3 grid: 202 | 203 | ```python 204 | grid = np.arange(1, 10).reshape((3, 3)) 205 | print(grid) 206 | [[1 2 3] 207 | [4 5 6] 208 | [7 8 9]] 209 | ``` 210 | 211 | Reshaped array has to have same size as original. May be a view of the 212 | initial array, where possible. 213 | 214 | 1D array into 2D row or matrix, can use reshape such as here to change 215 | an array into a row vector. 216 | 217 | ```python 218 | >>> x = np.array([1, 2, 3]) 219 | >>> x.reshape((1, 3)) 220 | array([[1, 2, 3]]) 221 | ``` 222 | 223 | Alternatively, can use `np.newaxis`: 224 | 225 | ```python 226 | >>> x[np.newaxis, :] 227 | array([[1, 2, 3]]) 228 | ``` 229 | 230 | and with columns: 231 | 232 | ```python 233 | >>> x[:, np.newaxis] # equivalent to x.reshape((3, 1)) 234 | array([[1], 235 | [2], 236 | [3]]) 237 | ``` 238 | 239 | ### Array concatenation 240 | 241 | `np.concatenate()` takes a tuple or list of arrays as first argument. 242 | 243 | Can join multiple arrays. 244 | 245 | Also, for multidimensional arrays. By default, it concatenates along the 246 | first axis, but can specify an axis to concatenate along (zero-indexed). 247 | 248 | `np.vstack()`, `np.hstack()` and `np.dstack()` can be clearer for arrays of 249 | mixed dimensions. 250 | 251 | ### Array splitting 252 | 253 | `np.split()`, `np.hsplit()`, `np.vsplit()`. 254 | 255 | Pass a list of indices giving split points. 256 | 257 | ## Computation on arrays 258 | 259 | ### Universal functions (ufuncs) 260 | 261 | Computation on arrays can be fast or slow. It is fast when using 262 | vectorised operations, via NumPy's ufuncs. 263 | 264 | If handling values individually, may think to use a loop. However, slow 265 | due to overhead of type checking and looking up the correct function to 266 | use for the type. If we knew the type before the code executes, we could 267 | compute this faster as we could skip these steps. 268 | 269 | Instead, use ufuncs which are optimised. 270 | 271 | Perform operation on array instead, and this is applied to every 272 | element, e.g. do `1.0/myarray`, not iterate through `myarray` and 273 | compute the `1.0/myarray[i]` each time. 274 | 275 | Can operate on two arrays, not just an individual value and an array: 276 | 277 | ```python 278 | >>> np.arange(5) / np.arange(1, 6) 279 | array([ 0. , 0.5 , 0.66666667, 0.75 , 0.8 ]) 280 | ``` 281 | 282 | Also on multidimensional arrays: 283 | 284 | ```python 285 | >>> x = np.arange(9).reshape((3, 3)) 286 | >>> 2 ** x 287 | array([[ 1, 2, 4], 288 | [ 8, 16, 32], 289 | [ 64, 128, 256]]) 290 | ``` 291 | 292 | **If looping through an array to compute something, consider whether 293 | there's an appropriate ufunc instead.** 294 | 295 | Both unary and binary ufuncs exist, i.e. operating on one or two arrays. 296 | 297 | #### Mathematical functions 298 | 299 | ##### Arithmetic 300 | 301 | `+`, `*`, `-`, `/`, `//` (integer division), `-` (negation), `**` 302 | (exponentiation), `%` (modulus). 303 | 304 | These are wrappers around NumPy functions, e.g. when you use `+` with an 305 | array, you are really using `np.add()`. 306 | 307 | ##### Absolute value 308 | 309 | Also, `abs()` is really `np.absolute()` or `np.abs()`; also works with 310 | complex numbers. 
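For example, with complex input `np.abs` returns the magnitude (the values below are arbitrary):

```python
import numpy as np

z = np.array([3 - 4j, 4 + 3j, 2 + 0j])
print(np.abs(z))   # [5. 5. 2.]
```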
311 | 312 | ##### Trigonometric functions 313 | 314 | `np.sin()`, `np.cos()`, `np.tan()` and inverse functions: `np.arcsin()`, 315 | `np.arccos()`, `np.arctan()`. 316 | 317 | ##### Exponents and logarithms 318 | 319 | `np.exp()` uses e as base, `np.exp2()` uses 2 as base and 320 | `np.power()` lets you specify a base (or bases, in an array). 321 | 322 | `np.log()` is natural logarithm, can also have base 2, `np.log2()` or 323 | base 10, `np.log10()`. 324 | 325 | For small inputs, `np.expm1()` and `np.log1p()` to maintain greater 326 | precision. 327 | 328 | ##### And more 329 | 330 | More in NumPy itself, also lots more specialised functions for maths in 331 | `scipy.special`. 332 | 333 | #### Specifying array output 334 | 335 | Can skip writing array values into a temporary array and then copying 336 | into the target, by directly writing into the target. 337 | 338 | ```python 339 | >>> x = np.arange(5) 340 | >>> y = np.empty(5) 341 | >>> np.multiply(x, 10, out=y) 342 | ``` 343 | 344 | Can use with array views too, e.g.: 345 | 346 | ```python 347 | >>> x = np.arange(5) 348 | >>> y = np.zeros(10) 349 | >>> np.power(2, x, out=y[::2]) 350 | ``` 351 | 352 | If we'd assigned as `y[::2] = 2 ** x`, this would produce a temporary 353 | array, then copy the values to `y`. 354 | 355 | #### Aggregates 356 | 357 | ##### `reduce` 358 | 359 | Applies an operation repeatedly to elements of array until there is a 360 | single result. 361 | 362 | ```python 363 | >>> x = np.arange(1, 6) 364 | >>> np.add.reduce(x) 365 | 15 366 | ``` 367 | 368 | ##### `accumulate` 369 | 370 | Applies an operation repeatedly to elements of array and stores each 371 | result in turn. 372 | 373 | ```python 374 | >>> x = np.arange(1, 6) 375 | >>> np.multiply.accumulate(x) 376 | array([ 1, 2, 6, 24, 120]) 377 | ``` 378 | 379 | However, note that NumPy has `np.sum()`, `np.prod()`, `np.cumsum()`, 380 | `np.cumprod()` for these specific cases. 381 | 382 | #### Outer products: `outer` 383 | 384 | Computes output of all pairs of two inputs. 385 | 386 | ```python 387 | >>> x = np.arange(1, 6) 388 | >>> np.multiply.outer(x, x) 389 | array([[ 1, 2, 3, 4, 5], 390 | [ 2, 4, 6, 8, 10], 391 | [ 3, 6, 9, 12, 15], 392 | [ 4, 8, 12, 16, 20], 393 | [ 5, 10, 15, 20, 25]]) 394 | ``` 395 | 396 | 397 | ### Aggregation functions 398 | 399 | #### Summation 400 | 401 | Can use Python's `sum()`, but this is slower than NumPy's `np.sum()` and 402 | these two functions behave differently. 403 | 404 | #### Minimum and maximum 405 | 406 | Again, `np.min()` and `np.max()` are faster than `min()` and `max()` on 407 | NumPy arrays. 408 | 409 | Can also use method forms of these on the array itself: 410 | `my_array.sum()`. 411 | 412 | #### Multidimensional arrays 413 | 414 | By default, aggregation occurs over the whole array. 415 | 416 | However, can specify an axis along which to apply aggregation, e.g. 417 | `my_array.min(axis=0)`. 418 | 419 | ```python 420 | >>> M = np.random.random((3, 4)) 421 | >>> M 422 | [[ 0.8967576 0.03783739 0.75952519 0.06682827] 423 | [ 0.8354065 0.99196818 0.19544769 0.43447084] 424 | [ 0.66859307 0.15038721 0.37911423 0.6687194 ]] 425 | >>> M.min(axis=0) 426 | array([ 0.66859307, 0.03783739, 0.19544769, 0.06682827]) 427 | >>> M.max(axis=1) 428 | array([ 0.8967576 , 0.99196818, 0.6687194 ]) 429 | ``` 430 | 431 | For a 2D array, can think of the rows as a vertical axis, 0, and the columns as 432 | a horizontal axis, 1. 433 | 434 | This is slightly confusing terminology though. 
When you move "along the rows", 435 | along axis 0, you're really moving through columns, aggregating into one row. 436 | Could think of as moving along the rows for each column. 437 | 438 | Likewise, when you move "along columns", along axis 1, you're really moving 439 | across through a row, aggregating into one column. Could think of as moving 440 | along the columns for each row. 441 | 442 | If we move along the 0 axis, we move down the rows and compute the aggregation 443 | for each column in the row. 444 | 445 | If we move along the 1 axis, we move along the columns and compute the 446 | aggregation for each row in the column. 447 | 448 | The book describes this keyword `axis` as specifying "the dimension of the 449 | array that will be collapsed". This is a simpler way of thinking about this, 450 | especially for higher dimensional arrays. 451 | 452 | Also note that when a dimension is "collapsed" in this way, the resulting array 453 | actually has one fewer dimension; e.g. aggregating along columns for a 2D 454 | array, we don't get a 2D array with just one item in each row; we actually get 455 | an equivalent 1D array containing all of the items instead (because we don't 456 | need any extent along the now removed dimension any longer). 457 | 458 | (I guess that this is a nice feature because it means you get a consistent 459 | shape of output for this aggregation regardless of the axis you aggregated 460 | along.) 461 | 462 | #### Other functions 463 | 464 | Number of aggregation functions: 465 | 466 | `np.sum()`, `np.prod()`, `np.mean()`, `np.std()`, `np.var()`, 467 | `np.min()`, `np.max()`, `np.argmin()`, `np.argmax()`, `np.median()`, 468 | `np.percentile()` 469 | 470 | also with alternative `NaN` safe versions that ignore missing values. 471 | 472 | For evaluating whether elements are true: 473 | 474 | `np.any()`, `np.all()` 475 | 476 | ## Broadcasting 477 | 478 | For arrays of the same size, binary operations are applied 479 | element-by-element. 480 | 481 | ```python 482 | >>> a = np.array([0, 1, 2]) 483 | >>> b = np.array([5, 5, 5]) 484 | >>> a + b 485 | array([5, 6, 7]) 486 | ``` 487 | 488 | Broadcasting extends this idea: it is a set of rules for applying ufuncs 489 | to arrays of different sizes. 490 | 491 | e.g. add a scalar to an array: 492 | 493 | ```python 494 | >>> a + 5 495 | array([5, 6, 7]) 496 | ``` 497 | 498 | Can imagine that the value 5 is duplicated into an array [5, 5, 5] and 499 | then added to `a`, though this doesn't really happen (and is therefore 500 | an advantage of how NumPy works). 501 | 502 | Can extend this to higher dimension arrays; add a 1D array to a 2D array: 503 | 504 | ```python 505 | >>> M = np.ones((3, 3)) 506 | >>> M + a 507 | array([[ 1., 2., 3.], 508 | [ 1., 2., 3.], 509 | [ 1., 2., 3.]]) 510 | ``` 511 | 512 | `a` now gets stretched, or broadcast, across the second dimension to match M. 513 | 514 | Sometimes *both* arrays can be broadcast this way. 515 | 516 | ```python 517 | >>> a = np.arange(3) 518 | >>> b = np.arange(3)[:, np.newaxis] 519 | >>> print(a) 520 | [0 1 2] 521 | >>> print(b) 522 | [[0] 523 | [1] 524 | [2]] 525 | >>> a + b 526 | array([[0, 1, 2], 527 | [1, 2, 3], 528 | [2, 3, 4]]) 529 | ``` 530 | 531 | Both `a` and `b` are stretched to a common shape, `a` is stretched along rows, 532 | while `b` is stretched along columns, the 0, 1, 2 values can be thought 533 | of as being replicated (although they aren't). Both are broadcast to arrays 534 | with three rows and columns, and then the operation is applied. 
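To make the "stretching" visible, `np.broadcast_to` can materialise the broadcast shapes explicitly. NumPy never actually allocates these copies when evaluating `a + b`; this is purely illustrative:

```python
import numpy as np

a = np.arange(3)                   # shape (3,)
b = np.arange(3)[:, np.newaxis]    # shape (3, 1)

print(np.broadcast_to(a, (3, 3)))  # each row is a copy of a
print(np.broadcast_to(b, (3, 3)))  # each column is a copy of b
```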
535 | 536 | ### Broadcasting rules 537 | 538 | When arrays interact, broadcasting follows these rules: 539 | 540 | 1. If two arrays differ in their number of dimensions, the shape of the 541 | one with fewer dimensions is left padded with ones. 542 | 2. If the shape of the two arrays doesn't match in any dimension, the 543 | array with shape equal to 1 in that dimension is stretched to match 544 | the other shape. 545 | 3. If in any dimension the sizes disagree and neither is equal to 1, 546 | an error is raised. 547 | 548 | Applies to all binary ufuncs. 549 | 550 | ### Examples 551 | 552 | #### Adding 2D array to 1D array 553 | 554 | ```python 555 | >>> M = np.ones((2, 3)) 556 | >>> a = np.arange(3) 557 | ``` 558 | 559 | `M.shape` = `(2, 3)` 560 | `a.shape` = `(3,)` 561 | 562 | Rule 1: `a` has fewer dimensions, left pad with 1: 563 | 564 | `a.shape` = `(1, 3)` 565 | 566 | Rule 2: the first dimensions of `a` and `M` don't match, so stretch `a` 567 | to match `M` as `a` has shape 1 in first dimension. 568 | 569 | `a.shape` = `(2, 3)` and now these arrays can be added. Effectively, can 570 | think of `a` getting stretched along rows, so the `[0, 1, 2]` gets 571 | duplicated. 572 | 573 | ```python 574 | >>> M + a 575 | array([[ 1., 2., 3.], 576 | [ 1., 2., 3.]]) 577 | ``` 578 | 579 | #### Broadcasting both arrays 580 | 581 | ```python 582 | >>> a = np.arange(3).reshape((3, 1)) 583 | >>> b = np.arange(3) 584 | ``` 585 | 586 | `a.shape = (3, 1)` 587 | `b.shape = (3,)` 588 | 589 | Rule 1: `b` has fewer dimensions, left pad with 1: 590 | 591 | `b.shape = (1, 3)` 592 | 593 | Rule 2: stretch array dimensions that are 1 to match the corresponding 594 | dimension of the other array. Here, *both* arrays are stretched. 595 | 596 | `a.shape = (3, 3)` 597 | `b.shape = (3, 3)` 598 | 599 | Now we can add them. 600 | 601 | ```python 602 | >>> a + b 603 | array([[0, 1, 2], 604 | [1, 2, 3], 605 | [2, 3, 4]]) 606 | ``` 607 | 608 | #### Where broadcasting fails 609 | 610 | ```python 611 | >>> M = np.ones((3, 2)) 612 | >>> a = np.arange(3) 613 | ``` 614 | 615 | `M.shape = (3, 2)` 616 | `a.shape = (3,)` 617 | 618 | Rule 1: left pad `a`. 619 | 620 | `a.shape = (1, 3)` 621 | 622 | Rule 2: stretch first dimension of `a` to match `M`: 623 | 624 | `a.shape = (3, 3)` 625 | 626 | Rule 3: here, the final shapes of `a` and `M` don't match, so an error occurs 627 | if you try and do `M + a`. 628 | 629 | NB: these arrays *would* be compatible if only we padded `a` on the right, 630 | instead of the left (because `a.shape` would be `(3, 1)` which could then be 631 | stretched to `(3, 2)` which matches `m.shape`). However, if you wish to do 632 | this, you have to do it explicitly first; broadcasting's rules don't allow 633 | this: 634 | 635 | ```python 636 | >>> a[:, np.newaxis].shape 637 | (3, 1) 638 | >>> M + a[:, np.newaxis] 639 | array([[ 1., 1.], 640 | [ 2., 2.], 641 | [ 3., 3.]]) 642 | ``` 643 | 644 | #### Uses of broadcasting 645 | 646 | Examples: 647 | 648 | * centring an array of observations, where each contains multiple values. 649 | Can calculate the mean of these values, which is a 1D array, and subtract 650 | from our 2D array of observations, and the subtraction occurs via 651 | broadcasting the 1D array. 652 | 653 | * calculating/plotting a 2D function; can create a range of values in a 1D 654 | array and another range of values in a 2D array with one column per row as a 655 | series of x and y values to calculate z for, then use broadcasted values to 656 | calculate this for each combination of x and y. 
657 | 658 | * likewise, a times table is another example, can have 1-10 in a 1D array, 659 | 1-10 in a 2D array with a single column, and then do the multiplication by 660 | broadcasting both arrays. (So similar function here to outer product.) 661 | 662 | ## Comparisons, masks and Boolean logic 663 | 664 | ### Comparison operators as ufuncs 665 | 666 | Comparison operators are implemented as ufuncs. 667 | 668 | * `<` as `np.less` 669 | * `>` as `np.greater` 670 | * `<=` as `np.less_equal` 671 | * `>=` as `np.greater_equal` 672 | * `==` as `np.equal` 673 | * `!=` as `np.not_equal` 674 | 675 | The output of these is an array with Boolean data type. 676 | 677 | ```python 678 | >>> x = np.array([1, 2, 3, 4, 5]) 679 | >>> x < 3 680 | array([ True, True, False, False, False], dtype=bool) 681 | ``` 682 | 683 | Can also do element-wise comparison of arrays and use compound 684 | expressions: 685 | 686 | ```python 687 | >>> (2 * x) == (x ** 2) 688 | array([False, True, False, False, False], dtype=bool) 689 | ``` 690 | 691 | Work on any size or shape of array. 692 | 693 | ### Working with Boolean arrays 694 | 695 | `np.count_nonzero()` for counting the number of `True` entries in a 696 | Boolean array. 697 | 698 | ```python 699 | >>> y = array([[5, 0, 3, 3], 700 | [7, 9, 3, 5], 701 | [2, 4, 7, 6]]) 702 | >>> np.count_nonzero(y < 6) 703 | 8 704 | ``` 705 | 706 | Can also use `np.sum()` where `True` is interpreted as `1` and `False` 707 | as `0`. The advantage of this is that you can do this `sum` along an 708 | axis, e.g. to find the counts along rows or columns. 709 | 710 | ```python 711 | >>> np.sum(y < 6, axis=1) 712 | array([4, 2, 2]) 713 | ``` 714 | 715 | This calculates the sum in each row. 716 | 717 | For checking whether any or all values are `True`, can use `np.any()` or 718 | `np.all()`. 719 | 720 | ```python 721 | >>> np.all(y < 10) 722 | True 723 | ``` 724 | 725 | and can use both `np.any()` and `np.all()` along axes. 726 | 727 | ```python 728 | >>> np.all(y < 8, axis=1) 729 | array([ True, False, True], dtype=bool) 730 | ``` 731 | 732 | **Ensure you use the NumPy `np.sum()`, `np.any()` and `np.all()` 733 | functions, not the Python built-ins `sum()`, `any()` and `all()` as 734 | these may not behave as expected.** 735 | 736 | ### Boolean operators 737 | 738 | Can also use Boolean operators. 739 | 740 | Use Python's bitwise logic operators: `&` (bitwise and), `|` (bitwise 741 | or), `^` (bitwise exclusive or), `~` (complement). 742 | 743 | These have the equivalent ufuncs: 744 | 745 | * `&` `np.bitwise_and()` 746 | * `|` `np.bitwise_or()` 747 | * `^` `np.bitwise_xor()` 748 | * `~` `np.bitwise.not()` 749 | 750 | Can combine these on an array representing rainfall data: 751 | 752 | ```python 753 | >>> np.sum((inches > 0.5) & (inches < 1)) 754 | ``` 755 | 756 | Need the parentheses, otherwise `0.5 & inches` gets evaluated first and 757 | results in an error. 758 | 759 | This *A AND B* condition is equivalent to: 760 | 761 | ```python 762 | >>> np.sum(~( (inches <= 0.5) | (inches >= 1) )) 763 | ``` 764 | 765 | (Since this represents *NOT (NOT A OR NOT B)* and *NOT A OR NOT B* is 766 | the same as *NOT (A AND B)*.) 767 | 768 | #### Why bitwise operators? 769 | 770 | `and`, `or` operate on *entire objects*, while `&` and `|` refer to 771 | *bits wihin an object*. 772 | 773 | When we have Boolean values in a NumPy array, effectively this is a 774 | series of bits where `1` is `True` and `0` is `False`. 
We therefore can 775 | use the bitwise operators to carry out these operations, as these are 776 | applied to individual bits within the array. 777 | 778 | On the other hand, the Boolean value of an array with more than one 779 | object is ambiguous. 780 | 781 | ### Boolean arrays as masks 782 | 783 | Can use Boolean arrays as masks, to select subsets of data. 784 | 785 | ```python 786 | >>> y < 5 787 | array([[False, True, True, True], 788 | [False, False, True, False], 789 | [ True, True, False, False]], dtype=bool) 790 | ``` 791 | 792 | This gives a Boolean array. We can use this array as an index value; this is 793 | a *masking* operation. 794 | 795 | ```python 796 | >>> y[y < 5] 797 | array([0, 3, 3, 3, 2, 4]) 798 | ``` 799 | 800 | The result is a 1D array that contains the values that meet the 801 | condition, i.e. where the mask array is `True`. 802 | 803 | ## Fancy indexing 804 | 805 | So far, indexing via indices, slices and Boolean masks. 806 | 807 | Fancy indexing is where arrays of indices are passed, for accessing or 808 | modifying complicated subsets of array values. 809 | 810 | ### Simple uses 811 | 812 | ```python 813 | >>> import numpy as np 814 | >>> rand = np.random.RandomState(42) 815 | >>> x = rand.randint(100, size=10) 816 | [51 92 14 71 60 20 82 86 74 74] 817 | ``` 818 | 819 | Could access different elements by individually accessing indices: 820 | 821 | ```python 822 | >>> [x[3], x[7], x[4]] 823 | [71, 86, 60] 824 | ``` 825 | 826 | or by passing a list or array of indices: 827 | 828 | ```python 829 | >>> x[[3, 7, 4]] 830 | array([71, 86, 60]) 831 | ``` 832 | 833 | When using fancy indexing, the shape of the result reflects the shape of 834 | the index array, not the shape of the array being indexed: 835 | 836 | ```python 837 | >>> ind = np.array([[3, 7], 838 | [4, 5]]) 839 | >>> x[ind] 840 | array([[71, 86], 841 | [60, 20]]) 842 | ``` 843 | 844 | ### With multidimensional arrays 845 | 846 | ```python 847 | >>> X = np.arange(12).reshape((3, 4)) 848 | >>> X 849 | array([[ 0, 1, 2, 3], 850 | [ 4, 5, 6, 7], 851 | [ 8, 9, 10, 11]]) 852 | ``` 853 | 854 | ```python 855 | >>> row = np.array([0, 1, 2]) 856 | >>> col = np.array([2, 1, 3]) 857 | >>> X[row, col] 858 | array([ 2, 5, 11]) 859 | ``` 860 | 861 | The first index refers to the row, the second index to the column, so we 862 | effectively get the `X` values `[0, 2]`, `[1, 1]`, `[2, 3]`. 863 | 864 | Pairing of indices also follows broadcasting rules. 865 | 866 | ```python 867 | >>> X[row[:, np.newaxis], col] 868 | array([[ 2, 1, 3], 869 | [ 6, 5, 7], 870 | [10, 9, 11]]) 871 | ``` 872 | 873 | Each value from `row` gets matched with a value from `col`; both these 874 | arrays are broadcast, with the `row` column getting duplicated left to 875 | right and the `col` rows getting duplicated top to bottom. 876 | 877 | So, we get values `[0, 2]`, `[0, 1]`, `[0, 3]`, then `[1, 2]`, `[1, 1]`, 878 | `[1, 3]` etc. 879 | 880 | ### Combined indexing 881 | 882 | Can combine these different ways of indexing. 
883 | 884 | Fancy and simple indices: 885 | 886 | ```python 887 | >>> X[2, [2, 0, 1]] 888 | array([10, 8, 9]) 889 | ``` 890 | 891 | Fancy indexing and slices: 892 | 893 | ```python 894 | >>> X[1:, [2, 0, 1]] 895 | array([[ 6, 4, 5], 896 | [10, 8, 9]]) 897 | ``` 898 | 899 | Fancy indexing and masking: 900 | 901 | ```python 902 | >>> mask = np.array([1, 0, 1, 0], dtype=bool) 903 | >>> X[row[:, np.newaxis], mask] 904 | array([[ 0, 2], 905 | [ 4, 6], 906 | [ 8, 10]]) 907 | ``` 908 | 909 | ### Modifying values 910 | 911 | Simple example where you set values of an array according to an array of 912 | indices. 913 | 914 | ```python 915 | >>> x = np.arange(10) 916 | >>> i = np.array([2, 1, 8, 4]) 917 | >>> x[i] = 99 918 | >>> x 919 | [ 0 99 99 3 99 5 6 7 99 9] 920 | ``` 921 | 922 | Can use any assignment-type operator. 923 | 924 | ```python 925 | >>> x[i] -= 10 926 | >>> x 927 | [ 0 89 89 3 89 5 6 7 89 9] 928 | ``` 929 | 930 | Repeated indices can give unexpected behaviour. 931 | 932 | ```python 933 | >>> x = np.zeros(10) 934 | >>> x[[0, 0]] = [4, 6] 935 | >>> x 936 | [ 6. 0. 0. 0. 0. 0. 0. 0. 0. 0.] 937 | ``` 938 | 939 | What happens is that the `x[0] = 4` assignment happens first, then `x[0] = 6`. 940 | 941 | But: 942 | 943 | ```python 944 | >>> i = [2, 3, 3, 4, 4, 4] 945 | >>> x[i] += 1 946 | >>> x 947 | array([ 6., 0., 1., 1., 1., 0., 0., 0., 0., 0.]) 948 | ``` 949 | 950 | Incrementing doesn't occur repeatedly. 951 | 952 | `x[i] += 1` means `x[i] = x[i] + 1`. 953 | 954 | The evaluation of `x[i] + 1` happens first, then the result is assigned 955 | to each index. So, the assignment happens repeatedly, not the increment. 956 | 957 | If you do want repeated incrementing, then you can do: 958 | 959 | ```python 960 | >>> x = np.zeros(10) 961 | >>> np.add.at(x, i, 1) 962 | >>> x 963 | [ 0. 0. 1. 2. 3. 0. 0. 0. 0. 0.] 964 | ``` 965 | 966 | Here, `at()` applies the operator (`add`) at the indices `i` with the 967 | value 1. 968 | 969 | ## Sorting arrays 970 | 971 | `np.sort()` uses quicksort, with mergesort and heapsort available. 972 | 973 | To get a new sorted array: 974 | 975 | ```python 976 | >>> x = np.array([2, 1, 4, 3, 5]) 977 | >>> np.sort(x) 978 | array([1, 2, 3, 4, 5]) 979 | ``` 980 | 981 | Sort in-place: 982 | 983 | ```python 984 | >>> x.sort() 985 | ``` 986 | 987 | `np.argsort()` returns indices of sorted elements as an array. Could use 988 | this with fancy indexing to get the sorted array. 989 | 990 | ```python 991 | >>> x = np.array([2, 1, 4, 3, 5]) 992 | >>> i = np.argsort(x) 993 | >>> print(i) 994 | [1 0 3 2 4] 995 | >>> x[i] 996 | array([1, 2, 3, 4, 5]) 997 | ``` 998 | 999 | ### Sort along rows or columns 1000 | 1001 | Use `axis`. 1002 | 1003 | ```python 1004 | >>> rand = np.random.RandomState(42) 1005 | >>> X = rand.randint(0, 10, (4, 6)) 1006 | >>> print(X) 1007 | [[6 3 7 4 6 9] 1008 | [2 6 7 4 3 7] 1009 | [7 2 5 4 1 7] 1010 | [5 1 4 0 9 5]] 1011 | >>> np.sort(X, axis=0) 1012 | array([[2, 1, 4, 0, 1, 5], 1013 | [5, 2, 5, 4, 3, 7], 1014 | [6, 3, 7, 4, 6, 7], 1015 | [7, 6, 7, 4, 9, 9]]) 1016 | ``` 1017 | 1018 | ### Partial sorts by `np.partition` 1019 | 1020 | Finds the smallest `k` values in the array and returns a new array with the 1021 | smallest `k` values to the left of the partition and the remaining values to 1022 | the right, in arbitrary order: 1023 | 1024 | ```python 1025 | >>> x = np.array([7, 2, 3, 1, 6, 5, 4]) 1026 | >>> np.partition(x, 3) 1027 | array([2, 1, 3, 4, 6, 5, 7]) 1028 | ``` 1029 | 1030 | Can also use along an axis of a multidimensional array. 
1031 | 1032 | ```python 1033 | >>> np.partition(X, 2, axis=1) 1034 | array([[3, 4, 6, 7, 6, 9], 1035 | [2, 3, 4, 7, 6, 7], 1036 | [1, 2, 4, 5, 7, 7], 1037 | [0, 1, 4, 5, 9, 5]]) 1038 | ``` 1039 | 1040 | Also `np.argpartition()` available and this is analogous to `np.argsort()`. 1041 | 1042 | ## Structured arrays 1043 | 1044 | Often data can be represented by homogeneous values. Not always. NumPy 1045 | has some support for compound, heterogeneous values in arrays. For 1046 | simple cases, can use structured arrays and record arrays. But, for more 1047 | complex cases, better to use pandas instead. 1048 | 1049 | Strcutured arrays are arrays with compound data types. 1050 | 1051 | ```python 1052 | >>> name = ['Alice', 'Bob', 'Cathy', 'Doug'] 1053 | >>> age = [25, 45, 37, 19] 1054 | >>> weight = [55.0, 85.5, 68.0, 61.5] 1055 | >>> data = np.zeros(4, dtype={'names':('name', 'age', 'weight'), 1056 | 'formats':('U10', 'i4', 'f8')}) 1057 | >>> print(data.dtype) 1058 | [('name', '>> data['name'] = name 1067 | >>> data['age'] = age 1068 | >>> data['weight'] = weight 1069 | >>> print(data) 1070 | [('Alice', 25, 55.0) ('Bob', 45, 85.5) ('Cathy', 37, 68.0) 1071 | ('Doug', 19, 61.5)] 1072 | ``` 1073 | 1074 | Can identify values by index or name: 1075 | 1076 | ```python 1077 | >>> data['name'] 1078 | array(['Alice', 'Bob', 'Cathy', 'Doug'], 1079 | dtype='>> data[0] 1081 | ('Alice', 25, 55.0) 1082 | >>> data[-1]['name'] 1083 | 'Doug' 1084 | >>> data[data['age'] < 30]['name'] # Boolean masking 1085 | array(['Alice', 'Doug'], 1086 | dtype='>> np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')]) 1094 | dtype([('name', 'S10'), ('age', '>> np.dtype('S10,i4,f8') 1101 | dtype([('f0', 'S10'), ('f1', '` indicates little or big endian; the next character 1105 | indicates the type of data and the last character represents the size in bytes. 1106 | 1107 | Note that these NumPy structured array `dtypes` map directly to C 1108 | structure definitions. 1109 | 1110 | ### Record arrays 1111 | 1112 | `np.recarray` are record arrays, like structured arrays but can access fields 1113 | as attributes instead of dictionary keys. However, these fields can be slower 1114 | to access than in structured arrays, even when using dictionary keys. 1115 | -------------------------------------------------------------------------------- /Chapter_4_Matplotlib.md: -------------------------------------------------------------------------------- 1 | # Chapter 4: Matplotlib 2 | 3 | ## Visualisation with Matplotlib 4 | 5 | Built on NumPy arrays, and designed to work with the SciPy stack. 6 | 7 | Works with many operating systems and graphics backends, supporting many 8 | output types: a big strength of Matplotlib, and led to a large user and 9 | developer base, as well as it being prominent in scientific Python use. 10 | 11 | More recently, the interface and style of Matplotlib has begun to show 12 | their age. Newer tools like ggplot and ggvis in R, and web visualisation 13 | toolkits based on D3 and HTML5 canvas, make Matplotlib feel clunky and 14 | old-fashioned. But, Matplotlib does still have strengths in being 15 | well-tested and cross-platform. 16 | 17 | Recent versions of Matplotlib have made it relatively easy to set global 18 | plotting styles, and people have developed new packages that build on it 19 | to drive Matplotlib via cleaner APIs (e.g. Seaborn, ggpy, HoloViews, 20 | Altair and pandas itself). Even with wrappers like these, it is useful 21 | to dive into Matplotlib's syntax to adjust the final plot output. 
It may 22 | be that Matplotlib remains a vital tool for data visualisation, even if 23 | new tools mean users move away from using Matplotlib's API directly. 24 | 25 | ### General Matplotlib 26 | 27 | #### Importing Matplotlib 28 | 29 | Often use shorthand for importing Matplotlib, as for NumPy and pandas: 30 | 31 | ```python 32 | import matplotlib as mpl 33 | import matplotlib.pyplot as plt 34 | ``` 35 | 36 | The `plt` interface will be used most often here. 37 | 38 | #### Setting styles 39 | 40 | Use `plt.style` directive to choose an aesthetic style for figures: 41 | 42 | ```python 43 | plt.style.use('classic') 44 | ``` 45 | 46 | Here we set the `classic` style, which ensures that plots use the 47 | classic Matplotlib style. 48 | 49 | #### How to display plots 50 | 51 | Viewing Matplotlib plots depends on context. The best use of Matplotlib 52 | differs depending on how you use it. 53 | 54 | ##### Plotting from a script 55 | 56 | Here, `plt.show()` is what you want. `plt.show()` starts an event loop, 57 | looks for all currently active figure objects and opens one or more 58 | interactive windows that display your figure or figures. 59 | 60 | For example, you may have `myplot.py`: 61 | 62 | ```python 63 | import matplotlib.pyplot as plt 64 | import numpy as np 65 | 66 | x = np.linspace(0, 10, 100) 67 | 68 | plt.plot(x, np.sin(x)) 69 | plt.plot(x, np.cos(x)) 70 | 71 | plt.show() 72 | ``` 73 | 74 | You can run this script: 75 | 76 | ```sh 77 | $ python myplot.py 78 | ``` 79 | 80 | which will result in a window opening with your figure displayed. 81 | 82 | The `plt.show()` command does a lot: interacting with your system's 83 | interactive graphical backend. How this works can vary from system to 84 | system, and even from installation to installation. 85 | 86 | `plt.show()` should only be used once per Python session. Usually, it is 87 | used at the end of the script. Multiple `show()` commands can have 88 | unpredictable, backend-dependent behaviour, and should typically be 89 | avoided. 90 | 91 | ##### Plotting from an IPython shell 92 | 93 | It can be convenient to use Matplotlib interactively with an IPython 94 | shell. IPython can work well with Matplotlib if you use Matplotlib mode, 95 | by using the `%matplotlib` magic command: 96 | 97 | ```python 98 | In [1]: %matplotlib 99 | Using matplotlib backend as TkAgg 100 | 101 | In [2]: import matplotlib.pyplot as plt 102 | ``` 103 | 104 | From then on, any `plt` plot command will cause a figure window to open, 105 | and further commands can be run to update the plot. Some changes (such 106 | as modifying properties of lines that are already drawn) will not draw 107 | automatically: use `plt.draw()` to force an update. Using `plt.show()` 108 | in Matplotlib mode is not required. 109 | 110 | ##### Plotting from an IPython notebook 111 | 112 | The IPython notebook is a browser-based interactive data analysis tool 113 | that combines text, code, graphics, HTML elements and more into a single 114 | executable document. 115 | 116 | Plotting interactively within an IPython notebook can be done with the 117 | `%matplotlib` command and works in a similar way to the IPython shell. 118 | It is also possible to embed graphics directly in the notebook: 119 | 120 | * `%matplotlib notebook` leads to *interactive* plots embedded in the 121 | notebook. 122 | * `%matplotlib inline` leads to *static* images of plots embedded in the 123 | notebook. 
124 | 125 | After running `%matplotlib inline` in a notebook (once per 126 | kernel/session), any cell within the notebok that creates a plot will 127 | embed a PNG image of the resulting graphic. 128 | 129 | For example: 130 | 131 | ```python 132 | import numpy as np 133 | x = np.linspace(0, 10, 100) 134 | 135 | fig = plt.figure() 136 | plt.plot(x, np.sin(x), '-') 137 | plt.plot(x, np.cos(x), '--'); # NB: using a semi-colon suppresses the command output. 138 | ``` 139 | 140 | ##### Saving figures to file 141 | 142 | Matplotlib can save figures in a wide variety of formats using the 143 | `savefig()` command. For example, to save the previous figure as a PNG 144 | file: 145 | 146 | ```python 147 | fig.savefig('my_figure.png') 148 | ``` 149 | 150 | We now have the file in the current working directory shown by: 151 | 152 | ``` 153 | !ls -lh my_figure.png 154 | ``` 155 | 156 | To confirm it contains the figure, we can use the IPython `Image` object 157 | to display the file's contents: 158 | 159 | ```python 160 | from IPython.display import Image 161 | Image('my_figure.png') 162 | ``` 163 | 164 | In `savefig()`, the file format is inferred from the filename extension. 165 | Depending on the installed backends, many different file formats are 166 | available. The list of supported file types can be found by: 167 | 168 | ```python 169 | fig.canvas.get_supported_filetypes() 170 | ``` 171 | 172 | which shows output like: 173 | 174 | ```python 175 | {'eps': 'Encapsulated Postscript', 176 | 'jpeg': 'Joint Photographic Experts Group', 177 | 'jpg': 'Joint Photographic Experts Group', 178 | 'pdf': 'Portable Document Format', 179 | 'pgf': 'PGF code for LaTeX', 180 | 'png': 'Portable Network Graphics', 181 | 'ps': 'Postscript', 182 | 'raw': 'Raw RGBA bitmap', 183 | 'rgba': 'Raw RGBA bitmap', 184 | 'svg': 'Scalable Vector Graphics', 185 | 'svgz': 'Scalable Vector Graphics', 186 | 'tif': 'Tagged Image File Format', 187 | 'tiff': 'Tagged Image File Format'} 188 | ``` 189 | 190 | #### Dual interfaces 191 | 192 | A potentially confusing feature of Matplotlib is its dual interfaces: a 193 | convenient MATLAB-style state-based interface, and a more powerful 194 | object-oriented interface. 195 | 196 | ##### MATLAB-style interface 197 | 198 | Originally Matplotlib was written as a Python alternative for MATLAB users, 199 | and much of its syntax reflects that. The MATLAB-style tools are contained in 200 | the pyplot (`plt`) interface. For example, the following code may look familiar 201 | to MATLAB users: 202 | 203 | ```python 204 | plt.figure() # create a plot figure 205 | 206 | # create the first of two panels and set current axis 207 | plt.subplot(2, 1, 1) # (rows, columns, panel number) 208 | plt.plot(x, np.sin(x)) 209 | 210 | # create the second panel and set current axis 211 | plt.subplot(2, 1, 2) 212 | plt.plot(x, np.cos(x)); 213 | ``` 214 | 215 | This is a *stateful* interface. It keeps track of the "current" figure and 216 | axes, which are where all `plt` commands are applied. You can get a reference 217 | to these using the `plt.gcf()` (get current figure) and `plt.gca()` (get 218 | current axes) routines. 219 | 220 | While this stateful interface is fast and convenient for simple plots, it is 221 | easy to run into problems. For example, once the second panel is created, how 222 | do we go back and add something to the first? This is possible, but clunky. 223 | Fortunately, there is a better way. 
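One concrete way to see the limitation, and a partial workaround, is to hold on to the current axes with `plt.gca()`; this sketch is my own illustration, not from the book:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)

plt.figure()

plt.subplot(2, 1, 1)
plt.plot(x, np.sin(x))
ax1 = plt.gca()              # save a handle to the first panel

plt.subplot(2, 1, 2)         # the "current" axes is now the second panel
plt.plot(x, np.cos(x))

ax1.set_title("added to the first panel later")  # reach back via the saved handle
plt.show()
```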
224 | 225 | ##### Object-oriented interface 226 | 227 | The object-oriented interface is available for these more complicated 228 | situations, and for when you want more control over your figure. Rather than 229 | depending on some notion of an "active" figure or axes, in the object-oriented 230 | interface the plotting functions are *methods* of explicit `Figure` and `Axes` 231 | objects. To recreate the previous plot using this style of code, you might do: 232 | 233 | ```python 234 | # First create a grid of plots 235 | # ax will be an array of two Axes objects 236 | fig, ax = plt.subplots(2) 237 | 238 | # Call plot() method on the appropriate object 239 | ax[0].plot(x, np.sin(x)) 240 | ax[1].plot(x, np.cos(x)); 241 | ``` 242 | 243 | For simple plots, the choice of which style to use is largely a matter of 244 | preference, but the object-oriented approach can become a necessity as plots 245 | become more complicated. In this chapter, the most convenient interface will 246 | be used as appropriate. In most cases, the difference is as small as switching 247 | `plt.plot()` to `ax.plot()`, but there are a few gotchas that will be 248 | highlighted as they are encountered. 249 | 250 | ## Simple line plots 251 | 252 | A simple plot is the visualisation of a single function y=f(x). 253 | 254 | As with following sections, we'll start by setting up the notebook for 255 | plotting and importing the packages we will use: 256 | 257 | ```python 258 | %matplotlib inline 259 | import matplotlib.pyplot as plt 260 | plt.style.use('seaborn-whitegrid') 261 | import numpy as np 262 | ``` 263 | 264 | For all Matplotlib plots, we start by creating a figure and axes; for 265 | example, in their simplest form: 266 | 267 | ```python 268 | fig = plt.figure() 269 | ax = plt.axes() 270 | ``` 271 | 272 | A Matplotlib *figure* (an instance of the `plt.Figure` class) can be 273 | thought of as a single container that contains all the objects 274 | representing axes, graphics, text and labels. The *axes* (an instance of 275 | the `plt.Axes` class) is what we see from the code above: a bounding box 276 | with ticks and labels, which will eventually contain the plot elements 277 | that constitute the visualisation. Below, we'll use the variable name 278 | `fig` to refer to `Figure` instance, and `ax` to refer to an `Axes` 279 | instance or group of `Axes` instances. 280 | 281 | Once we have created an axes, we can use `ax.plot()` to plot data: 282 | 283 | ```python 284 | fig = plt.figure() 285 | ax = plt.axes() 286 | 287 | x = np.linspace(0, 10, 1000) 288 | ax.plot(x, np.sin(x)); 289 | ``` 290 | 291 | Alternatively, we can use the pylab interface and let the figure and 292 | axes be created for us in the background: 293 | 294 | ```python 295 | plt.plot(x, np.sin(x)); 296 | ``` 297 | 298 | Calling `plot()` again allows us to create a single figure with multiple 299 | lines: 300 | 301 | ```python 302 | plt.plot(x, np.sin(x)) 303 | plt.plot(x, np.cos(x)); 304 | ``` 305 | 306 | ### Adjusting the plot: line colours and styles 307 | 308 | `plt.plot()` takes arguments to specify line colour and style. 
309 | 310 | For colour adjustment, use the `color` keyword, which can be specified 311 | in a multiple of ways: 312 | 313 | ```python 314 | plt.plot(x, np.sin(x - 0), color='blue') # specify color by name 315 | plt.plot(x, np.sin(x - 1), color='g') # short color code (rgbcmyk) 316 | plt.plot(x, np.sin(x - 2), color='0.75') # Grayscale between 0 and 1 317 | plt.plot(x, np.sin(x - 3), color='#FFDD44') # Hex code (RRGGBB from 00 to FF) 318 | plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values 0 to 1 319 | plt.plot(x, np.sin(x - 5), color='chartreuse'); # all HTML color names supported 320 | ``` 321 | 322 | If no colour is specified, Matplotlib will cycle through a set of 323 | default colours for multiple lines. 324 | 325 | The line style can be adjusted using the `linestyle` keyword: 326 | 327 | ```python 328 | plt.plot(x, x + 0, linestyle='solid') 329 | plt.plot(x, x + 1, linestyle='dashed') 330 | plt.plot(x, x + 2, linestyle='dashdot') 331 | plt.plot(x, x + 3, linestyle='dotted'); 332 | 333 | # For short, you can use the following codes: 334 | plt.plot(x, x + 4, linestyle='-') # solid 335 | plt.plot(x, x + 5, linestyle='--') # dashed 336 | plt.plot(x, x + 6, linestyle='-.') # dashdot 337 | plt.plot(x, x + 7, linestyle=':'); # dotted 338 | ``` 339 | 340 | It is possible to combine colour and style into a single non-keyword 341 | argument to `plt.plot()`: 342 | 343 | ```python 344 | plt.plot(x, x + 0, '-g') # solid green 345 | plt.plot(x, x + 1, '--c') # dashed cyan 346 | plt.plot(x, x + 2, '-.k') # dashdot black 347 | plt.plot(x, x + 3, ':r'); # dotted red 348 | ``` 349 | 350 | There are meny other keyword arguments that can be used to adjust the 351 | appearance of the plot; see the docstring of `plt.plot()`. 352 | 353 | ### Adjusting the plot: axes limits 354 | 355 | Matplotlib does a decent job of choosing default axes limits, but 356 | sometimes more control is needed. The most basic way is to use 357 | `plt.xlim()` and `plt.ylim()` methods. 358 | 359 | ```python 360 | plt.plot(x, np.sin(x)) 361 | 362 | plt.xlim(-1, 11) 363 | plt.ylim(-1.5, 1.5); 364 | ``` 365 | 366 | If you want an axis displayed in reverse, you can reverse the order of 367 | arguments: 368 | 369 | ```python 370 | plt.plot(x, np.sin(x)) 371 | 372 | plt.xlim(10, 0) 373 | plt.ylim(1.2, -1.2); 374 | ``` 375 | 376 | A useful related method is `plt.axis()` (NB: `axis` not `axes`). This 377 | lets you set the x and y limits by passing a list: `[xmin, xmax, ymin, 378 | ymax]`. 379 | 380 | ```python 381 | plt.plot(x, np.sin(x)) 382 | plt.axis([-1, 11, -1.5, 1.5]); 383 | ``` 384 | 385 | The `plt.axis()` method also has other features, for example, allowing 386 | tightening the bounds around the current plot automatically: 387 | 388 | ```python 389 | plt.plot(x, np.sin(x)) 390 | plt.axis('tight'); 391 | ``` 392 | 393 | or higher-level specifications, such as ensuring an equal aspect ratio, 394 | so one unit in x equals one unit in y: 395 | 396 | ```python 397 | plt.plot(x, np.sin(x)) 398 | plt.axis('equal'); 399 | ``` 400 | 401 | For more, see the `plt.axis()` docstring. 402 | 403 | ### Labelling plots 404 | 405 | Titles and axis labels are the simplest labels of plots with methods to 406 | quickly set them: 407 | 408 | ```python 409 | plt.plot(x, np.sin(x)) 410 | plt.title("A Sine Curve") 411 | plt.xlabel("x") 412 | plt.ylabel("sin(x)"); 413 | ``` 414 | 415 | The position, size and style of these labels can be adjusted using 416 | optional arguments: see the documentation and docstrings. 
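For instance (reusing `x` and the imports from earlier in this section; the specific keyword values are arbitrary illustrations, not recommendations from the text):

```python
plt.plot(x, np.sin(x))
plt.title("A Sine Curve", fontsize=16, loc='left')   # left-aligned, larger title
plt.xlabel("x", fontsize=12)
plt.ylabel("sin(x)", color='gray');
```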
417 | 418 | When multiple lines are shown within a single axes, it can be useful to 419 | create a plot legend to label the lines. This is done via 420 | `plt.legend()`. However, it can be easier to specify the label of each 421 | line using the `label` keyword of the `plt.plot()`: 422 | 423 | ```python 424 | plt.plot(x, np.sin(x), '-g', label='sin(x)') 425 | plt.plot(x, np.cos(x), ':b', label='cos(x)') 426 | plt.axis('equal') 427 | 428 | plt.legend(); 429 | ``` 430 | 431 | ### Matplotlib gotchas 432 | 433 | Most `plt` functions translate directly to `ax` methods (such as 434 | `plt.plot()` → `ax.plot()`, `plt.legend()` → `ax.legend()` etc.), this 435 | is not always the case. Functions to set limits, labels and titles are 436 | slightly modified: 437 | 438 | * `plt.xlabel()` → `ax.set_xlabel()` 439 | * `plt.ylabel()` → `ax.set_ylabel()` 440 | * `plt.xlim()` → `ax.set_xlim()` 441 | * `plt.ylim()` → `ax.set_ylim()` 442 | * `plt.title()` → `ax.set_title()` 443 | 444 | In the object-oriented interface to plotting, it is often more 445 | convenient to use `ax.set()` to set these all at once: 446 | 447 | ```python 448 | ax = plt.axes() 449 | ax.plot(x, np.sin(x)) 450 | ax.set(xlim=(0, 10), ylim=(-2, 2), 451 | xlabel='x', ylabel='sin(x)', 452 | title='A Simple Plot'); 453 | ``` 454 | 455 | ## Simple scatter plots 456 | 457 | Another commonly used plot type is the simple scatter plot. Instead of 458 | points being joined by line segments, here points are represented 459 | individually using a dot, circle or other shape. 460 | 461 | We start as we did for line plots: 462 | 463 | ```python 464 | %matplotlib inline 465 | import matplotlib.pyplot as plt 466 | plt.style.use('seaborn-whitegrid') 467 | import numpy as np 468 | ``` 469 | 470 | ### Scatter plots with `plt.plot` 471 | 472 | `plt.plot()` (and `ax.plot()`) can also produce line plots. It turns out 473 | that this same function can produce scatter plots too: 474 | 475 | ```python 476 | x = np.linspace(0, 10, 30) 477 | y = np.sin(x) 478 | 479 | plt.plot(x, y, 'o', color='black'); 480 | ``` 481 | 482 | The third argument in the `plt.plot()` call is a character representing 483 | the type of symbol used for the plotting. Just as you can specify `-`, 484 | `--` to control the line style, the marker style has its own set of 485 | short string codes. Most possibilities are intuitive; here are some 486 | examples: 487 | 488 | ```python 489 | rng = np.random.RandomState(0) 490 | for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']: 491 | plt.plot(rng.rand(5), rng.rand(5), marker, 492 | label="marker='{0}'".format(marker)) 493 | plt.legend(numpoints=1) 494 | plt.xlim(0, 1.8); 495 | ``` 496 | 497 | For more possibilities, these character codes can be used together with 498 | line and colour codes to plot points along with a line connecting them: 499 | 500 | ```python 501 | plt.plot(x, y, '-ok'); 502 | ``` 503 | 504 | Additional keyword arguments to `plt.plot()` specify a wide range of 505 | properties of the lines and markers: 506 | 507 | ```python 508 | plt.plot(x, y, '-p', color='gray', 509 | markersize=15, linewidth=4, 510 | markerfacecolor='white', 511 | markeredgecolor='gray', 512 | markeredgewidth=2) 513 | plt.ylim(-1.2, 1.2); 514 | ``` 515 | 516 | ### Scatter plots with `plt.scatter` 517 | 518 | A second, more powerful method of creating scatter plots is the `plt.scatter()` 519 | function, which can be used very similarly to the `plt.plot()` function. 
520 |
521 | ```python
522 | plt.scatter(x, y, marker='o');
523 | ```
524 |
525 | The primary difference of `plt.scatter` from `plt.plot` is that it can be used
526 | to create scatter plots where properties of each individual point (size, face
527 | colour, edge colour etc.) can be individually controlled or mapped to data.
528 |
529 | The following example uses this to create a random scatter plot with points of
530 | many colours and sizes. The `alpha` keyword adjusts the transparency level to
531 | better see overlapping results.
532 |
533 | ```python
534 | rng = np.random.RandomState(0)
535 | x = rng.randn(100)
536 | y = rng.randn(100)
537 | colors = rng.rand(100)
538 | sizes = 1000 * rng.rand(100)
539 |
540 | plt.scatter(x, y, c=colors, s=sizes, alpha=0.3,
541 |             cmap='viridis')
542 | plt.colorbar();  # show color scale
543 | ```
544 |
545 | The colour argument is automatically mapped to a colour scale (shown by the
546 | `colorbar()` command), and the size argument is given in pixels. In this way,
547 | the colour and size of points can be used to convey information in the
548 | visualisation, in order to visualise multidimensional data.
549 |
550 | This is used in the following example:
551 |
552 | ```python
553 | from sklearn.datasets import load_iris
554 | iris = load_iris()
555 | features = iris.data.T
556 |
557 | plt.scatter(features[0], features[1], alpha=0.2,
558 |             s=100*features[3], c=iris.target, cmap='viridis')
559 | plt.xlabel(iris.feature_names[0])
560 | plt.ylabel(iris.feature_names[1]);
561 | ```
562 |
563 | Each sample is one of three types of flowers that has had the size of its
564 | petals and sepals measured. The plot shows the sepal length and width on the
565 | axes, the species of flower by colour, and the size of each point relates to
566 | the petal width.
567 |
568 | ### Efficiency of `plot` versus `scatter`
569 |
570 | Apart from the different features of `plt.plot` and `plt.scatter`, why might
571 | you want to choose one over the other? While it doesn't matter as much for
572 | small amounts of data, as datasets get larger than a few thousand points,
573 | `plt.plot()` can be noticeably more efficient than `plt.scatter()`. This is
574 | because `plt.scatter()` can render a different colour and/or size for
575 | each point, so the renderer must do the extra work of constructing each point
576 | individually. In `plt.plot()`, the points are essentially clones of each other,
577 | so the work of determining the points' appearance is done once only for the set
578 | of data, and so `plt.plot()` is preferable for large datasets.
579 |
580 | ## Visualising errors
581 |
582 | For scientific measurements, accurate accounting for errors is nearly as
583 | important as accurate reporting of the number itself, if not more
584 | important.
585 |
586 | For example, imagine that I am using some astrophysical observations to
587 | estimate the Hubble Constant, the local measurement of the expansion rate of
588 | the Universe. I know that the current literature suggests a value of around 71
589 | (km/s)/Mpc, and I measure a value of 74 (km/s)/Mpc with my method. Are the
590 | values consistent? The only correct answer, given this information, is this:
591 | there is no way to know.
592 |
593 | Suppose I augment this information with reported uncertainties: the current
594 | literature suggests a value of around 71 ± 2.5 (km/s)/Mpc, and my method has
595 | measured a value of 74 ± 5 (km/s)/Mpc. Now are the values consistent? That is
596 | a question that can be quantitatively answered.
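As a rough back-of-the-envelope check (my own sketch, not from the original notes): assuming the two uncertainties are independent and roughly Gaussian, the difference between the measurements can be compared with their combined uncertainty:

```python
import numpy as np

literature, lit_err = 71.0, 2.5   # (km/s)/Mpc
measured, meas_err = 74.0, 5.0    # (km/s)/Mpc

combined_err = np.hypot(lit_err, meas_err)           # ~5.6 (km/s)/Mpc
n_sigma = abs(measured - literature) / combined_err  # ~0.5

print(f"difference = {measured - literature:.1f}, "
      f"combined uncertainty = {combined_err:.1f}, "
      f"separation = {n_sigma:.1f} sigma")
```

On this rough estimate the values differ by about half a combined standard deviation, so they are consistent.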
597 |
598 | In visualisation of data and results, showing these errors effectively can make
599 | a plot convey much more complete information.
600 |
601 | ### Basic error bars
602 |
603 | Basic error bars can be created with a single Matplotlib function call:
604 |
605 | ```python
606 | %matplotlib inline
607 | import matplotlib.pyplot as plt
608 | plt.style.use('seaborn-whitegrid')
609 | import numpy as np
610 |
611 | x = np.linspace(0, 10, 50)
612 | dy = 0.8
613 | y = np.sin(x) + dy * np.random.randn(50)
614 |
615 | plt.errorbar(x, y, yerr=dy, fmt='.k');
616 | ```
617 |
618 | `fmt` is a format code controlling the appearance of lines and points, and has
619 | the same syntax as the shorthand used in `plt.plot()`.
620 |
621 | `errorbar` has options to further adjust the outputs. For example, it can be
622 | helpful to make the error bars lighter than the points:
623 |
624 | ```python
625 | plt.errorbar(x, y, yerr=dy, fmt='o', color='black',
626 |              ecolor='lightgray', elinewidth=3, capsize=0);
627 | ```
628 |
629 | It is also possible to specify horizontal error bars (`xerr`), one-sided error
630 | bars, and more.
631 |
632 | ### Continuous errors
633 |
634 | In some cases, showing error bars on continuous quantities is desired.
635 | Matplotlib doesn't directly have a routine for this, but it is possible to
636 | combine `plt.plot` and `plt.fill_between` for a useful result.
637 |
638 | (Also note that Seaborn provides visualisation of continuous errors too.)
639 |
640 | This is a simple Gaussian process regression, using scikit-learn; it fits a
641 | very flexible non-parametric function to data with a continuous measure of the
642 | uncertainty. (NB: newer scikit-learn releases replace the legacy `GaussianProcess` API used below with `GaussianProcessRegressor`.)
643 |
644 | ```python
645 | from sklearn.gaussian_process import GaussianProcess
646 |
647 | # define the model and draw some data
648 | model = lambda x: x * np.sin(x)
649 | xdata = np.array([1, 3, 5, 6, 8])
650 | ydata = model(xdata)
651 |
652 | # Compute the Gaussian process fit
653 | gp = GaussianProcess(corr='cubic', theta0=1e-2, thetaL=1e-4, thetaU=1E-1,
654 |                      random_start=100)
655 | gp.fit(xdata[:, np.newaxis], ydata)
656 |
657 | xfit = np.linspace(0, 10, 1000)
658 | yfit, MSE = gp.predict(xfit[:, np.newaxis], eval_MSE=True)
659 | dyfit = 2 * np.sqrt(MSE)  # 2*sigma ~ 95% confidence region
660 | ```
661 |
662 | We now have `xfit`, `yfit` and `dyfit` which sample the continuous fit to the
663 | data. We could pass these to `plt.errorbar` but we don't want to plot 1000
664 | points with 1000 errorbars. Instead, we can use `plt.fill_between()` with a
665 | light colour to visualise this continuous error.
666 |
667 | ```python
668 | # Visualize the result
669 | plt.plot(xdata, ydata, 'or')
670 | plt.plot(xfit, yfit, '-', color='gray')
671 |
672 | plt.fill_between(xfit, yfit - dyfit, yfit + dyfit,
673 |                  color='gray', alpha=0.2)
674 | plt.xlim(0, 10);
675 | ```
676 |
677 | With `fill_between()`, we pass the x values, then the lower and upper y bounds,
678 | and the area between these regions is filled.
679 |
680 | The figure gives an insight into what the Gaussian process regression does.
681 | Near measured data points, the model is strongly constrained, with small model
682 | errors, while further away, the model is not strongly constrained and errors
683 | increase.
684 |
685 | ## Density and contour plots
686 |
687 | Sometimes it is useful to display 3D data in 2D using contours or colour-coded
688 | regions.
There are three Matplotlib functions that can help here: `plt.contour`
689 | for contour plots, `plt.contourf` for filled contour plots and `plt.imshow` for
690 | showing images.
691 |
692 | Again, setting up as before:
693 |
694 | ```python
695 | %matplotlib inline
696 | import matplotlib.pyplot as plt
697 | plt.style.use('seaborn-white')
698 | import numpy as np
699 | ```
700 |
701 | ### Visualising a 3D function
702 |
703 | We'll start by demonstrating a contour plot of a function z = f(x, y), using
704 | the following function:
705 |
706 | ```python
707 | def f(x, y):
708 |     return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
709 | ```
710 |
711 | A contour plot can be created with `plt.contour()`. It takes three arguments:
712 | a grid of *x* values, a grid of *y* values and a grid of *z* values. The *x*
713 | and *y* values represent positions on the plot, and the *z* values will be
714 | represented by the contour levels. Perhaps the most straightforward way to
715 | prepare such data is to use the `np.meshgrid` function, which builds 2D grids
716 | from 1D arrays:
717 |
718 | ```python
719 | x = np.linspace(0, 5, 50)
720 | y = np.linspace(0, 5, 40)
721 |
722 | X, Y = np.meshgrid(x, y)
723 | Z = f(X, Y)
724 | ```
725 |
726 | To create a line-only contour plot:
727 |
728 | ```python
729 | plt.contour(X, Y, Z, colors='black');
730 | ```
731 |
732 | When a single colour is used, negative values are represented by dashed lines,
733 | and positive values by solid lines. Alternatively, the lines can be
734 | colour-coded by specifying a colormap with the `cmap` argument:
735 |
736 | ```python
737 | plt.contour(X, Y, Z, 20, cmap='RdGy');
738 | ```
739 |
740 | Above, 20 is the number of equally spaced intervals within the data range.
741 |
742 | `RdGy` is the *Red-Grey* colormap, which is a good choice for centred data.
743 | Matplotlib has a range of colormaps available, which are found inside the
744 | `plt.cm` module.
745 |
746 | The spaces between the lines can look distracting, so we can change this to
747 | a filled contour plot using `plt.contourf()`, which uses largely the same syntax
748 | as `plt.contour()`.
749 |
750 | ```python
751 | plt.contourf(X, Y, Z, 20, cmap='RdGy')
752 | plt.colorbar();  # Create additional axis with labelled colour info for plot.
753 | ```
754 |
755 | Here, the colorbar makes it clear that the black regions are "peaks", while the
756 | red regions are "valleys".
757 |
758 | One potential issue with this plot is that it is "splotchy": colour steps are
759 | discrete rather than continuous. This could be remedied by setting the number
760 | of contours to a high number, but this results in an inefficient plot, as
761 | Matplotlib must render a new polygon for each step in the level. A better
762 | solution is to use the `plt.imshow()` function, which interprets a 2D grid of
763 | data as an image:
764 |
765 | ```python
766 | plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower',
767 |            cmap='RdGy')
768 | plt.colorbar()
769 | plt.axis(aspect='image');
770 | ```
771 |
772 | There are gotchas with `imshow()`:
773 |
774 | * it doesn't accept an *x* and *y* grid, so you must specify the extent
775 |   [*xmin*, *xmax*, *ymin*, *ymax*] of the image on the plot.
776 | * it follows the standard image array definition where the origin is in the
777 |   upper left, not in the lower left as in most contour plots. This must be
778 |   changed when showing gridded data.
779 | * it will automatically adjust the axis aspect ratio to match the input data; 780 | this can be changed by setting, e.g. `plt.axis(aspect='image')` to make 781 | *x* and *y* units match. 782 | 783 | Finally, it can be sometimes useful to combine contour plots and image plots. 784 | Here a partially transparent background image is used with contours 785 | overplotted, and labels on the contours themselves: 786 | 787 | ```python 788 | contours = plt.contour(X, Y, Z, 3, colors='black') 789 | plt.clabel(contours, inline=True, fontsize=8) # Label contours. 790 | 791 | plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower', 792 | cmap='RdGy', alpha=0.5) # alpha sets transparency. 793 | plt.colorbar(); 794 | ``` 795 | 796 | ## Histograms, binnings and density 797 | 798 | A simple histogram can be a first step in understanding a dataset. 799 | 800 | It's possible to create a basic histogram in one line, once the normal imports 801 | are done: 802 | 803 | ```python 804 | %matplotlib inline 805 | import numpy as np 806 | import matplotlib.pyplot as plt 807 | plt.style.use('seaborn-white') 808 | 809 | data = np.random.randn(1000) 810 | 811 | plt.hist(data); 812 | ``` 813 | 814 | The `hist()` function has many options to tune both the calculation and the 815 | display; here's an example of a more customised histogram: 816 | 817 | ```python 818 | plt.hist(data, bins=30, normed=True, alpha=0.5, 819 | histtype='stepfilled', color='steelblue', 820 | edgecolor='none'); 821 | ``` 822 | 823 | `stepfilled` with `alpha` can be useful to compare histograms of different 824 | distributions: 825 | 826 | ```python 827 | x1 = np.random.normal(0, 0.8, 1000) 828 | x2 = np.random.normal(-2, 1, 1000) 829 | x3 = np.random.normal(3, 2, 1000) 830 | 831 | kwargs = dict(histtype='stepfilled', alpha=0.3, normed=True, bins=40) 832 | 833 | plt.hist(x1, **kwargs) 834 | plt.hist(x2, **kwargs) 835 | plt.hist(x3, **kwargs); 836 | ``` 837 | 838 | If you would like to compute the histogram (count the points in a given bin), 839 | but not display it, use `np.histogram()`: 840 | 841 | ```python 842 | counts, bin_edges = np.histogram(data, bins=5) 843 | print(counts) 844 | ``` 845 | 846 | ### 2D histograms and binnings 847 | 848 | Just as we create histograms in 1D by dividing the number line into bins, we 849 | can create histograms in 2D by dividing points among 2D bins. 850 | 851 | Start by defining some data — an *x* and *y* array drawn from a multivariate 852 | Gaussian distribution: 853 | 854 | ```python 855 | mean = [0, 0] 856 | cov = [[1, 1], [1, 2]] 857 | x, y = np.random.multivariate_normal(mean, cov, 10000).T 858 | ``` 859 | 860 | #### `plt.hist2d`: 2D histogram 861 | 862 | This is an easy way to create a 2D histogram: 863 | 864 | ```python 865 | plt.hist2d(x, y, bins=30, cmap='Blues') 866 | cb = plt.colorbar() 867 | cb.set_label('counts in bin') 868 | ``` 869 | 870 | `plt.hist2d` has extra options to adjust the plot and binning, just as 871 | `plt.hist`. And, just as there is `np.histogram`, there is `np.histogram2d` 872 | which can be used as: 873 | 874 | ```python 875 | counts, xedges, yedges = np.histogram2d(x, y, bins=30) 876 | ``` 877 | 878 | There is also `np.histogramdd()` for histogram binning in more than two 879 | dimensions. 880 | 881 | #### `plt.hexbin`: hexagonal binnings 882 | 883 | The 2D histogram creates a tesselation of squares across the axes. Another 884 | shape for such a tesselation is the regular hexagon. 
Matplotlib provides the 885 | `plt.hexbin` function to represent a 2D dataset binned with a grid of hexagons. 886 | 887 | ```python 888 | plt.hexbin(x, y, gridsize=30, cmap='Blues') 889 | cb = plt.colorbar(label='count in bin') 890 | ``` 891 | 892 | `plt.hexbin` also has a number of options, including the ability to specify 893 | weights for each point, and to change the output in each bin to any NumPy 894 | aggregate (mean of weights, standard deviation of weights etc.). 895 | 896 | #### Kernel density estimation 897 | 898 | Another method to evaluate densities in multiple dimensions is *kernel density 899 | estimation* (KDE). KDE can be thought of as a way to "smear out" the points in 900 | space and add up the result to obtain a smooth function. 901 | 902 | `scipy.stats` contains a quick and simple KDE implementation: 903 | 904 | ```python 905 | from scipy.stats import gaussian_kde 906 | 907 | # fit an array of size [Ndim, Nsamples] 908 | data = np.vstack([x, y]) 909 | kde = gaussian_kde(data) 910 | 911 | # evaluate on a regular grid 912 | xgrid = np.linspace(-3.5, 3.5, 40) 913 | ygrid = np.linspace(-6, 6, 40) 914 | Xgrid, Ygrid = np.meshgrid(xgrid, ygrid) 915 | Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()])) 916 | 917 | # Plot the result as an image 918 | plt.imshow(Z.reshape(Xgrid.shape), 919 | origin='lower', aspect='auto', 920 | extent=[-3.5, 3.5, -6, 6], 921 | cmap='Blues') 922 | cb = plt.colorbar() 923 | cb.set_label("density") 924 | ``` 925 | 926 | KDE has a smoothing length that effectively turns the dial between detail and 927 | smoothness. `gaussian_kde` uses a rule-of-thumb to attempt to find a nearly 928 | optimal smoothing length for the input data. 929 | 930 | Other KDE implementations are available, e.g. 931 | `sklearn.neighbors.KernelDensity` and 932 | `statsmodels.nonparametric.kernel_density.KDEMultivariate`. KDE visualisations 933 | with Matplotlib can be verbose; using Seaborn (see below) can be more terse. 934 | 935 | ## Customising plot legends 936 | 937 | Plot legends assign meaning to various plot elements. 938 | 939 | The simplest legend can be created with `plt.legend()`, creating a 940 | legend automatically for any labelled plot elements: 941 | 942 | ```python 943 | import matplotlib.pyplot as plt 944 | plt.style.use('classic') 945 | %matplotlib inline 946 | import numpy as np 947 | 948 | x = np.linspace(0, 10, 1000) 949 | fig, ax = plt.subplots() 950 | ax.plot(x, np.sin(x), '-b', label='Sine') 951 | ax.plot(x, np.cos(x), '--r', label='Cosine') 952 | ax.axis('equal') 953 | leg = ax.legend(); 954 | ``` 955 | 956 | But there are many ways we might want to customise a legend. For 957 | example, we can specify the location and turn off the frame: 958 | 959 | ```python 960 | ax.legend(loc='upper left', frameon=False) 961 | fig 962 | ``` 963 | 964 | We can use `ncol` to specify the number of columns in the legend: 965 | 966 | ```python 967 | ax.legend(frameon=False, loc='lower center', ncol=2) 968 | fig 969 | ``` 970 | 971 | We can use a rounded box (`fancybox`) or add a shadow, change the 972 | transparency of the frame, or the text padding: 973 | 974 | ```python 975 | ax.legend(fancybox=True, framealpha=1, shadow=True, borderpad=1) 976 | fig 977 | ``` 978 | 979 | ### Choosing elements for the legend 980 | 981 | The legend includes all labelled elements by default. If we do not want 982 | this, we can fine tune which elements and labels appear by using the 983 | objects returned by plot commands. 
`plt.plot()` can create multiple 984 | lines at once, and returns a list of created line instances. Passing 985 | any of these to `plt.legend()` tell it which to identify, along with the 986 | labels we want to specify: 987 | 988 | ```python 989 | y = np.sin(x[:, np.newaxis] + np.pi * np.arange(0, 2, 0.5)) 990 | lines = plt.plot(x, y) 991 | 992 | # lines is a list of plt.Line2D instances 993 | plt.legend(lines[:2], ['first', 'second']); 994 | ``` 995 | 996 | It can be clearer to instead apply labels to the plot elements you'd 997 | like to show on the legend: 998 | 999 | ```python 1000 | plt.plot(x, y[:, 0], label='first') 1001 | plt.plot(x, y[:, 1], label='second') 1002 | plt.plot(x, y[:, 2:]) 1003 | plt.legend(framealpha=1, frameon=True); 1004 | ``` 1005 | 1006 | By default, the legend ignores elements without a `label` set. 1007 | 1008 | ### Legend for size of points 1009 | 1010 | Sometimes the legend defaults are insufficient for the given 1011 | visualisation. For example, if you use the size of points to mark 1012 | features of the data and want to create a legend to reflect this. 1013 | 1014 | Here is such an example: size of points indicate populations of 1015 | California cities. The legend should specify the scale of the sizes of 1016 | the points, and this is achieved by plotting labelled data with no 1017 | entries: 1018 | 1019 | ```python 1020 | import pandas as pd 1021 | cities = pd.read_csv('data/california_cities.csv') 1022 | 1023 | # Extract the data we're interested in 1024 | lat, lon = cities['latd'], cities['longd'] 1025 | population, area = cities['population_total'], cities['area_total_km2'] 1026 | 1027 | # Scatter the points, using size and color but no label 1028 | plt.scatter(lon, lat, label=None, 1029 | c=np.log10(population), cmap='viridis', 1030 | s=area, linewidth=0, alpha=0.5) 1031 | plt.axis(aspect='equal') 1032 | plt.xlabel('longitude') 1033 | plt.ylabel('latitude') 1034 | 1035 | plt.colorbar(label='log$_{10}$(population)') 1036 | plt.clim(3, 7) 1037 | 1038 | # Here we create a legend: 1039 | # we'll plot empty lists with the desired size and label 1040 | for area in [100, 300, 500]: 1041 | plt.scatter([], [], c='k', alpha=0.3, s=area, 1042 | label=str(area) + ' km$^2$') 1043 | plt.legend(scatterpoints=1, frameon=False, labelspacing=1, title='City Area') 1044 | 1045 | plt.title('California Cities: Area and Population'); 1046 | ``` 1047 | 1048 | The legend always references some object that is on the plot, so if we want to 1049 | display a particular shape we need to plot it. The circles we want for the 1050 | legend are not on the plot, so we fake them by plotting empty lists. 1051 | 1052 | By plotting empty lists, we create labelled plot objects that are picked up by 1053 | the legend, and now the legend tells us useful information. 1054 | 1055 | (It creates a legend using one scatter point for each "plot", where the point 1056 | size equals the area, which is specified as 100, 300 and 500, and will be to 1057 | the same scale as the real plots, even though no points are actually plotted.) 1058 | 1059 | ### Multiple legends 1060 | 1061 | Sometimes you would like to add multiple legends to the same axes. Matplotlib 1062 | does not make this easy. The standard `legend` interface only allows creating 1063 | a single legend for the plot. Using `plt.legend()` or `ax.legend()` repeatedly 1064 | just overrides a previous entry. 
We can work around this by creating a new 1065 | legend artist from scratch, and then using the lower level `ax.add_artist()` 1066 | method to manually add the second artist to the plot: 1067 | 1068 | ```python 1069 | fig, ax = plt.subplots() 1070 | 1071 | lines = [] 1072 | styles = ['-', '--', '-.', ':'] 1073 | x = np.linspace(0, 10, 1000) 1074 | 1075 | for i in range(4): 1076 | lines += ax.plot(x, np.sin(x - i * np.pi / 2), 1077 | styles[i], color='black') 1078 | ax.axis('equal') 1079 | 1080 | # specify the lines and labels of the first legend 1081 | ax.legend(lines[:2], ['line A', 'line B'], 1082 | loc='upper right', frameon=False) 1083 | 1084 | # Create the second legend and add the artist manually. 1085 | from matplotlib.legend import Legend 1086 | leg = Legend(ax, lines[2:], ['line C', 'line D'], 1087 | loc='lower right', frameon=False) 1088 | ax.add_artist(leg); 1089 | ``` 1090 | 1091 | ## Customising colorbars 1092 | 1093 | Plot legends identify discrete labels of discrete points. For continuous 1094 | labels based on the colour of points, lines or regions, a labelled 1095 | colorbar can be a great tool. In Matplotlib, a colorbar is a separate 1096 | axes that can provide a key for the meaning of colours in a plot. 1097 | 1098 | ```python 1099 | import matplotlib.pyplot as plt 1100 | plt.style.use('classic') 1101 | %matplotlib inline 1102 | import numpy as np 1103 | ``` 1104 | 1105 | To create a simple colorbar, use `plt.colorbar()`: 1106 | 1107 | ```python 1108 | x = np.linspace(0, 10, 1000) 1109 | I = np.sin(x) * np.cos(x[:, np.newaxis]) 1110 | 1111 | plt.imshow(I) 1112 | plt.colorbar(); 1113 | ``` 1114 | 1115 | ### Adjusting colorbars 1116 | 1117 | The colormap can be specified using the `cmap` argument to the plotting 1118 | function that is creating the visualisation: 1119 | 1120 | ```python 1121 | plt.imshow(I, cmap='gray'); 1122 | ``` 1123 | 1124 | The available colormaps are in the `plt.cm` namespace. 1125 | 1126 | #### Choosing the colormap 1127 | 1128 | There are three categories of colormap: 1129 | 1130 | * *Sequential colormaps*: these are made up of one continuous sequence 1131 | of colour (e.g. `binary` or `viridis`). 1132 | * *Divergent colormaps*: these usually contain two distinct colors, which 1133 | shows positive and negative deviations from a mean (e.g. `RdBu` or 1134 | `PuOr`). 1135 | * *Qualitative colormaps*: these mix colors with no particular sequence 1136 | (e.g. `rainbow` or `jet`). 1137 | 1138 | Qualitative colormaps can be a poor choice for quantitative data: for 1139 | instance, they do not usually display a uniform progression in 1140 | brightness as the scale increases. This in turn means that the eye can 1141 | be drawn to certain portions of the colour regions, potentially 1142 | emphasising unimportant parts of the dataset. Colormaps with an even 1143 | brightness variation across the range are better here, and translate 1144 | well to greyscale printing too. `cubehelix` is a better alternative 1145 | rainbow scheme for continuous data. 1146 | 1147 | Colormaps such as `RdBu` lose the positive-negative information on 1148 | conversion to greyscale! 1149 | 1150 | #### Colour limits and extensions 1151 | 1152 | Matplotlib allows for a large range of colorbar customisation. The 1153 | colorbar itself is an instance of `plt.Axes`, so all of the axes and 1154 | tick formatting tricks above apply. The colorbar has flexibility too: 1155 | e.g. 
we can narrow the colour limits and indicate the out-of-bounds
1156 | values with a triangular arrow at the top and bottom by setting `extend`.
1157 | This might be useful if displaying an image that is subject to noise.
1158 |
1159 | ```python
1160 | # make noise in 1% of the image pixels
1161 | speckles = (np.random.random(I.shape) < 0.01)  # Produces a Boolean mask.
1162 | I[speckles] = np.random.normal(0, 3, np.count_nonzero(speckles))
1163 |
1164 | plt.figure(figsize=(10, 3.5))
1165 |
1166 | plt.subplot(1, 2, 1)
1167 | plt.imshow(I, cmap='RdBu')
1168 | plt.colorbar()
1169 |
1170 | plt.subplot(1, 2, 2)
1171 | plt.imshow(I, cmap='RdBu')
1172 | plt.colorbar(extend='both')
1173 | plt.clim(-1, 1);
1174 | ```
1175 |
1176 | In the first image generated here, the default colour limits respond to
1177 | the noisy pixels, largely washing out the pattern we are interested in.
1178 | In the second image, the colour limits are set manually and
1179 | out-of-bounds values are indicated by the arrows.
1180 |
1181 | #### Discrete colorbars
1182 |
1183 | Colormaps are by default continuous, but sometimes you'd like them to be
1184 | discrete. The easiest way is to use `plt.cm.get_cmap()` and pass the
1185 | name of a suitable colormap with the number of desired bins:
1186 |
1187 | ```python
1188 | plt.imshow(I, cmap=plt.cm.get_cmap('Blues', 6))
1189 | plt.colorbar()
1190 | plt.clim(-1, 1);
1191 | ```
1192 |
1193 | #### Other notes
1194 |
1195 | Can use `ticks` and `label` in `plt.colorbar()` to customise the
1196 | colorbar.
1197 |
1198 | ## Multiple subplots
1199 |
1200 | Sometimes it is helpful to compare different views of data side by side.
1201 | For this purpose, Matplotlib has *subplots*: groups of smaller axes that
1202 | can exist together within a single figure. These subplots might be
1203 | insets, grids of plots or more complicated layouts.
1204 |
1205 | ```python
1206 | %matplotlib inline
1207 | import matplotlib.pyplot as plt
1208 | plt.style.use('seaborn-white')
1209 | import numpy as np
1210 | ```
1211 |
1212 | ### `plt.axes`: subplots by hand
1213 |
1214 | The most basic method of creating an axes is `plt.axes()`. As we've seen
1215 | previously, by default this creates a standard axes object that fills
1216 | the entire figure. `plt.axes` also takes an optional argument that is a
1217 | list of four numbers in the figure coordinate system. These numbers
1218 | represent `[left, bottom, width, height]` in the figure coordinate
1219 | system, which ranges from 0 at the bottom left of the figure to 1 at the
1220 | top right of the figure.
1221 |
1222 | For example, we might create an inset axes at the top-right corner of
1223 | another axes by setting the *x* and *y* position to 0.65 (starting at
1224 | 65% of the width, and 65% of the height of the figure), and the *x* and
1225 | *y* extents to 0.2 (the size of the axes is 20% of the width and 20% of
1226 | the height of the figure):
1227 |
1228 | ```python
1229 | ax1 = plt.axes()  # standard axes
1230 | ax2 = plt.axes([0.65, 0.65, 0.2, 0.2])
1231 | ```
1232 |
1233 | The equivalent of this command in the object-oriented interface is
1234 | `fig.add_axes()`.
Using this to create two vertically stacked axes:
1235 |
1236 | ```python
1237 | fig = plt.figure()
1238 | ax1 = fig.add_axes([0.1, 0.5, 0.8, 0.4],
1239 |                    xticklabels=[], ylim=(-1.2, 1.2))
1240 | ax2 = fig.add_axes([0.1, 0.1, 0.8, 0.4],
1241 |                    ylim=(-1.2, 1.2))
1242 |
1243 | x = np.linspace(0, 10)
1244 | ax1.plot(np.sin(x))
1245 | ax2.plot(np.cos(x));
1246 | ```
1247 |
1248 | We now have two axes (the top with no tick labels) that are just touching: the
1249 | bottom of the upper panel (position 0.5) matches the top of the lower panel
1250 | (position 0.1 + 0.4).
1251 |
1252 | ### `plt.subplot`: simple grids of subplots
1253 |
1254 | Aligned columns or rows of subplots are used often enough that Matplotlib has
1255 | routines to create them easily. The lowest level of these is `plt.subplot()`,
1256 | which creates a single subplot within a grid. It takes three integer arguments:
1257 | the number of rows, the number of columns, and the index of the plot to be
1258 | created in this scheme, which runs from upper left to bottom right.
1259 |
1260 | ```python
1261 | for i in range(1, 7):
1262 |     plt.subplot(2, 3, i)
1263 |     plt.text(0.5, 0.5, str((2, 3, i)),
1264 |              fontsize=18, ha='center')
1265 | ```
1266 |
1267 | `plt.subplots_adjust()` can be used to adjust the spacing between plots.
1268 |
1269 | `fig.add_subplot()` is the equivalent object-oriented command:
1270 |
1271 | ```python
1272 | fig = plt.figure()
1273 | fig.subplots_adjust(hspace=0.4, wspace=0.4)
1274 | for i in range(1, 7):
1275 |     ax = fig.add_subplot(2, 3, i)
1276 |     ax.text(0.5, 0.5, str((2, 3, i)),
1277 |             fontsize=18, ha='center')
1278 | ```
1279 |
1280 | The `hspace` and `wspace` arguments of `subplots_adjust()` specify the spacing
1281 | along the height and width of the figure, in units of the subplot size (in
1282 | this case, the space is 40% of the subplot height and width).
1283 |
1284 | ### `plt.subplots`: the whole grid in one go
1285 |
1286 | The approach described above can be tedious when creating a large grid of
1287 | subplots, especially if you'd like to hide the x- and y-axis labels on the
1288 | inner plots. For this purpose, `plt.subplots()` is the easier tool to use.
1289 | Instead of creating a single subplot, `plt.subplots()` creates a full grid of
1290 | subplots in a single line, returning them in a NumPy array. The arguments are
1291 | the number of rows and number of columns, along with optional keywords `sharex`
1292 | and `sharey` allowing you to specify the relationships between different axes.
1293 |
1294 | This example creates a 2x3 grid of subplots where all axes in the same row
1295 | share their y-axis scale and all axes in the same column share their x-axis
1296 | scale:
1297 |
1298 | ```python
1299 | fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')
1300 | ```
1301 |
1302 | Using `sharex` and `sharey` removes the inner labels on the grid to make the
1303 | plot cleaner. The resulting grid of axes instances is returned within a NumPy
1304 | array, allowing for specification of the desired axes using array indexing:
1305 |
1306 | ```python
1307 | # axes are in a two-dimensional array, indexed by [row, col]
1308 | for i in range(2):
1309 |     for j in range(3):
1310 |         ax[i, j].text(0.5, 0.5, str((i, j)),
1311 |                       fontsize=18, ha='center')
1312 | fig
1313 | ```
1314 |
1315 | ### `plt.GridSpec`: more complicated arrangements
1316 |
1317 | To go beyond a regular grid to subplots that span multiple rows and columns,
1318 | `plt.GridSpec` is the best tool.
It does not create a plot itself: it is simply 1319 | a convenient interface recognised by `plt.subplot()`. For example, a gridspec 1320 | for a grid of two rows and three columns with some specified width and height 1321 | space looks like: 1322 | 1323 | ```python 1324 | grid = plt.GridSpec(2, 3, wspace=0.4, hspace=0.3) 1325 | ``` 1326 | 1327 | We can specify subplot locations and extents using Python slicing syntax: 1328 | 1329 | ```python 1330 | plt.subplot(grid[0, 0]) 1331 | plt.subplot(grid[0, 1:]) 1332 | plt.subplot(grid[1, :2]) 1333 | plt.subplot(grid[1, 2]); 1334 | ``` 1335 | 1336 | ## Text and annotation 1337 | 1338 | Axes labels and titles are the most basic types of annotations, but the options 1339 | go beyond this. 1340 | 1341 | `plt.text`/`ax.text` allow placing of text at a particular, x/y value, e.g. 1342 | 1343 | ```python 1344 | style = dict(size=10, color='gray') 1345 | ax.text('2012-1-1', 3950, "New Year's Day", **style) 1346 | ax.text('2012-7-4', 4250, "Independence Day", ha='center', **style) 1347 | ``` 1348 | 1349 | `ax.text()` takes x position, y position, a string, and optional keywords 1350 | specifying colour, size, style, alignment and other properties of the text. 1351 | 1352 | `ha` is horizontal alignment. 1353 | 1354 | ### Transforms and text position 1355 | 1356 | We previously anchored text annotations to data locations. Sometimes it is 1357 | preferable to anchor the text to a position on the axes or figure, independent 1358 | of the data. In Matplotlib, this is done by modifying the *transform*. 1359 | 1360 | Any graphics display framework needs some scheme for translating between 1361 | coordinate systems. For example, a data point at x,y of 1,1 needs to be 1362 | represented at a certain location on the figure, which in turn needs to be 1363 | represented by pixels on the screen. Mathematically, such coordinate 1364 | transformations are straightforward and Matplotlib has tools it uses internally 1365 | to perform them (in the `matplotlib.transforms` module). 1366 | 1367 | These details are usually not important to typical users, but it is helpful 1368 | to be aware of when considering the placement of text on a figure. 1369 | 1370 | There are three predefined transforms that can be useful in this situation: 1371 | 1372 | * `ax.transData`: transform associated with data coordinates 1373 | * `ax.transAxes`: transform associated with the axes (in units of axes dimensions) 1374 | * `fig.transFigure`: transform associated with the figure (in units of figure dimensions) 1375 | 1376 | Here is an example of drawing text at locations using these transforms: 1377 | 1378 | ```python 1379 | fig, ax = plt.subplots(facecolor='lightgray') 1380 | ax.axis([0, 10, 0, 10]) 1381 | 1382 | # transform=ax.transData is the default, but we'll specify it anyway 1383 | ax.text(1, 5, ". Data: (1, 5)", transform=ax.transData) 1384 | ax.text(0.5, 0.1, ". Axes: (0.5, 0.1)", transform=ax.transAxes) 1385 | ax.text(0.2, 0.2, ". Figure: (0.2, 0.2)", transform=fig.transFigure); 1386 | ``` 1387 | 1388 | Note that by default, the text is aligned above and to the left of the 1389 | specified coordinates: here the "." at the beginning of each string will 1390 | approximately mark the given coordinate location. 1391 | 1392 | The `transData` coordinates give the usual data coordinates associated with the 1393 | x- and y-axis labels. The `transAxes` coordinates give the location from the 1394 | bottom-left corner of the axes, as a fraction of the axes size. 
The 1395 | `transFigure` coordinates are similar, specifying the position from the 1396 | bottom-left of the figure as a fraction of the figure size. 1397 | 1398 | Notice now that if we change the axes limits, it is only the `transData` 1399 | coordinates that will be affected, while the others remain stationary: 1400 | 1401 | ```python 1402 | ax.set_xlim(0, 2) 1403 | ax.set_ylim(-6, 6) 1404 | fig 1405 | ``` 1406 | 1407 | ### Arrows and annotation 1408 | 1409 | Along with tick marks and text, arrows are another useful annotation. 1410 | 1411 | Drawing arrows in Matplotlib is tricky. There is a `plt.arrow()` function 1412 | available, but is not too useful: the arrows it creates are SVG objects that 1413 | are subject to the varying aspect ratio of your plots, and the result is not 1414 | what the user intended. Instead, `plt.annotate()` is perhaps better. This 1415 | function creates some text and an arrow, and the arrows can be flexibly 1416 | specified. 1417 | 1418 | ```python 1419 | %matplotlib inline 1420 | 1421 | fig, ax = plt.subplots() 1422 | 1423 | x = np.linspace(0, 20, 1000) 1424 | ax.plot(x, np.cos(x)) 1425 | ax.axis('equal') 1426 | 1427 | ax.annotate('local maximum', xy=(6.28, 1), xytext=(10, 4), 1428 | arrowprops=dict(facecolor='black', shrink=0.05)) 1429 | 1430 | ax.annotate('local minimum', xy=(5 * np.pi, -1), xytext=(2, -6), 1431 | arrowprops=dict(arrowstyle="->", 1432 | connectionstyle="angle3,angleA=0,angleB=-90")); 1433 | ``` 1434 | 1435 | The arrow style is controlled through the `arrowprops` dictionary, with many 1436 | options. 1437 | 1438 | ## Customising ticks 1439 | 1440 | Matplotlib's default tick locators and formatters are sufficient in many 1441 | situations, but can be adjusted if not. 1442 | 1443 | Matplotlib plots have an object hierarchy. Matplotlib aims to have a Python 1444 | object representing everything that appears on the plot: for example, the 1445 | `figure` is the bounding box within which plot elements appear. Each object 1446 | can act as a container of sub-objects: e.g. each `figure` can contain one or 1447 | more `axes` objects, each of which in turn contain other objects representing 1448 | plot contents. 1449 | 1450 | The tick marks are no exception. Each `axes` has attributes `xaxis` and `yaxis` 1451 | which in turn have attributes that contain all the properties of the lines, 1452 | ticks and labels that make up the axes. 1453 | 1454 | ### Major and minor ticks 1455 | 1456 | Within each axis, there is the concept of a *major* tick mark and a *minor* 1457 | tick mark. As the names imply, major ticks are usually bigger or more 1458 | pronounced, while minor ticks are usually smaller. By default, Matplotlib 1459 | rarely makes use of minor ticks, but one place you can see them is within 1460 | logarithmic plots: 1461 | 1462 | ```python 1463 | import matplotlib.pyplot as plt 1464 | plt.style.use('classic') 1465 | %matplotlib inline 1466 | import numpy as np 1467 | 1468 | ax = plt.axes(xscale='log', yscale='log') 1469 | ax.grid(); 1470 | ``` 1471 | 1472 | We see that each major tick shows a large tickmark and a label, while each 1473 | minor tick shows a smaller tickmark with no label. 1474 | 1475 | These tick properties — locations and labels — can be customised by setting 1476 | the `formatter` and `locator` objects of each axis. 
Let's examine these 1477 | for the x axis of the just shown logarithmic plot: 1478 | 1479 | ```python 1480 | print(ax.xaxis.get_major_locator()) # LogLocator 1481 | print(ax.xaxis.get_minor_locator()) # LogLocator 1482 | print(ax.xaxis.get_major_formatter()) # LogFormatterMathtext 1483 | print(ax.xaxis.get_minor_formatter()) # NullFormatter 1484 | ``` 1485 | 1486 | ### Hiding ticks or labels 1487 | 1488 | A common formatting operation is hiding ticks or labels. This can be done using 1489 | `plt.NullLocator()` or `plt.NullFormatter()`: 1490 | 1491 | ```python 1492 | ax = plt.axes() 1493 | ax.plot(np.random.rand(50)) 1494 | 1495 | ax.yaxis.set_major_locator(plt.NullLocator()) 1496 | ax.xaxis.set_major_formatter(plt.NullFormatter()) 1497 | ``` 1498 | 1499 | We've removed the labels, but kept the ticks, from the x-axis, and removed the 1500 | ticks (and the labels too) from the y-axis. Having no ticks at all can be 1501 | useful: for example, when you want to show a grid of images. 1502 | 1503 | ### Reducing or increasing the number of ticks 1504 | 1505 | A problem with the default settings is that smaller subplots can end up with 1506 | crowded labels. For example: 1507 | 1508 | ```python 1509 | fig, ax = plt.subplots(4, 4, sharex=True, sharey=True) 1510 | ``` 1511 | 1512 | Particularly for the x ticks, the numbers almost overlap, making them 1513 | difficult to read. We can fix this with `plt.MaxNLocator()` which allows 1514 | specification of the maximum number of ticks to display. Given this maximum 1515 | number, Matplotlib will use internal logic to choose the particular tick 1516 | locations. 1517 | 1518 | ```python 1519 | # For every axis, set the x and y major locator 1520 | for axi in ax.flat: 1521 | axi.xaxis.set_major_locator(plt.MaxNLocator(3)) 1522 | axi.yaxis.set_major_locator(plt.MaxNLocator(3)) 1523 | fig 1524 | ``` 1525 | 1526 | ### Fancy tick formats 1527 | 1528 | The default tick formatting works well as a default, but sometimes you may 1529 | want more. Consider: 1530 | 1531 | ```python 1532 | # Plot a sine and cosine curve 1533 | fig, ax = plt.subplots() 1534 | x = np.linspace(0, 3 * np.pi, 1000) 1535 | ax.plot(x, np.sin(x), lw=3, label='Sine') 1536 | ax.plot(x, np.cos(x), lw=3, label='Cosine') 1537 | 1538 | # Set up grid, legend, and limits 1539 | ax.grid(True) 1540 | ax.legend(frameon=False) 1541 | ax.axis('equal') 1542 | ax.set_xlim(0, 3 * np.pi); 1543 | ``` 1544 | 1545 | First, it's more natural for this data to space the ticks and grid lines in 1546 | multiples of π. We can do this by setting a `MultipleLocator`, which locates 1547 | ticks at a multiple of the number you provide: 1548 | 1549 | ```python 1550 | ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2)) 1551 | ax.xaxis.set_minor_locator(plt.MultipleLocator(np.pi / 4)) 1552 | fig 1553 | ``` 1554 | 1555 | But now the tick labels look silly: they are multiples of π, but the decimal 1556 | representation doesn't immediately convey this. We can change the tick 1557 | formatter to fix this. 
There's no built-in formatter to do this, so we'll 1558 | use `plt.FuncFormatter` which accepts a user-defined function giving control 1559 | over the tick outputs: 1560 | 1561 | ```python 1562 | def format_func(value, tick_number): 1563 | # find number of multiples of pi/2 1564 | N = int(np.round(2 * value / np.pi)) 1565 | if N == 0: 1566 | return "0" 1567 | elif N == 1: 1568 | return r"$\pi/2$" 1569 | elif N == 2: 1570 | return r"$\pi$" 1571 | elif N % 2 > 0: 1572 | return r"${0}\pi/2$".format(N) 1573 | else: 1574 | return r"${0}\pi$".format(N // 2) 1575 | 1576 | ax.xaxis.set_major_formatter(plt.FuncFormatter(format_func)) 1577 | fig 1578 | ``` 1579 | 1580 | By enclosing the string in dollar signs, this enables LaTeX support. 1581 | 1582 | ## Customising Matplotlib: configurations and stylesheets 1583 | 1584 | ### Plot customisation by hand 1585 | 1586 | It is possible to customise plot settings to end up with something nicer than 1587 | the default on an individual basis. 1588 | 1589 | This is a drab default histogram: 1590 | 1591 | ```python 1592 | import matplotlib.pyplot as plt 1593 | plt.style.use('classic') 1594 | import numpy as np 1595 | 1596 | %matplotlib inline 1597 | 1598 | x = np.random.randn(1000) 1599 | plt.hist(x); 1600 | ``` 1601 | 1602 | that can be adjusted to make it more visually pleasing: 1603 | 1604 | ```python 1605 | # use a gray background 1606 | ax = plt.axes(axisbg='#E6E6E6') 1607 | ax.set_axisbelow(True) 1608 | 1609 | # draw solid white grid lines 1610 | plt.grid(color='w', linestyle='solid') 1611 | 1612 | # hide axis spines 1613 | for spine in ax.spines.values(): 1614 | spine.set_visible(False) 1615 | 1616 | # hide top and right ticks 1617 | ax.xaxis.tick_bottom() 1618 | ax.yaxis.tick_left() 1619 | 1620 | # lighten ticks and labels 1621 | ax.tick_params(colors='gray', direction='out') 1622 | for tick in ax.get_xticklabels(): 1623 | tick.set_color('gray') 1624 | for tick in ax.get_yticklabels(): 1625 | tick.set_color('gray') 1626 | 1627 | # control face and edge color of histogram 1628 | ax.hist(x, edgecolor='#E6E6E6', color='#EE6666'); 1629 | ``` 1630 | 1631 | But this took a lot of effort, and would be repetitive to do for each plot. 1632 | Fortunately, there is a way to adjust these defaults once only. 1633 | 1634 | ### Changing the defaults: `rcParams` 1635 | 1636 | Each time Matplotlib loads, it defines a runtime configuration (rc) containing 1637 | the default styles for every plot element you create. This configuration can 1638 | be adjusted at any time using `plt.rc()`. 1639 | 1640 | Here we will modify the rc parameters to make our default plot look similar to 1641 | the improved version. 
1642 | 1643 | First, we save a copy of the current `rcParams` dictionary to easily reset 1644 | these changes in the current session: 1645 | 1646 | ```python 1647 | IPython_default = plt.rcParams.copy() 1648 | ``` 1649 | 1650 | Now we use `plt.rc()` to change some of these settings: 1651 | 1652 | ```python 1653 | from matplotlib import cycler 1654 | colors = cycler('color', 1655 | ['#EE6666', '#3388BB', '#9988DD', 1656 | '#EECC55', '#88BB44', '#FFBBBB']) 1657 | plt.rc('axes', facecolor='#E6E6E6', edgecolor='none', 1658 | axisbelow=True, grid=True, prop_cycle=colors) 1659 | plt.rc('grid', color='w', linestyle='solid') 1660 | plt.rc('xtick', direction='out', color='gray') 1661 | plt.rc('ytick', direction='out', color='gray') 1662 | plt.rc('patch', edgecolor='#E6E6E6') 1663 | plt.rc('lines', linewidth=2) 1664 | ``` 1665 | 1666 | With these settings defined, we can create a plot to see these settings 1667 | applied: 1668 | 1669 | ```python 1670 | plt.hist(x); 1671 | ``` 1672 | 1673 | Simple plots also can look nice with these same rc parameters: 1674 | 1675 | ```python 1676 | for i in range(4): 1677 | plt.plot(np.random.rand(10)) 1678 | ``` 1679 | 1680 | Settings can be saved in a *.matplotlibrc* file. 1681 | 1682 | ### Stylesheets 1683 | 1684 | Stylesheets are another way to customise Matplotlib. These let you create and 1685 | package your own styles, as well as use some built-in defaults. They are 1686 | formatted similarly to *.matplotlibrc* files but must have a *.mplstyle* 1687 | extension. 1688 | 1689 | Available styles can be listed: 1690 | 1691 | ```python 1692 | plt.style.available 1693 | ``` 1694 | 1695 | The way to switch to a stylesheet is: 1696 | 1697 | ```python 1698 | plt.style.use('stylename') 1699 | ``` 1700 | 1701 | but this will change the style for the rest of the session. Alternatively, 1702 | a style context manager is available to set the style temporarily: 1703 | 1704 | ```python 1705 | with plt.style.context('stylename'): 1706 | make_a_plot() 1707 | ``` 1708 | 1709 | Here is a function that makes a histogram and line plot to show the effects 1710 | of stylesheets: 1711 | 1712 | ```python 1713 | def hist_and_lines(): 1714 | np.random.seed(0) 1715 | fig, ax = plt.subplots(1, 2, figsize=(11, 4)) 1716 | ax[0].hist(np.random.randn(1000)) 1717 | for i in range(3): 1718 | ax[1].plot(np.random.rand(10)) 1719 | ax[1].legend(['a', 'b', 'c'], loc='lower left') 1720 | ``` 1721 | 1722 | First, reset the runtime configuration: 1723 | 1724 | ```python 1725 | # reset rcParams 1726 | plt.rcParams.update(IPython_default); 1727 | ``` 1728 | 1729 | Now see how the plots look with the default styling: 1730 | 1731 | ```python 1732 | hist_and_lines() 1733 | ``` 1734 | 1735 | and use the context manager to apply another style: 1736 | 1737 | ```python 1738 | with plt.style.context('ggplot'): 1739 | hist_and_lines() 1740 | ``` 1741 | 1742 | #### Seaborn style 1743 | 1744 | Matplotlib also has stylesheets inspired by Seaborn. These styles are loaded 1745 | automatically when Seaborn is imported: 1746 | 1747 | ```python 1748 | import seaborn 1749 | hist_and_lines() 1750 | ``` 1751 | 1752 | ## 3D plotting in Matplotlib 1753 | 1754 | Matplotlib originally was designed for 2D plotting, then 3D plotting 1755 | tools were added later. 
3D plots are enabled by importing `mplot3d`: 1756 | 1757 | ```python 1758 | from mpl_toolkits import mplot3d 1759 | ``` 1760 | 1761 | Once this is imported, a 3D axes can be created by passing the keyword 1762 | `projection='3d'` to any of the normal axes creation routines: 1763 | 1764 | ```python 1765 | %matplotlib inline 1766 | import numpy as np 1767 | import matplotlib.pyplot as plt 1768 | 1769 | fig = plt.figure() 1770 | ax = plt.axes(projection='3d') 1771 | ``` 1772 | 1773 | With this 3D axes enabled, we can plot a variety of 3D plot types. 3D 1774 | plotting benefits from viewing figures interactively in a notebook: use 1775 | `%matplotlib notebook` instead of `%matplotlib inline` to do this. 1776 | 1777 | ### 3D points and lines 1778 | 1779 | The most basic 3D plot is a line or collection of scatter plots created 1780 | from sets of (x, y, z) triples. These can be created using `ax.plot3D()` 1781 | and `ax.scatter3D()`. The call signature for these is almost identical 1782 | to that of their 2D counterparts. 1783 | 1784 | This example plots a trigonometric spiral, along with some points drawn 1785 | randomly near the line: 1786 | 1787 | ```python 1788 | ax = plt.axes(projection='3d') 1789 | 1790 | # Data for a three-dimensional line 1791 | zline = np.linspace(0, 15, 1000) 1792 | xline = np.sin(zline) 1793 | yline = np.cos(zline) 1794 | ax.plot3D(xline, yline, zline, 'gray') 1795 | 1796 | # Data for three-dimensional scattered points 1797 | zdata = 15 * np.random.random(100) 1798 | xdata = np.sin(zdata) + 0.1 * np.random.randn(100) 1799 | ydata = np.cos(zdata) + 0.1 * np.random.randn(100) 1800 | ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens'); 1801 | ``` 1802 | 1803 | By default, the scatter points have their transparency adjusted to give 1804 | a sense of depth on the page. 1805 | 1806 | ### 3D contour plots 1807 | 1808 | `mplot3d` contains tools to create 3D relief plots. Like the 2D 1809 | `ax.contour` plots, `ax.contour3D` requires all the input data to be in 1810 | the form of 2D regular grids, with the Z data evaluated at each point. 1811 | 1812 | This is a 3D contour diagram of a 3D sinusoidal function: 1813 | 1814 | ```python 1815 | def f(x, y): 1816 | return np.sin(np.sqrt(x ** 2 + y ** 2)) 1817 | 1818 | x = np.linspace(-6, 6, 30) 1819 | y = np.linspace(-6, 6, 30) 1820 | 1821 | X, Y = np.meshgrid(x, y) 1822 | Z = f(X, Y) 1823 | 1824 | fig = plt.figure() 1825 | ax = plt.axes(projection='3d') 1826 | ax.contour3D(X, Y, Z, 50, cmap='binary') 1827 | ax.set_xlabel('x') 1828 | ax.set_ylabel('y') 1829 | ax.set_zlabel('z'); 1830 | ``` 1831 | 1832 | Sometimes the default viewing angle is not optimal. We can use the 1833 | `view_init` method to set the elevation and azimuthal angles. Here, we 1834 | set the elevation to 60 degrees (60 degrees above the xy-plane) and an 1835 | azimuth of 35 degrees (rotated 35 degrees anticlockwise about the 1836 | z-axis). 1837 | 1838 | ```python 1839 | ax.view_init(60, 35) 1840 | fig 1841 | ``` 1842 | 1843 | Of course, with an interactive plot, this view adjustment can be carried 1844 | out by the user. 1845 | 1846 | ### Wireframes and surface plots 1847 | 1848 | Two other types of 3D plots that work on gridded data are wireframes and 1849 | surface plots. These take a grid of values and project it onto the 1850 | specified 3D surface, making the resulting 3D forms easy to visualise. 
1851 | 1852 | Here is an example: 1853 | 1854 | ```python 1855 | fig = plt.figure() 1856 | ax = plt.axes(projection='3d') 1857 | ax.plot_wireframe(X, Y, Z, color='black') 1858 | ax.set_title('wireframe'); 1859 | ``` 1860 | 1861 | A surface plot is like a wireframe plot, but each face of the wireframe 1862 | is a filled polygon. Adding a colormap to the filled polygons can help 1863 | make the surface topology clearer: 1864 | 1865 | ```python 1866 | ax = plt.axes(projection='3d') 1867 | ax.plot_surface(X, Y, Z, rstride=1, cstride=1, 1868 | cmap='viridis', edgecolor='none') 1869 | ax.set_title('surface'); 1870 | ``` 1871 | 1872 | The grid of values for a surface plot needs to be 2D, but not 1873 | necessarily rectilinear. Here is an example of creating a partial polar 1874 | grid: 1875 | 1876 | ```python 1877 | r = np.linspace(0, 6, 20) 1878 | theta = np.linspace(-0.9 * np.pi, 0.8 * np.pi, 40) 1879 | r, theta = np.meshgrid(r, theta) 1880 | 1881 | X = r * np.sin(theta) 1882 | Y = r * np.cos(theta) 1883 | Z = f(X, Y) 1884 | 1885 | ax = plt.axes(projection='3d') 1886 | ax.plot_surface(X, Y, Z, rstride=1, cstride=1, 1887 | cmap='viridis', edgecolor='none'); 1888 | ``` 1889 | 1890 | ### Surface triangulations 1891 | 1892 | For some applications, evenly sampled grids as required by the above 1893 | routines is restrictive. In these cases, triangulation-based plots can 1894 | be useful. What if, rather than an even draw from a Cartesian or a polar 1895 | grid, we instead have a set of random draws? 1896 | 1897 | ```python 1898 | theta = 2 * np.pi * np.random.random(1000) 1899 | r = 6 * np.random.random(1000) 1900 | x = np.ravel(r * np.sin(theta)) 1901 | y = np.ravel(r * np.cos(theta)) 1902 | z = f(x, y) 1903 | ``` 1904 | 1905 | We could create a scatter plot to get an idea of the surface we're 1906 | sampling from: 1907 | 1908 | ```python 1909 | ax = plt.axes(projection='3d') 1910 | ax.scatter(x, y, z, c=z, cmap='viridis', linewidth=0.5); 1911 | ``` 1912 | 1913 | This leaves much to be desired. The function that will help us is 1914 | `ax.plot_trisurf` which creates a surface by first finding a set of 1915 | triangles formed between adjacent points: 1916 | 1917 | ```python 1918 | ax = plt.axes(projection='3d') 1919 | ax.plot_trisurf(x, y, z, 1920 | cmap='viridis', edgecolor='none'); 1921 | ``` 1922 | 1923 | This is not as clean as when plotted with a grid, but allows for 1924 | interesting 3D plots. 1925 | 1926 | ## Geographic data with Basemap 1927 | 1928 | Geographic data is often visualised. Matplotlib's main tool for this 1929 | type of visualisation is the Basemap toolkit, one of several Matplotlib 1930 | toolkits that lives under the `mpl_toolkits` namespace. Basemap can be 1931 | clunky to use and often even simple visualisations can take longer to 1932 | render than is desirable. More modern visualisations, such as Leaflet or 1933 | the Google Maps API, may be a better choice for more intensive map 1934 | visualisations. Basemap is, however, still a useful tool to be aware of. 1935 | 1936 | It requires separate installation, e.g. with `pip` or `conda install basemap`. 
1937 | 1938 | We add the Basemap import: 1939 | 1940 | ```python 1941 | %matplotlib inline 1942 | import numpy as np 1943 | import matplotlib.pyplot as plt 1944 | from mpl_toolkits.basemap import Basemap 1945 | ``` 1946 | 1947 | Geographic plots are just a few lines away (the graphics also require 1948 | the `PIL` package in Python 2, or the `pillow` package in Python 3): 1949 | 1950 | ```python 1951 | plt.figure(figsize=(8, 8)) 1952 | m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100) 1953 | m.bluemarble(scale=0.5); 1954 | ``` 1955 | 1956 | The globe shown is not an image: it is a Matplotlib axes that 1957 | understands spherical coordinates and allows overplotting data on the 1958 | map. 1959 | 1960 | Here a different map projection is used, zoomed in to North America and 1961 | the location of Seattle plotted: 1962 | 1963 | ```python 1964 | fig = plt.figure(figsize=(8, 8)) 1965 | m = Basemap(projection='lcc', resolution=None, 1966 | width=8E6, height=8E6, 1967 | lat_0=45, lon_0=-100,) 1968 | m.etopo(scale=0.5, alpha=0.5) 1969 | 1970 | # Map (long, lat) to (x, y) for plotting 1971 | x, y = m(-122.3, 47.6) 1972 | plt.plot(x, y, 'ok', markersize=5) 1973 | plt.text(x, y, ' Seattle', fontsize=12); 1974 | ``` 1975 | 1976 | ### Map projections 1977 | 1978 | The first thing to decide when using maps is the projection to use. It is 1979 | impossible to project a spherical map onto a flat surface without some 1980 | distortion or breaking its continuity. There are lots of choices of projection. 1981 | Depending on the intended use of the map projection, there are certain map 1982 | features (e.g. direction, area, distance, shape) that are useful to maintain. 1983 | 1984 | The Basemap package implements many projections, referenced by a short format 1985 | code. 1986 | 1987 | This is a convenience function to draw the world map along with latitude and 1988 | longitude lines: 1989 | 1990 | ```python 1991 | from itertools import chain 1992 | 1993 | def draw_map(m, scale=0.2): 1994 | # draw a shaded-relief image 1995 | m.shadedrelief(scale=scale) 1996 | 1997 | # lats and longs are returned as a dictionary 1998 | lats = m.drawparallels(np.linspace(-90, 90, 13)) 1999 | lons = m.drawmeridians(np.linspace(-180, 180, 13)) 2000 | 2001 | # keys contain the plt.Line2D instances 2002 | lat_lines = chain(*(tup[1][0] for tup in lats.items())) 2003 | lon_lines = chain(*(tup[1][0] for tup in lons.items())) 2004 | all_lines = chain(lat_lines, lon_lines) 2005 | 2006 | # cycle through these lines and set the desired style 2007 | for line in all_lines: 2008 | line.set(linestyle='-', alpha=0.3, color='w') 2009 | ``` 2010 | 2011 | #### Cylindrical projections 2012 | 2013 | Cylindrical projections are the simplest map projections. Lines of constant 2014 | latitude and longitude are mapped to horizontal and vertical lines. This type 2015 | of mapping represents equatorial regions well, but results in extreme 2016 | distortion near the poles. The spacing of latitude lines varies between 2017 | different cylindrical projections, leading to different conservation properties 2018 | and different distortion near the poles. 2019 | 2020 | The following code generates an example of the *equidistant cylindrical 2021 | projection* which chooses a latitude scaling that preserves distance along 2022 | meridians. Other cylindrical projections are the Mercator (`projection='merc'`) 2023 | and the cylindrical equal area (`projection='cea'`). 
2024 | 2025 | ```python 2026 | fig = plt.figure(figsize=(8, 6), edgecolor='w') 2027 | m = Basemap(projection='cyl', resolution=None, 2028 | llcrnrlat=-90, urcrnrlat=90, 2029 | llcrnrlon=-180, urcrnrlon=180, ) 2030 | draw_map(m) 2031 | ``` 2032 | 2033 | `llcrnlat` and `urcrnlat` set the lower-left corner and upper-right corner 2034 | latitude for the map (and the `lon` equivalents set the longitude). 2035 | 2036 | #### Pseudo-cylindrical projections 2037 | 2038 | Pseudo-cylindrical projections relax the requirement that meridians (lines 2039 | of constant longitude) remain vertical; this can give better properties near 2040 | the poles of the projection. The Mollweide projection (`projection='moll'`) is 2041 | one example of this, in which all meridians are elliptical arcs. It is 2042 | constructed so as to preserve area across the map: though there are distortions 2043 | near the poles, the area of small patches reflects the true area. Other 2044 | pseudo-cylindrical projections are sinusoidal (`projection='sinu'`) and 2045 | Robinson (`projection='robin'`). 2046 | 2047 | ```python 2048 | fig = plt.figure(figsize=(8, 6), edgecolor='w') 2049 | m = Basemap(projection='moll', resolution=None, 2050 | lat_0=0, lon_0=0) 2051 | draw_map(m) 2052 | ``` 2053 | 2054 | #### Perspective projections 2055 | 2056 | Perspective projections are constructed using a particular choice of 2057 | perspective point, similar to if you photographed the Earth from a particular 2058 | point in space (a point which, for some projections, technically lies within 2059 | the Earth). One common example is the orthographic projection 2060 | (`projection='ortho'`) which shows one side of the globe as seen from a viewer 2061 | at a very long distance; it therefore can only show half the globe at a time. 2062 | Other perspective-based projections include the gnomonic projection 2063 | (`projection='gnom'`) and stereographic projection (`projection='stere'`). 2064 | These are often the most useful for showing small portions of the map. 2065 | 2066 | ```python 2067 | fig = plt.figure(figsize=(8, 8)) 2068 | m = Basemap(projection='ortho', resolution=None, 2069 | lat_0=50, lon_0=0) 2070 | draw_map(m); 2071 | ``` 2072 | 2073 | #### Conic projections 2074 | 2075 | A conic projection projects the map onto a single cone, which is then unrolled. 2076 | This can lead to good local properties, but regions far from the focus point of 2077 | the cone may become distorted. One example is the Lambert Conformal Conic 2078 | projection (`projection='lcc'`): it projects the map onto a cone arranged in 2079 | such a way that two standard parallels (specified in Basemap by `lat_1` and 2080 | `lat_2`) have well-represented distances, with scale decreasing between them 2081 | and increasing outside of them. Other useful conic projections are the 2082 | equidistant conic projection (`projection='eqdc'`) and the Albers equal-area 2083 | projection (`projection='aea'`). Conic projections, like perspective 2084 | projections, tend to be good choices for representing small to medium patches 2085 | of the globe. 2086 | 2087 | ```python 2088 | fig = plt.figure(figsize=(8, 8)) 2089 | m = Basemap(projection='lcc', resolution=None, 2090 | lon_0=0, lat_0=50, lat_1=45, lat_2=55, 2091 | width=1.6E7, height=1.2E7) 2092 | draw_map(m) 2093 | ``` 2094 | 2095 | ### Drawing a map background 2096 | 2097 | The Basemap package contains a range of functions for drawing borders of 2098 | physical features, as well as political boundaries. 
2099 | 2100 | #### Physical boundaries and bodies of water 2101 | 2102 | `drawcoastlines()`: Draw continental coast lines 2103 | `drawlsmask()`: Draw a mask between the land and sea, for use with projecting images on one or the other 2104 | `drawmapboundary()`: Draw the map boundary, including the fill color for oceans. 2105 | `drawrivers()`: Draw rivers on the map 2106 | `fillcontinents()`: Fill the continents with a given color; optionally fill lakes with another color 2107 | 2108 | #### Political boundaries 2109 | 2110 | `drawcountries()`: Draw country boundaries 2111 | `drawstates()`: Draw US state boundaries 2112 | `drawcounties()`: Draw US county boundaries 2113 | 2114 | #### Map features 2115 | 2116 | `drawgreatcircle()`: Draw a great circle between two points 2117 | `drawparallels()`: Draw lines of constant latitude 2118 | `drawmeridians()`: Draw lines of constant longitude 2119 | `drawmapscale()`: Draw a linear scale on the map 2120 | 2121 | #### Whole-globe images 2122 | 2123 | `bluemarble()`: Project NASA's blue marble image onto the map 2124 | `shadedrelief()`: Project a shaded relief image onto the map 2125 | `etopo()`: Draw an etopo relief image onto the map 2126 | `warpimage()`: Project a user-provided image onto the map 2127 | 2128 | #### Resolution and boundary-based features 2129 | 2130 | For the boundary-based features, you must set the desired resolution when 2131 | creating a Basemap image. The `resolution` argument of the `Basemap` class sets 2132 | the level of details in boundaries, either `'c'` (crude), `'l'` (low), 2133 | `'i'` (intermediate), `'h'` (high), `'f'` (full), or `None` if no boundaries 2134 | will be used. This choice is important: setting high-resolution boundaries on 2135 | a global map can be slow. 2136 | 2137 | This is an example drawing land/sea boundaries, and shows the effect of the 2138 | resolution parameter, creating a low- and high-resolution map of the Isle of 2139 | Skye: 2140 | 2141 | ```python 2142 | fig, ax = plt.subplots(1, 2, figsize=(12, 8)) 2143 | 2144 | for i, res in enumerate(['l', 'h']): 2145 | m = Basemap(projection='gnom', lat_0=57.3, lon_0=-6.2, 2146 | width=90000, height=120000, resolution=res, ax=ax[i]) 2147 | m.fillcontinents(color="#FFDDCC", lake_color='#DDEEFF') 2148 | m.drawmapboundary(fill_color="#DDEEFF") 2149 | m.drawcoastlines() 2150 | ax[i].set_title("resolution='{0}'".format(res)); 2151 | ``` 2152 | 2153 | Low-resolution coastlines are not suitable for this zoom level, but a low 2154 | resolution would suit a global view, and be much faster than loading 2155 | high-resolution border data for the globe. The best approach is to start with 2156 | a fast, low-resolution plot and increase the resolution as needed. 2157 | 2158 | ### Plotting data on maps 2159 | 2160 | Basemap allows overplotting a variety of data onto a map background. For simple 2161 | plotting and text, any `plt` function works on the map; you can use the 2162 | `Basemap` instance to project latitude and longitude coordinates to `(x, y)` 2163 | coordinates for plotting with `plt`. 2164 | 2165 | In addition, there are many map-specific functions available as methods of the 2166 | `Basemap` instance. These work very similarly to their standard Matplotlib 2167 | counterparts, but have an additional Boolean argument `latlon`, which if set 2168 | to `True` allows you to pass raw latitudes and longitudes to the method, 2169 | instead of projected `(x, y)` coordinates. 
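For example, a minimal sketch of using `latlon=True` to pass raw coordinates
directly (the city locations below are just illustrative values):

```python
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='cyl', resolution=None)
m.shadedrelief(scale=0.2)

# Approximate (longitude, latitude) of Seattle and London.
lons = [-122.3, -0.1]
lats = [47.6, 51.5]

# latlon=True: pass longitudes/latitudes rather than projected (x, y).
m.scatter(lons, lats, latlon=True, c='red', s=40);
```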
2170 | 2171 | Some of these map-specific methods are: 2172 | 2173 | * `contour()`/`contourf()` : Draw contour lines or filled contours 2174 | * `imshow()`: Draw an image 2175 | * `pcolor()`/`pcolormesh()` : Draw a pseudocolor plot for irregular/regular meshes 2176 | * `plot()`: Draw lines and/or markers. 2177 | * `scatter()`: Draw points with markers. 2178 | * `quiver()`: Draw vectors. 2179 | * `barbs()`: Draw wind barbs. 2180 | * `drawgreatcircle()`: Draw a great circle. 2181 | 2182 | ## Visualisation with Seaborn 2183 | 2184 | Matplotlib is a useful tool, but it leaves much to be desired: 2185 | 2186 | * the defaults are not the best choices. It was based off MATLAB circa 2187 | 1999, and this often shows. 2188 | * Matplotlib's API is quite low level. Sophisticated visualisation is 2189 | possible, but often requires a lot of boilerplate code. 2190 | * Matplotlib predated pandas, and is not designed for use with 2191 | `DataFrame`s. To visualise data from a `DataFrame`, you must extract 2192 | each `Series` and often concatenate them together in the right format. 2193 | 2194 | Matplotlib is addressing this, with the addition of `plt.style` and is 2195 | starting to handle pandas data more seamlessly, and with a new default 2196 | stylesheet in Matplotlib 2.0. 2197 | 2198 | However, the Seaborn package also answers these problems, providing an 2199 | API on top of Matplotlib that offers good choices for plot style and 2200 | colour defaults, defines simple high-level functions for common 2201 | statistical plot types, and integrates with `DataFrame`s. 2202 | 2203 | ### Seaborn versus Matplotlib 2204 | 2205 | A random walk plot: 2206 | 2207 | ```python 2208 | import matplotlib.pyplot as plt 2209 | plt.style.use('classic') 2210 | %matplotlib inline 2211 | import numpy as np 2212 | import pandas as pd 2213 | 2214 | # Create some data 2215 | rng = np.random.RandomState(0) 2216 | x = np.linspace(0, 10, 500) 2217 | y = np.cumsum(rng.randn(500, 6), 0) 2218 | 2219 | # Plot the data with Matplotlib defaults 2220 | plt.plot(x, y) 2221 | plt.legend('ABCDEF', ncol=2, loc='upper left'); 2222 | ``` 2223 | 2224 | This plot will contain all the information we want to convey, but not in 2225 | an aesthetically pleasing way. 2226 | 2227 | Seaborn has many of its own high-level plotting routines, but can also 2228 | overwrite Matplotlib's default parameters and get simple Matplotlib 2229 | scripts to produce superior output. We can set the style by calling 2230 | Seaborn's `set()` function. 2231 | 2232 | ```python 2233 | import seaborn as sns 2234 | sns.set() 2235 | 2236 | # same plotting code as above! 2237 | plt.plot(x, y) 2238 | plt.legend('ABCDEF', ncol=2, loc='upper left'); 2239 | ``` 2240 | 2241 | ### Exploring Seaborn plots 2242 | 2243 | Seaborn provides high-level commands to create a variety of plot types 2244 | useful for data exploration, and some even for statistical model 2245 | fitting. 2246 | 2247 | Note that the following data plots could be created using Matplotlib, 2248 | but Seaborn makes creating these much simpler. 
2249 | 
2250 | #### Histograms, KDE and densities
2251 | 
2252 | Plotting histograms is simple in Matplotlib:
2253 | 
2254 | ```python
2255 | data = np.random.multivariate_normal([0, 0], [[5, 2], [2, 2]],
2256 |                                      size=2000)
2257 | data = pd.DataFrame(data, columns=['x', 'y'])
2258 | 
2259 | for col in 'xy':
2260 |     plt.hist(data[col], normed=True, alpha=0.5)
2261 | ```
2262 | 
2263 | Rather than a histogram, we can get a smooth estimate of the
2264 | distribution using a kernel density estimation, which Seaborn does with
2265 | `sns.kdeplot`:
2266 | 
2267 | ```python
2268 | for col in 'xy':
2269 |     sns.kdeplot(data[col], shade=True)
2270 | ```
2271 | 
2272 | Histograms and KDE can be combined using `distplot`:
2273 | 
2274 | ```python
2275 | sns.distplot(data['x'])
2276 | sns.distplot(data['y']);
2277 | ```
2278 | 
2279 | If we pass the full two dimensional dataset to `kdeplot`, we get a 2D
2280 | visualisation of the data.
2281 | 
2282 | ```python
2283 | sns.kdeplot(data);
2284 | ```
2285 | 
2286 | We can see the joint distribution and the marginal distributions
2287 | together using `sns.jointplot`.
2288 | 
2289 | ```python
2290 | with sns.axes_style('white'):  # Set white background.
2291 |     sns.jointplot("x", "y", data, kind='kde');
2292 | ```
2293 | 
2294 | `jointplot` takes other parameters. For instance, we can create a hexagonal
2295 | histogram too:
2296 | 
2297 | ```python
2298 | with sns.axes_style('white'):
2299 |     sns.jointplot("x", "y", data, kind='hex');
2300 | ```
2301 | 
2302 | #### Pair plots
2303 | 
2304 | When you generalise joint plots to datasets of larger dimensions, you
2305 | end up with *pair plots*. This is useful for exploring correlations
2306 | between multidimensional data, when you'd like to plot all pairs of
2307 | values against each other.
2308 | 
2309 | Using the iris dataset as an example:
2310 | 
2311 | ```python
2312 | iris = sns.load_dataset("iris")
2313 | ```
2314 | 
2315 | Visualising the multidimensional relationships among the samples is as
2316 | easy as calling `sns.pairplot`:
2317 | 
2318 | ```python
2319 | sns.pairplot(iris, hue='species', size=2.5);
2320 | ```
2321 | 
2322 | #### Faceted histograms
2323 | 
2324 | Sometimes the best way to view data is via histograms of subsets.
2325 | Seaborn's `FacetGrid` makes this simple. As an example, tip data for
2326 | restaurant staff will be examined:
2327 | 
2328 | ```python
2329 | tips = sns.load_dataset('tips')
2330 | 
2331 | tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']
2332 | 
2333 | grid = sns.FacetGrid(tips, row="sex", col="time", margin_titles=True)
2334 | grid.map(plt.hist, "tip_pct", bins=np.linspace(0, 40, 15));
2335 | ```
2336 | 
2337 | #### Factor plots
2338 | 
2339 | Factor plots let you view the distribution of a parameter within bins
2340 | defined by any other parameter.
2341 | 2342 | ```python 2343 | with sns.axes_style(style='ticks'): 2344 | g = sns.factorplot("day", "total_bill", "sex", data=tips, kind="box") 2345 | g.set_axis_labels("Day", "Total Bill"); 2346 | ``` 2347 | 2348 | #### Joint distributions 2349 | 2350 | Similar to the pairplot, we can use `sns.jointplot` to show the joint 2351 | distribution between different datasets, along with the associated 2352 | marginal distributions: 2353 | 2354 | ```python 2355 | with sns.axes_style('white'): 2356 | sns.jointplot("total_bill", "tip", data=tips, kind='hex') 2357 | ``` 2358 | 2359 | The joint plot can even do some automatic kernel density estimation and 2360 | regression: 2361 | 2362 | ```python 2363 | sns.jointplot("total_bill", "tip", data=tips, kind='reg'); 2364 | ``` 2365 | 2366 | #### Bar plots 2367 | 2368 | Time series can be plotted using `sns.factorplot`. Using the planets 2369 | data: 2370 | 2371 | ```python 2372 | planets = sns.load_dataset('planets') 2373 | 2374 | with sns.axes_style('white'): 2375 | g = sns.factorplot("year", data=planets, aspect=2, 2376 | kind="count", color='steelblue') 2377 | g.set_xticklabels(step=5) 2378 | ``` 2379 | 2380 | We can learn more by looking at the method of discovery: 2381 | 2382 | ```python 2383 | with sns.axes_style('white'): 2384 | g = sns.factorplot("year", data=planets, aspect=4.0, kind='count', 2385 | hue='method', order=range(2001, 2015)) 2386 | g.set_ylabels('Number of Planets Discovered') 2387 | ``` 2388 | 2389 | #### Other notes 2390 | 2391 | Seaborn can do regression too, using e.g. `regplot()` or `lmplot()`. 2392 | -------------------------------------------------------------------------------- /Chapter_5_Machine_Learning.md: -------------------------------------------------------------------------------- 1 | # Notes on machine learning 2 | 3 | ## What is machine learning? 4 | 5 | Machine learning is often categorised as a subfield of artificial 6 | intelligence, but can be more helpful to think of machine learning as a 7 | means of *building models of data*. 8 | 9 | Involves building mathematical models to help understand data. 10 | "Learning" aspect is when we give these models tunable parameters that 11 | can adapt to observed data: the program "learns" from the data. Once 12 | these models have been fit to previously seen data, can be used to 13 | predict and understand aspects of newly observed data. 14 | 15 | ### Categories of machine learning 16 | 17 | Two main types: supervised and unsupervised. 18 | 19 | *Supervised learning* involves modelling the relationship between 20 | measured features of data and some label associated with the data. Once 21 | this model is determined, can be used to apply labels to new, unknown 22 | data. Further divide supervised learning into *classification*, where 23 | labels are discrete categories, and *regression*, where labels are 24 | continuous quantities. 25 | 26 | *Unsupervised learning* involves modelling the features of a dataset 27 | without reference to any label; "letting the dataset speak for itself". 28 | Models include tasks such as *clustering*, identifying distinct groups 29 | of data and *dimensionality reduction*, searching for more succinct 30 | representations of the data. 31 | 32 | Also, *semi-supervised learning* methods, which fall between supervised 33 | and unsupervised learning. Can be useful if only incomplete labels are 34 | available. 35 | 36 | ### Examples of machine learning applications 37 | 38 | #### Classification 39 | 40 | E.g. 
you are given a set of labelled points in 2D and want to use these
41 | to classify some unlabelled points; each of these points has one of two
42 | labels.
43 | 
44 | Have 2D data: two features for each point, represented by (x, y)
45 | positions of points on a plane. Also have one of two *class labels* for
46 | each point. From those features and labels, want to create a model that
47 | decides how a newly seen point is labelled.
48 | 
49 | A simple model might be to draw a line that separates the two groups of
50 | points. The *model* is a quantitative version of the statement "a
51 | straight line separates the classes", while the *model parameters* are
52 | the particular numbers describing the location and orientation of that
53 | line for our data. The optimal values for the model parameters are
54 | learned from the data, which is often called *training the model*.
55 | 
56 | When a model is trained, can be generalised to new, unlabelled data. Can
57 | take a new set of data, draw the same model line through it and assign
58 | labels to the points based on this model. This stage is usually called
59 | *prediction*.
60 | 
61 | This is the basic idea of classification, where "classification"
62 | indicates the data has discrete class labels. May look trivial in a 2D
63 | case, but the machine learning approach can generalise to much larger
64 | datasets in many more dimensions.
65 | 
66 | This is similar to the task of automated spam detection for email. Might
67 | use "spam" or "not spam" as labels, and normalised counts of important
68 | words or phrases ("Viagra") as features. For the training set, these
69 | labels might be determined by individual inspection of a small sample of
70 | emails; for the remaining emails, the label would be determined using
71 | the model. For a suitably trained classification algorithm with enough
72 | well-constructed features (typically thousands or millions of words or
73 | phrases), this can be an effective approach.
74 | 
75 | #### Regression: predicting continuous labels
76 | 
77 | Might instead have two features and a continuous label rather than a
78 | discrete one.
79 | 
80 | Might use a number of regression models, but can use linear regression,
81 | using the label as a third dimension and fitting a plane to the data.
82 | 
83 | This is similar to computing the distance to galaxies observed through a
84 | telescope. Might use brightness of each galaxy at one of several
85 | wavelengths as features, and distance or redshift of the galaxy as a
86 | label. Distances for a small number of galaxies might be determined
87 | through an independent set of observations, then a regression model used
88 | to estimate this for other galaxies.
89 | 
90 | #### Clustering: inferring labels on unlabelled data
91 | 
92 | Clustering is a common unsupervised learning task. Data is automatically
93 | assigned to some number of discrete groups. Clustering models use the
94 | intrinsic structure of the data to determine which points are related.
95 | 
96 | #### Dimensionality reduction: inferring structure of unlabelled data
97 | 
98 | Seeks to pull out some low-dimensionality representation of data that in
99 | some way preserves relevant qualities of the full dataset. For instance,
100 | might have data with two features arranged in a spiral in a 2D plane:
101 | could say that the data is intrinsically only 1D, although this 1D data
102 | is embedded in higher-dimensional space.
A suitable dimensionality 103 | reduction model would be sensitive to this nonlinear structure and pull 104 | out the lower dimensionality representation. 105 | 106 | Important for high dimensional cases; can't visualise large number of 107 | dimensions, so one way to make high dimensional data more manageable is 108 | to use dimensionality reduction. 109 | 110 | ## Introducing scikit-learn 111 | 112 | Several Python libraries provide implementations of machine learning 113 | algorithms. scikit-learn is one of the best known, providing efficient 114 | versions of a number of common algorithms. It is characterised by a 115 | clean, uniform and streamlined API, as well as by useful online 116 | documentation. A benefit of this uniformity is that once you understand 117 | the basic use and syntax of scikit-learn for one type of model, 118 | switching to a new model or algorithm is very straightforward. 119 | 120 | ### Data representation in scikit-learn 121 | 122 | Machine learning is about creating models from data; for that reason, 123 | start by discussing how data can be represented in order to be 124 | understood by the computer. Within scikit-learn, the best way to think 125 | about data is in terms of tables. 126 | 127 | #### Data as table 128 | 129 | A basic table is a 2D grid of data, where the rows represent individual 130 | elements of the dataset, and the columns represent attributes related to 131 | these elements. 132 | 133 | For example, the Iris dataset available as a `DataFrame` via Seaborn: 134 | 135 | ```python 136 | import seaborn as sns 137 | iris = sns.load_dataset('iris') 138 | iris.head() 139 | ``` 140 | 141 | ``` 142 | sepal_length sepal_width petal_length petal_width species 143 | 0 5.1 3.5 1.4 0.2 setosa 144 | 1 4.9 3.0 1.4 0.2 setosa 145 | 2 4.7 3.2 1.3 0.2 setosa 146 | 3 4.6 3.1 1.5 0.2 setosa 147 | 4 5.0 3.6 1.4 0.2 setosa 148 | ``` 149 | 150 | Each row of the data refers to a single observed flower, and the number 151 | of rows is the total number of flowers in the dataset. In general, we 152 | will refer to the rows of the matrix as *samples*, and the number of 153 | rows as `n_samples`. 154 | 155 | Likewise, each column refers to a particular quantitative piece of 156 | information that describes each sample. In general, we will refer to the 157 | columns of the matrix as *features* and the number of columns as 158 | `n_features`. 159 | 160 | ##### Features matrix 161 | 162 | The table layout makes clear that the information can be thought of as a 163 | 2D numerical array or matrix, which we will call the features matrix. By 164 | convention, the features matrix is often stored in a variable named `X`. 165 | The features matrix is assumed to be 2D, with shape `[n_samples, 166 | n_features]` and is most often contained in a NumPy array or a pandas 167 | `DataFrame`, though some scikit-learn models also accept SciPy sparse 168 | matrices. 169 | 170 | The samples (rows) refer to the individual objects described by the 171 | dataset. The features (columns) refer to the distinct observations that 172 | describe each sample in a quantitative manner. Features are generally 173 | real-valued, but may be Boolean or discrete-valued in some cases. 174 | 175 | ##### Target array 176 | 177 | In addition to the feature matrix `X`, we work with a *label* or 178 | *target* array, which by convention we call `y`. The target array is 179 | usually 1D, with length `n_samples` and generally contained in a NumPy 180 | array or pandas `Series`. 
The target array may have continuous numerical
181 | values or discrete classes/labels. While some scikit-learn estimators
182 | handle multiple target values in the form of a 2D `[n_samples,
183 | n_targets]` target array, here we will be working primarily with the
184 | common case of a 1D target array.
185 | 
186 | It can be unclear how the target array differs from the other feature
187 | columns. The distinguishing feature of the target array is that it is
188 | usually the quantity we want to *predict* from the data; in statistical
189 | terms, it is the dependent variable. E.g. for the iris data, we may wish
190 | to construct a model to predict the flower species from the other
191 | measurements; here, the `species` column would be considered the target
192 | array.
193 | 
194 | ### scikit-learn's Estimator API
195 | 
196 | scikit-learn's API is designed with the following principles in mind:
197 | 
198 | * *Consistency*: All objects share a common interface drawn from a
199 |   limited set of methods, with consistent documentation.
200 | * *Inspection*: All specified parameter values are exposed as public
201 |   attributes.
202 | * *Limited object hierarchy*: Only algorithms are represented by Python
203 |   classes; datasets are represented in standard formats (NumPy arrays,
204 |   pandas `DataFrames`, SciPy sparse matrices) and parameter names use
205 |   standard Python strings.
206 | * *Composition*: Many machine learning tasks can be expressed as a
207 |   sequence of more fundamental algorithms, and scikit-learn makes use of
208 |   this where possible.
209 | * *Sensible defaults*: When models require user-specified parameters,
210 |   the library defines an appropriate default value.
211 | 
212 | In practice, these principles make scikit-learn easy to use once the
213 | basics are understood. Every machine learning algorithm in
214 | scikit-learn is implemented using the Estimator API, providing a
215 | consistent interface for a wide range of machine learning applications.
216 | 
217 | #### Basics of the API
218 | 
219 | The steps for using the Estimator API are as follows:
220 | 
221 | 1. Choose a class of model by importing the appropriate estimator class
222 |    from scikit-learn.
223 | 2. Choose model hyperparameters by instantiating this class with desired
224 |    values.
225 | 3. Arrange data into a features matrix and target vector following the
226 |    discussion above.
227 | 4. Fit the model to your data by calling the ``fit()`` method of the
228 |    model instance.
229 | 5. Apply the model to new data:
230 |    - For supervised learning, often we predict labels for unknown data
231 |      using the ``predict()`` method.
232 |    - For unsupervised learning, we often transform or infer properties
233 |      of the data using the ``transform()`` or ``predict()`` method.
234 | 
235 | #### Supervised learning example: simple linear regression
236 | 
237 | Consider a simple linear regression: fitting a line to (x, y) data.
238 | 
239 | Use the following simple data:
240 | 
241 | ```python
242 | import matplotlib.pyplot as plt
243 | import numpy as np
244 | 
245 | rng = np.random.RandomState(42)
246 | x = 10 * rng.rand(50)
247 | y = 2 * x - 1 + rng.randn(50)
248 | plt.scatter(x, y);
249 | ```
250 | 
251 | ##### Choose a class of model
252 | 
253 | Every model in scikit-learn is represented by a class.
So, import the 254 | appropriate class: 255 | 256 | ```python 257 | from sklearn.linear_model import LinearRegression 258 | ``` 259 | 260 | ##### Choose model hyperparameters 261 | 262 | A *class* of model is not the same as an *instance* of a model. 263 | 264 | There are still options open to us once we choose a class. We might have 265 | to answer questions like: 266 | 267 | * Would we like to fit for the offset (i.e., *y*-intercept)? 268 | * Would we like the model to be normalized? 269 | * Would we like to preprocess our features to add model flexibility? 270 | * What degree of regularization would we like to use in our model? 271 | * How many model components would we like to use? 272 | 273 | These choices are often represented as *hyperparameters*; parameters 274 | that must be set before the model is fit to data. In scikit-learn, 275 | hyperparameters are chosen by passing values at model instantiation. 276 | 277 | For the linear regression example, can instantiate the 278 | `LinearRegression` class and specify that we would like to fit the 279 | intercept using the `fit_intercept` hyperparameter: 280 | 281 | ```python 282 | model = LinearRegression(fit_intercept=True) 283 | ``` 284 | 285 | When the model is instantiated, the only action is storing these 286 | hyperparameter values. The model is not yet applied to any data; 287 | scikit-learn's API makes a clear distinction between choosing a model, 288 | and applying that model to data. 289 | 290 | ##### Arrange data into a features matrix and target vector 291 | 292 | Our target variable `y` is in the correct form, an array of length 293 | `n_samples`, but we need to adjust `x` to make it a matrix of size 294 | `[n_samples, n_features]`: 295 | 296 | ```python 297 | X = x[:, np.newaxis] 298 | ``` 299 | 300 | ##### Fit the model to data 301 | 302 | Use the `fit()` method of the model: 303 | 304 | ```python 305 | model.fit(X, y) 306 | ``` 307 | 308 | Calling `fit()` causes a number of model-dependent internal computations 309 | to take place, and the results of these computations are stored in 310 | model-specific attributes that the user can explore. In scikit-learn, 311 | by convention, model parameters learned during `fit()` have trailing 312 | underscores, e.g. for this model we have `model.coef_` and 313 | `model.intercept_` representing the slope and intercept of the simple 314 | linear fit to the data. Here, they are close to the input slope of 2 and 315 | intercept of -1. 316 | 317 | NB: scikit-learn does not provide tools to draw conclusions from 318 | internal model parameters themselves: interpreting model parameters is 319 | much more a *statistical modelling* question than a *machine learning* 320 | question. The Statsmodels package is an alternative if you wish to delve 321 | into the meaning of fit parameters. 322 | 323 | ##### Predict labels for unknown data 324 | 325 | Once a model is trained, the main task of supervised machine learning is 326 | to evaluate it based on what it says about new data that was not part of 327 | the training set. In scikit-learn, use the `predict()` method for this. 
328 | Here, our "new data" will be a grid of *x* values and we will ask what 329 | *y* values the model predicts: 330 | 331 | ```python 332 | xfit = np.linspace(-1, 11) 333 | Xfit = xfit[:, np.newaxis] # coerce x values into a features matrix 334 | yfit = model.predict(Xfit) 335 | plt.scatter(x, y) 336 | plt.plot(xfit, yfit) 337 | ``` 338 | 339 | #### Supervised learning example: Iris classification 340 | 341 | Given a model trained on Iris data, how well can we predict the 342 | remaining labels. Here, use a simple generative model, Gaussian naive 343 | Bayes, that assumes each class is drawn from an axis-aligned Gaussian 344 | distribution. Because it is fast and has no hyperparameters to choose, 345 | Gaussian naive Bayes is a useful baseline before exploring whether 346 | improvements can be found through more complicated models. 347 | 348 | Would like to evaluate the model on data it has not seen before, so we 349 | split the data into a *training set* and a *testing set*. This could be 350 | done by hand, but scikit-learn has the `train_test_split()` function: 351 | 352 | ```python 353 | from sklearn.cross_validation import train_test_split 354 | Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris, 355 | random_state=1) 356 | ``` 357 | 358 | Now we proceed with the general pattern detailed above: 359 | 360 | ```python 361 | from sklearn.naive_bayes import GaussianNB # 1. choose model class 362 | model = GaussianNB() # 2. instantiate model 363 | model.fit(Xtrain, ytrain) # 3. fit model to data 364 | y_model = model.predict(Xtest) # 4. predict on new data 365 | ``` 366 | 367 | Finally, we can use `accuracy_score()` to see the fraction of predicted 368 | labels that match their true value: 369 | 370 | ```python 371 | from sklearn.metrics import accuracy_score 372 | accuracy_score(ytest, y_model) 373 | ``` 374 | 375 | This model actually scores more than 97%; even a naive classification 376 | algorithm works well for this dataset. 377 | 378 | For classification, a confusion matrix is also a useful analysis tool to 379 | see which classes are usually correctly or incorrectly classified; 380 | scikit-learn provides this as `sklearn.metrics.confusion_matrix()` and 381 | is used like `accuracy_score`, i.e. you provide `ytest` and `y_model`. 382 | 383 | This can be plotted nicely with Seaborn's heatmap, i.e.: 384 | 385 | ```python 386 | from sklearn.metrics import confusion_matrix 387 | 388 | mat = confusion_matrix(ytest, y_model) 389 | 390 | sns.heatmap(mat, square=True, annot=True, cbar=False) 391 | plt.xlabel('predicted value') 392 | plt.ylabel('true value'); 393 | ``` 394 | 395 | #### Unsupervised learning example: Iris dimensionality 396 | 397 | As an example, let's reduce the dimensionality of the Iris data to more 398 | easily visualise it. The Iris data is four dimensional: four features 399 | recorded for each sample. 400 | 401 | Dimensionality reduction asks whether there is a suitable lower 402 | dimensional representation that retains the essential features of the 403 | data. This makes it easier to plot the data: two dimensions are easier 404 | to plot, than four or more! 405 | 406 | Here, use principal component analysis (PCA) which is a fast linear 407 | dimensionality reduction technique. We will ask the model to return two 408 | components: a two dimensional representation of the data. 409 | 410 | ```python 411 | from sklearn.decomposition import PCA # 1. Choose the model class 412 | model = PCA(n_components=2) # 2. 
Instantiate the model with hyperparameters 413 | model.fit(X_iris) # 3. Fit to data. Notice y is not specified! 414 | X_2D = model.transform(X_iris) # 4. Transform the data to two dimensions 415 | ``` 416 | 417 | Can plot quickly by adding the two components to the Iris `DataFrame` 418 | and plot with Seaborn's `lmplot`: 419 | 420 | ```python 421 | iris['PCA1'] = X_2D[:, 0] 422 | iris['PCA2'] = X_2D[:, 1] 423 | sns.lmplot("PCA1", "PCA2", hue='species', data=iris, fit_reg=False) 424 | ``` 425 | 426 | The 2D representation has the species well separated, even though the 427 | PCA algorithm had no knowledge of the species labels. This indicates a 428 | straightforward classification will likely be effective on the dataset, 429 | as we saw above with Gaussian naive Bayes. 430 | 431 | #### Unsupervised learning example: Iris clustering 432 | 433 | Clustering algorithms attempt to find distinct groups of data without 434 | reference to any labels. Here, use a powerful clustering method called a 435 | Gaussian mixture model (GMM); a GMM attempts to model the data as a 436 | collection of Gaussian blobs. 437 | 438 | ```python 439 | from sklearn.mixture import GMM # 1. Choose the model class 440 | model = GMM(n_components=3, 441 | covariance_type='full') # 2. Instantiate the model with hyperparameters 442 | model.fit(X_iris) # 3. Fit to data. Notice y is not specified! 443 | y_gmm = model.predict(X_iris) # 4. Determine cluster labels 444 | ``` 445 | 446 | Add the cluster label to the Iris `DataFrame` and use Seaborn to plot 447 | the results: 448 | 449 | ```python 450 | iris['cluster'] = y_gmm 451 | sns.lmplot("PCA1", "PCA2", data=iris, hue='species', 452 | col='cluster', fit_reg=False); 453 | ``` 454 | 455 | By splitting the data by cluster number, we can see how well the GMM 456 | algorithm recovers the underlying label. Most of the species are well 457 | separated, and there is only a small amount of mixing between versicolor 458 | and virginica species. In other words, the measurements of these flowers 459 | are distinct enough that we could automatically identify these different 460 | species with a simple clustering algorithm. 461 | 462 | ## Hyperparameters 463 | 464 | Parameters not learned from the data, parts of the model that can be 465 | varied, but not from the data, part of initial configuration; choices 466 | made before the model is fit to data. 467 | 468 | ## Validation 469 | 470 | Don't want to test on data trained with: model is likely to perform much 471 | better on this (in some cases, may be perfect, if the model just picks 472 | the closest point). 473 | 474 | Use holdout set: split into training set and holdout set for testing. 475 | 476 | This is better, but still means a lot of data isn't used for training. 477 | Cross-validation can solve this problem by splitting the data: using 478 | part of it as a validation set. 479 | 480 | n-fold means splitting into n groups, using one of them for validation 481 | and the rest for training, e.g. two-fold is split in half and use half 482 | for training, half for validation; five-fold is split into five and use 483 | one of those groups for validation. You don't just train once, but 484 | repeatedly, switching which group is used for validation and which 485 | group(s) are used for training. 486 | 487 | leave-one-out is extreme cross-validation: split into as many groups as 488 | there are samples/observations, and train on all but one, then validate 489 | on the one left out. 
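As a minimal sketch of k-fold and leave-one-out cross-validation with
scikit-learn (using the iris data purely as an example; `cross_val_score` lives
in `sklearn.model_selection` in newer versions of the library):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB()

# Five-fold cross-validation: five scores, one per held-out fold.
print(cross_val_score(model, X, y, cv=5))

# Leave-one-out: as many fits as there are samples; report the mean score.
print(cross_val_score(model, X, y, cv=LeaveOneOut()).mean())
```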
490 | 
491 | (Validation sets seem to be those used during model development, such as
492 | in cross-validation: data that's not used for training directly at that
493 | point, but is still part of the corpus used in developing a model; *test*
494 | sets seem to be those used after training and validation, as a benchmark
495 | of completely unseen data to evaluate the model.
496 | 
497 | Training sets are used to train the model; validation sets to evaluate
498 | the model while deciding on hyperparameters (and, I assume, features
499 | /training data set size too); test sets to test the model on unseen
500 | data.)
501 | 
502 | ### Hard negative mining
503 | 
504 | Training sets can be iteratively developed; an example of this is hard
505 | negative mining.
506 | 
507 | For instance, suppose you build a face recognition classifier. Testing
508 | the classifier out on a new set of images, finding the false positives and
509 | then including those in the training set can help to improve the
510 | classifier.
511 | 
512 | ## Model selection
513 | 
514 | If our model is underperforming, what can we do?
515 | 
516 | * Use a more complicated/more flexible model
517 | * Use a less complicated/less flexible model
518 | * Get more training samples
519 | * Get more data and add features to each sample
520 | 
521 | Can be counter-intuitive: a more flexible model may be worse; more
522 | training samples may not help.
523 | 
524 | ## Bias-variance trade off
525 | 
526 | High bias: underfits the data (bias here refers to the bias of an estimator:
527 | the difference between its expected value and the true value), e.g. a straight
528 | line model trained on data that lies on a curve; the model is too simplistic
529 | for the data.
530 | 
531 | High variance: overfits the data; the model accounts for random errors in
532 | the data as well as the underlying data distribution, reflecting the noise
533 | in the data, not the real process creating that data; the model fits the
534 | data too closely.
535 | 
536 | High bias gives similar performance on validation and training sets.
537 | 
538 | A high variance model typically performs worse on validation sets than on
539 | the training set.
540 | 
541 | ### Validation curve
542 | 
543 | Plotting scores against complexity (e.g. polynomial degree in polynomial
544 | regression) is a validation curve.
545 | 
546 | Increase complexity: the training score increases towards some limit, and
547 | the validation score increases too (as bias decreases) then drops off as
548 | variance increases. High bias at low complexity. The best model is somewhere
549 | in the middle of complexity.
550 | 
551 | Training score is higher than the validation score everywhere: a model
552 | typically fits data it has already seen better.
553 | 
554 | Low model complexity (high bias): training data is underfit, so the model is
555 | a poor predictor for training and unseen data.
556 | 
557 | High model complexity (high variance): training data is overfit, so the
558 | model predicts training data well, but fails for unseen data.
559 | 
560 | Validation curve at a maximum at some intermediate value: a trade-off
561 | between bias and variance.
562 | 
563 | ### Learning curve
564 | 
565 | The optimal model may also depend on the size of the training data as well as
566 | complexity, so validation curve behaviour depends on these too.
567 | 
568 | A learning curve is a plot of training/validation score against size of
569 | the training set.
570 | 
571 | Larger datasets can support a more complex model, which is more resilient to
572 | overfitting.
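A minimal sketch of computing a learning curve with scikit-learn's
`learning_curve` helper (the polynomial-regression pipeline and synthetic data
here are only for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = 10 * rng.rand(200, 1)
y = np.sin(X).ravel() + 0.3 * rng.randn(200)

model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())

# Training and validation scores for increasing training-set sizes.
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10))

plt.plot(train_sizes, train_scores.mean(axis=1), label='training score')
plt.plot(train_sizes, val_scores.mean(axis=1), label='validation score')
plt.xlabel('training set size')
plt.ylabel('score')
plt.legend();
```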
573 | 
574 | Expect that learning curves behave as follows:
575 | 
576 | * A model of a given complexity will overfit a small dataset; training
577 |   score is relatively high, while validation score is relatively low.
578 | * A model of a given complexity will underfit a large dataset; training
579 |   score will decrease, but validation score will increase.
580 | * A model will never, except by chance, give a better score to the
581 |   validation set than the training set; so the two curves should get
582 |   closer together, but not cross over.
583 | 
584 | Also note that the learning curve tends to converge to a particular score as
585 | training set size increases: adding more training data doesn't help. In
586 | that case, to increase model performance, need a different (often more
587 | complex) model.
588 | 
589 | High complexity models may give a better convergent score than lower
590 | complexity models, but require a larger dataset to prevent overfitting
591 | and get close to that convergent score.
592 | 
593 | ### Grid search
594 | 
595 | A simple example was polynomial degree, but models often have a number of
596 | knobs to turn, so instead of simple 2D plots, can get multidimensional
597 | surfaces that describe their behaviour.
598 | 
599 | Finding the best model can be done via grid search: explore a grid of model
600 | hyperparameters, varying them and calculating the score, e.g. in scikit-learn,
601 | `GridSearchCV`. Can parallelise the search, or search at random, to help find
602 | suitable hyperparameters.
603 | 
604 | If you find the best values are at the edges of the grid, might want to
605 | consider extending the grid to ensure those best values are really
606 | optimal.
607 | 
608 | ## Feature engineering
609 | 
610 | Not all data is numerical. Feature engineering is taking information
611 | that you have and turning it into numbers to build a feature matrix.
612 | Also called vectorisation: turning data into vectors.
613 | 
614 | ### Categorical data
615 | 
616 | Might be tempted to encode a category as a number, but for many models this
617 | implies a numerical ordering, e.g. assigning numbers to colours or places
618 | would imply that "red > blue" or "Manchester <
619 | London".
620 | 
621 | The solution is one-hot encoding: create columns that represent the
622 | presence or absence of each category (so each category can contribute
623 | individually to some estimator).
624 | 
625 | This is explained quite well on [Stack
626 | Overflow](https://stackoverflow.com/questions/17469835/why-does-one-hot-encoding-improve-machine-learning-performance).
627 | 
628 | (The difference from a conventional numerical feature, e.g. petal width,
629 | is that a numerical feature gets a single weight in the model, and its
630 | contribution also depends on the magnitude of the observed
631 | value.)
632 | 
633 | Can result in many columns, but can use sparse matrices to store them more
634 | efficiently.
635 | 
636 | ### Text features
637 | 
638 | Can encode text as word counts. Create a feature matrix where each
639 | sample (row) is an individual text, the columns represent words, and the
640 | values represent word counts.
641 | 
642 | Can also use term frequency-inverse document frequency (tf-idf), which
643 | weights the word count features by how often they appear in documents.
644 | 
645 | ### Image features
646 | 
647 | One option is using the pixel values themselves, but this may not be optimal.
648 | 
649 | ### Derived features
650 | 
651 | Create features from input features.
652 | 
653 | One useful example is constructing polynomial features from input data
654 | to then perform linear regression.
655 | 
656 | Makes linear regression into polynomial regression by transforming the
657 | input: basis function regression.
658 | 
659 | For example, take x, y data and create a feature matrix of x^3, x^2, x.
660 | The explanation of the maths behind this is not clear from this book
661 | section, but the clever part here is that linear regression can work for
662 | multidimensional inputs, and we choose those dimensions to be x,
663 | x^2, x^3.
664 | 
665 | ### Imputation of missing data
666 | 
667 | Fill missing data with some value so that a model can be applied to the
668 | data. Different options, e.g. use the mean of the column or some other model.
669 | 
670 | scikit-learn has the `Imputer` class (replaced by `SimpleImputer` in newer
671 | versions) for simple approaches, e.g. mean, median, most frequent value. Can
then feed this into an estimator.
672 | 
673 | ### scikit-learn pipelines
674 | 
675 | Can create pipelines to avoid manually carrying out multiple steps, e.g.
676 | impute missing values as the mean, transform to quadratic polynomial
677 | features, then fit a linear regression.
678 | 
679 | ```python
680 | from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer, PolynomialFeatures
from sklearn.linear_model import LinearRegression
681 | 
682 | model = make_pipeline(Imputer(strategy='mean'),
683 |                       PolynomialFeatures(degree=2),
684 |                       LinearRegression())
685 | ```
686 | 
687 | Can then use that just like another scikit-learn model, via `model.fit()`,
688 | `model.predict()` etc.
689 | 
690 | ## Naive Bayes
691 | 
692 | Fast and simple classification algorithms often suitable for very
693 | high-dimensional datasets. Few tunable parameters, so a quick baseline for
694 | classification problems.
695 | 
696 | ### Bayesian classification
697 | 
698 | Naive Bayes classifiers are built on Bayesian classification methods,
699 | which in turn rely on Bayes' theorem.
700 | 
701 | In Bayesian classification, want to find the probability of a label given
702 | some observed features: P(L|features).
703 | 
704 | Bayes' theorem allows us to calculate this in terms of quantities we can
705 | calculate more directly:
706 | 
707 | ```
708 | P(L|features) = P(features|L) P(L)
709 |                 ------------------
710 |                     P(features)
711 | ```
712 | 
713 | If trying to decide between two labels — L1 and L2 — then one way to
714 | decide is to calculate the ratio of the posterior probabilities for each
715 | label:
716 | 
717 | ```
718 | P(L1|features)   P(features|L1) P(L1)
719 | -------------- = --------------------
720 | P(L2|features)   P(features|L2) P(L2)
721 | ```
722 | 
723 | Need a model to find P(features|Li) for each label, Li. Such a model is
724 | called *generative* because it specifies the hypothetical random process that
725 | generates the data. Specifying this model for each label is the main
726 | piece of the training of such a classifier. The general version of this
727 | step is difficult, but can be simplified via assumptions about the form
728 | of the model.
729 | 
730 | This is the "naive" part: by making naive assumptions about the generative
731 | model for each label, can find a rough approximation of the model and
732 | then proceed with classification.
733 | 
734 | Different naive Bayes classifiers rest on different naive assumptions
735 | about the data.
736 | 
737 | ### Gaussian naive Bayes
738 | 
739 | For this classifier, assume that data from each label is drawn from a
740 | simple Gaussian distribution.
741 | 
742 | Find the mean and standard deviation of the points in each label.
Builds 743 | a Gaussian generative model with larger probabilities near the centre of 744 | those distributions. Can then calculate the likelihood P(features|Li) 745 | and the posterior ratio to find the label. 746 | 747 | Often get a quadratic boundary in Gaussian naive Bayes. 748 | 749 | Can also use Bayesian classification to estimate probabilities for the 750 | classes. 751 | 752 | ### Multinomial naive Bayes 753 | 754 | Can use multinomial naive Bayes: assume the features are from a 755 | multinomial distribution, one which describes the probability of 756 | observing counts among different categories. 757 | 758 | Particularly appropriate for features representing counts or count 759 | rates, e.g. word counts or frequencies. 760 | 761 | Similar to Gaussian naive Bayes, just with a different generative model. 762 | 763 | ### When to use naive Bayes 764 | 765 | Naive Bayesian classifiers make stringent assumptions about data: 766 | generally don't perform as well as a more complex model. 767 | 768 | Advantages: 769 | 770 | * Fast for training and prediction. 771 | * Straightforward probabilistic prediction. 772 | * Often easy to interpret. 773 | * Few (if any) tunable parameters. 774 | 775 | Good initial choice. If it works well, can use as a fast, easy to 776 | understand classifier for the problem. If not, can use more complex 777 | models with some baseline of how well they should perform. 778 | 779 | Work well in the following situations: 780 | 781 | * When naive assumptions match the data (rarely in practice). 782 | * For very well-separated categories, when model complexity is less 783 | important. 784 | * For very high-dimensional data, when model complexity is less 785 | important. 786 | 787 | The last two items are related: as the dimensions grow, much less likely 788 | for two points to be close together (as they must be close in every 789 | dimension). Clusters in high dimensions tend to be more separated than 790 | those in low dimensions, on average, if the new dimensions add 791 | information. 792 | 793 | So, simplistic classifiers, like naive Bayes, tend to work as well or 794 | better than more complex ones as dimensionality grows: simple models can 795 | be powerful, given enough data. 796 | 797 | ## Linear regression 798 | 799 | Naive Bayes is a good starting point for classification, and linear 800 | regression is a good starting point for regression. These can be fit 801 | quickly and are very interpretable. 802 | 803 | The familiar model is a straight line data fit, but can be extended to 804 | model more complex data. 805 | 806 | Can handle multidimensional linear models, e.g. fitting a plane to 807 | points in three dimensions or hyperplanes to points in higher 808 | dimensions. 809 | 810 | ### Basis function regression 811 | 812 | As mentioned above, can use linear regression for nonlinear 813 | relationships between variables by transforming the data according to 814 | basis functions, replacing the x1, x2, x3... in a linear model with e.g. 815 | x, x^2, x^3. 816 | 817 | The linearity refers to the fact that the coefficients don't multiply or 818 | divide each other. This projects one-dimensional x values into a higher 819 | dimension, to fit more complex relationships between x and y. 820 | 821 | As above, can use polynomial features, but can use other basis 822 | functions, which can be created by a user even if not available in 823 | scikit-learn, e.g. Gaussian basis functions. 
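As a rough sketch of what a user-defined basis could look like (the
`GaussianFeatures` transformer below is illustrative, not a built-in
scikit-learn class): each x value is replaced by its response under a set of
evenly spaced Gaussian bumps, and the result is fed to `LinearRegression` via
a pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline


class GaussianFeatures(BaseEstimator, TransformerMixin):
    """Uniformly spaced Gaussian basis functions for 1D input."""

    def __init__(self, N, width_factor=2.0):
        self.N = N
        self.width_factor = width_factor

    def fit(self, X, y=None):
        # Place N centres evenly across the observed range of x.
        self.centers_ = np.linspace(X.min(), X.max(), self.N)
        self.width_ = self.width_factor * (self.centers_[1] - self.centers_[0])
        return self

    def transform(self, X):
        # One column per Gaussian bump.
        arg = (X - self.centers_) / self.width_
        return np.exp(-0.5 * arg ** 2)


# Fit a noisy sine curve using 20 Gaussian basis functions.
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)

model = make_pipeline(GaussianFeatures(20), LinearRegression())
model.fit(x[:, np.newaxis], y)
yfit = model.predict(x[:, np.newaxis])
```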
824 | 
825 | ### Regularisation
826 | 
827 | Basis functions can make linear regression models more flexible, but can
828 | lead to overfitting, especially if using too many basis functions. What
829 | can happen is that coefficients of basis functions (e.g. adjacent
830 | Gaussian basis functions) can blow up, cancelling each other out.
831 | 
832 | Regularisation can limit these spikes by penalising large values of the
833 | model parameters. The penalty parameter should be determined via
834 | cross-validation.
835 | 
836 | #### Ridge regression (L2 regularisation)
837 | 
838 | Penalises the sum of squares of the model coefficients with a parameter
839 | that multiplies this sum and controls the strength of the penalty.
840 | 
841 | As this penalty approaches the limit of zero, we get back the standard linear
842 | regression model; as it approaches the limit of infinity, the model
843 | responses will be suppressed.
844 | 
845 | Ridge regression can be computed efficiently, with little cost over the
846 | standard linear regression model.
847 | 
848 | #### Lasso regression (L1 regularisation)
849 | 
850 | Penalises the sum of absolute values. Similar to ridge regression in
851 | concept, but can give very different results: it tends to favour sparse
852 | models, setting many model coefficients to exactly zero.
853 | 
854 | ## Support vector machines (SVMs)
855 | 
856 | SVMs are a powerful and flexible class of supervised algorithms for
857 | classification and regression. Here, we'll discuss their use for
858 | classification.
859 | 
860 | ### Motivation
861 | 
862 | For Bayesian classification above, we learned a generative model
863 | describing the distribution of each class and used that to predict
864 | labels for new points. That is generative classification.
865 | 
866 | Here, we consider discriminative classification: instead of modelling
867 | each class, we find a line or curve (in 2D) or manifold (in multiple
868 | dimensions) that divides the classes.
869 | 
870 | For separating two groups of points belonging to different classes, it
871 | may be that a linear discriminative classifier creates a straight line
872 | that divides the two sets of data, and creates a model for
873 | classification. However, there may be more than one line that satisfies
874 | this criterion. Furthermore, different lines may give different
875 | predictions for the same point.
876 | 
877 | ### SVMs maximise the margin
878 | 
879 | Instead of drawing a zero-width line between classes, draw a margin
880 | around each line of some width, up to the nearest point.
881 | 
882 | For SVMs, choose the line that maximises this margin as the optimal
883 | model: a maximum margin estimator.
884 | 
885 | Training points that touch the margins are the pivotal elements of the
886 | fit: these are the support vectors. Only the position of the support
887 | vectors matters: points further from the margin on the correct side do
888 | not affect the fit. Those points do not contribute to the loss function
889 | used to fit the model. So, even adding points, provided they're further
890 | away than the margin, won't affect the fit.
891 | 
892 | ### Kernel SVM
893 | 
894 | Have seen kernels before in basis function regression above: projecting
895 | data into a higher-dimensional space defined by basis functions,
896 | and then fitting nonlinear relationships with a linear classifier.
897 | 
898 | Can do the same for SVM.
For example, imagine data of two classes with one 899 | class in the centre of two features, and the other class data 900 | surrounding it in a rough circle. A linear separator won't work. 901 | 902 | However, if you add a dimension using a radial basis function, centre 903 | positioned class (with smaller radius) will have different values in 904 | this dimension to the outer class (with larger radius). This makes them 905 | easy to separate by a plane, parallel to x and y, in the middle of the 906 | radial basis function values. 907 | 908 | Depending on how the function was chosen, may not be so cleanly 909 | separable. Choosing the function is a problem: automating this is ideal. 910 | One way is to compute a basis function at every point in the dataset and 911 | let the algorithm choose the best result. This type of basis function 912 | transformation is a kernel transformation, based on a similarity 913 | relationship (or kernel) between each pair of points. 914 | 915 | However, projecting N points into N dimensions could be computationally 916 | expensive. Instead, use the kernel trick, which allows a fit on 917 | kernel-transformed data to be done implicitly, without creating the full 918 | N dimensional representation. 919 | 920 | (For scikit-learn, can apply kernelised SVM by changing the linear 921 | kernel to a radial basis function kernel, via `kernel` hyperparameter.) 922 | 923 | Kernel transformation is often used in machine learning to turn fast 924 | linear methods into fast nonlinear methods, especially for models where 925 | the kernel trick can be used. 926 | 927 | ### Tuning SVM: softening margins 928 | 929 | If no perfect decision boundary exists and data overlaps, can use a 930 | tuning parameter, C that allows points to enter the margin, if that 931 | allows a better fit. For very large C, the margin is hard and points 932 | cannot lie in it. For smaller C, the margin is softer and can grow to 933 | include some points. 934 | 935 | ### Pros and cons of SVM 936 | 937 | SVM is powerful for classification because: 938 | 939 | * Depend on relatively few support vectors so are compact models and 940 | take up little memory. 941 | * When model is trained, prediction is fast. 942 | * Only affected by points near margin, so work well with 943 | high-dimensional data, even data with more dimensions than samples, 944 | which can be challenging for other models. 945 | * Integration with kernel methods makes them adaptable to many types of 946 | data. 947 | 948 | But: 949 | 950 | * Scaling with number of sample is O(N^3) at worst, or O(N^2) for 951 | efficient implementations. For large training sets, can be expensive 952 | computationally. 953 | * Results depend strongly on a suitable choice for softening parameter 954 | C. Need to find this via cross-validation, which can be expensive for 955 | large datasets. 956 | * Results do not have a direct probabilistic interpretation. Can be 957 | estimated via an internal cross-validation, but can be costly. 958 | 959 | Useful if faster, simpler, less tuning-intensive methods are 960 | insufficient. If processing time available, can be an excellent choice. 961 | 962 | ## Decision trees and random forests 963 | 964 | Random forests are a nonparametric algorithm. They are an example of an 965 | ensemble method: relying on aggregating the results of an ensemble of 966 | simpler estimators. 
With ensemble methods, the sum can be greater than 967 | its parts: a majority vote among estimators can be better than any of 968 | the individual estimators. 969 | 970 | ### Decision trees 971 | 972 | Random forests are ensemble learners built on decision trees. 973 | 974 | Decision trees classify objects by asking a series of questions to 975 | zero-in on their classification, each question with (typically, if not 976 | always) two mutually exclusive answers. 977 | 978 | Binary splitting makes this efficient: each question will cut the number 979 | of options by approximately half. The trick is deciding the questions to 980 | ask at each step. 981 | 982 | In machine learning implementations of decision trees, the questions 983 | generally take the form of axis-aligned splits in the data: each node in 984 | the tree splits the data into two groups using a cutoff value within one 985 | of the features. 986 | 987 | In a simple case of two features, a decision tree iteratively splits the 988 | data along one or other axis according to some quantitative criterion. 989 | At each level, assign the label of the new region according to the 990 | majority vote of the points within it. 991 | 992 | At each level of the tree, each region is split along one or the other 993 | features, unless it wholly contains points of one class (there's no need 994 | to continue splitting them as the voting result is the same). 995 | 996 | `DecisionTreeClassifier` in scikit-learn. 997 | 998 | #### Decision trees and overfitting 999 | 1000 | As depth increases, can get strange shaped regions, due to overfitting. 1001 | 1002 | This is a common property of decision trees: end up fitting the details 1003 | of the data, not the overall property of the distribution the data are 1004 | drawn from. 1005 | 1006 | Get inconsistencies between trees looking at different data. Turns out 1007 | that using information from multiple trees is a way to improve our 1008 | result. 1009 | 1010 | ### Random forests 1011 | 1012 | This leads to the idea of bagging: using an ensemble of parallel 1013 | estimators that each overfit the data, and averages the result. An 1014 | ensemble of randomised decision trees is a random forest. 1015 | 1016 | `RandomForestClassifier` in scikit-learn. Only need to select the number 1017 | of estimators, but can work very well. 1018 | 1019 | #### Random forest regression 1020 | 1021 | Can use `RandomForestRegressor` in scikit-learn for regression. 1022 | 1023 | #### Pros and cons of random forests 1024 | 1025 | Advantages: 1026 | * Training and prediction are fast, because of the simplicity of the 1027 | decision trees. Both training and prediction can be parallelised 1028 | easily as the trees are entirely independent. 1029 | * Multiple trees allow for probabilistic classification: majority vote 1030 | among estimators allows probability to be estimated. 1031 | * Nonparametric model is flexible and can perform well on tasks that are 1032 | underfit by other estimators. 1033 | 1034 | Disadvantage: 1035 | * Not easy to interpret the results; meaning of the classification model 1036 | not easy to draw conclusions from. 1037 | 1038 | ## Principal component analysis (PCA) 1039 | 1040 | A commonly used unsupervised algorithm. 1041 | 1042 | Used for dimensionality reduction, but also for visualisation, noise 1043 | filtering, feature extraction and engineering. 1044 | 1045 | It is a fast and flexible unsupervised method for dimensionality 1046 | reduction in data. 
Unsupervised learning aims to learn about the 1047 | relationship, e.g. between x and y values, not predict y from x. 1048 | 1049 | For PCA, this relationship is quantified by finding the principal axes 1050 | in the data, and using those to describe the dataset. Can visualise as 1051 | vectors on the input data: using components to define the vector 1052 | direction and the explained variance to define the squared length of the 1053 | vector. The length of the vector indicates the "importance" of that axis 1054 | in describing the data distribution: it is a measure of the variance of 1055 | the data when projected onto that axis. 1056 | 1057 | Transformation from data axes to principal axes is an affine 1058 | transformation: consisting of a translation, rotation and uniform 1059 | scaling. 1060 | 1061 | PCA has lots of uses. 1062 | 1063 | ### PCA as dimensionality reduction 1064 | 1065 | In dimensionality reduction with PCA, zero out one or more of the 1066 | smallest principal components, resulting in a lower dimensional 1067 | projection of the data that preserves the maximal data variance. 1068 | 1069 | Can reduce dimensionality, then inverse transform the reduced data to 1070 | see the effect of this reduction. Loses information along least 1071 | important axis or axes, leaving the components with the highest 1072 | variance. The fraction of variance lost is roughly a measure of the 1073 | information lost in this process. 1074 | 1075 | A reduced dimension dataset can be still useful enough to describe the 1076 | important relationships between data points. 1077 | 1078 | ### PCA for visualisation 1079 | 1080 | In high dimensions, can use a low dimensional representation to 1081 | visualise the data in e.g. two dimensions. As if you find the optimal 1082 | stretch and rotation in that high dimensional space to see the 1083 | projection in two dimensions, and can be done without reference to the 1084 | labels. 1085 | 1086 | Consider digits as 8x8 pixel representations. Could represent in terms 1087 | of pixel values, with a pixel basis where each pixel is an individual 1088 | dimension; could multiply each pixel value by the pixel it describes to 1089 | reconstruct the image. 1090 | 1091 | (I think this is a bit confusing, because a zero value would typically 1092 | be a black pixel, not a white one, but it gets the idea of 1093 | dimensionality reduction across.) 1094 | 1095 | Could also reduce the dimensionality by using the first eight pixels 1096 | only, but would lose a lot of the image and these wouldn't represent it 1097 | very well. 1098 | 1099 | However, can choose a different basis, e.g. functions that include 1100 | pre-defined contributions from each pixel, and add just a few of those 1101 | together to better reconstruct the data. 1102 | 1103 | PCA finds these more efficient basis functions. 1104 | 1105 | ### Choosing number of components 1106 | 1107 | Look at cumulative explained variance ratio as a function of the number 1108 | of components: see how much of the information is kept as number of 1109 | components increases. May be able to keep most of the variance with a 1110 | relatively small number of components. 1111 | 1112 | ### PCA as noise filtering 1113 | 1114 | Components with variance much larger than effect of noise should be 1115 | relatively unaffected by noise, so reconstructing data using the largest 1116 | subset of principal components, you should preferentially keep the 1117 | signal and discard the noise. 
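A sketch of this denoising workflow (spelled out in the next paragraph); the noisy digits setup and the choice to retain 50% of the variance are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Hypothetical setup: digit images with Gaussian noise added to each pixel
digits = load_digits()
rng = np.random.RandomState(42)
noisy = digits.data + rng.normal(0, 4, size=digits.data.shape)

# Keep enough components to explain 50% of the variance; the noise lives
# mostly in the many low-variance components that get discarded
pca = PCA(0.50).fit(noisy)

components = pca.transform(noisy)             # project onto retained components
filtered = pca.inverse_transform(components)  # reconstruct denoised digit images
```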
Can specify how much of variance to retain 1118 | in PCA with scikit-learn. 1119 | 1120 | Perform PCA fit on data, transform the data and then use the inverse of 1121 | the transform to reconstruct the filtered digits. 1122 | 1123 | PCA can be useful for feature selection since you can train a classifier 1124 | on lower dimensional reduction of high dimensional data, which will help 1125 | to filter out random noise. 1126 | 1127 | ### Summary of PCA 1128 | 1129 | Useful in a wide range of contexts. Useful to visualise relationship 1130 | between points, to understand variance in data and to understand the 1131 | dimensionality (looking at how many components you need to represent the 1132 | data). 1133 | 1134 | Weakness is that it can be strongly affected by data outliers. Robust 1135 | variants act to iteratively discard data points that are poorly 1136 | described by the initial components, e.g. `RandomizedPCA` and 1137 | `SparsePCA` in scikit-learn. 1138 | 1139 | ## Manifold learning 1140 | 1141 | PCA can be used to reduce dimensionality: reducing the number of 1142 | features of a dataset while maintaining the essential relationships 1143 | between the points. It is fast and flexible, but does not perform well 1144 | where there are nonlinear relationships within the data. 1145 | 1146 | Can use manifold learning methods instead. These are unsupervised 1147 | estimators that aim to describe datasets as low dimensional manifolds 1148 | embedded in high dimensional spaces. 1149 | 1150 | To think of a manifold, can imagine a sheet of paper: a two dimensional 1151 | object that exists in a three dimensional world, and can be bent or 1152 | rolled in those two dimensions. 1153 | 1154 | Rotating, reorienting or stretching the paper in three dimensional space 1155 | doesn't change the geometry of the paper: these are akin to linear 1156 | embeddings. If you bend, curl or crumple the paper, it is a two 1157 | dimensional manifold still, but the embedding in three dimensional space 1158 | is no longer linear. 1159 | 1160 | Manifold learning algorithms aim to learn about the fundamental two 1161 | dimensional nature of the paper, even as it is contorted to fill the 1162 | three dimensional space. 1163 | 1164 | ### Multidimensional scaling (MDS) 1165 | 1166 | For some data, e.g. points in the shape of the word "HELLO" in 2D, the 1167 | x and y values are not the most fundamental description of the data. 1168 | 1169 | Can rotate or scale the data and the "HELLO" will still be apparent. 1170 | 1171 | What is fundamental is the distance between each point and the other 1172 | points. Can represent using a distance matrix: construct an NxN array 1173 | such that each entry (i, j) contains the distance between point i and 1174 | point j, e.g. scikit-learn's `pairwise_distances` function. 1175 | 1176 | Even if this "HELLO" data is rotated or translated, get the same 1177 | distance matrix. However, the distance matrix is not easily visualised. 1178 | Furthermore, it is difficult to convert the distance matrix back into 1179 | x and y coordinates. 1180 | 1181 | MDS can be used to reconstruct a D-dimensional coordinate representation 1182 | of the data (i.e. the number of coordinates in the distance matrix), 1183 | only given the distance matrix. 1184 | 1185 | #### MDS as manifold learning 1186 | 1187 | Distance matrices can be computed from data in any dimension, and can 1188 | reduce down to fewer components (e.g. 3D to 2D). 
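As a sketch of how this looks in scikit-learn; the 3D points in `X3` are a made-up stand-in for data like the "HELLO" points lifted into three dimensions.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

# Hypothetical example data: 100 points embedded in three dimensions
rng = np.random.RandomState(1)
X3 = rng.rand(100, 3)

D = pairwise_distances(X3)   # N x N matrix of distances between pairs of points

# MDS recovers 2D coordinates that approximately preserve those distances
model = MDS(n_components=2, dissimilarity='precomputed', random_state=1)
X2 = model.fit_transform(D)
```

Only the distance matrix is passed in; MDS recovers a set of coordinates consistent with it.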
1189 | 1190 | Given high dimensional embedded data, it seeks a low dimensional 1191 | representation of the data that preserves certain relationships within 1192 | the data. For MDS, the preserved quantity is the distance between pairs 1193 | of points. 1194 | 1195 | ### Nonlinear embeddings 1196 | 1197 | MDS breaks down when embeddings are nonlinear, e.g. "HELLO" distorted 1198 | into an S-shape in three dimensions, by curving the text around. 1199 | 1200 | MDS cannot unwrap this nonlinear embedding, so lose the relationships 1201 | when it is applied. 1202 | 1203 | #### Locally linear embedding (LLE) 1204 | 1205 | MDS attempts to preserve distances between faraway points. Instead, LLE 1206 | aims to preserve distance between neighbouring points. Can be better at 1207 | recovering well-defined manifolds with little distortion. 1208 | 1209 | ### More on manifold methods 1210 | 1211 | Challenges of manifold learning: 1212 | 1213 | * No good framework for handling missing data, but straightforward 1214 | iterative approaches for this case in PCA. 1215 | * Presence of noise in the data can drastically change the embedding. 1216 | PCA filters noise from the most important components. 1217 | * Manifold embedding result depends on the number of neighbours chosen, 1218 | and no good quantitative way to choose this. PCA doesn't need this 1219 | choice. 1220 | * The globally optimal number of output dimensions is difficult to 1221 | determine. PCA lets you choose this based on explained variance. 1222 | * Meaning of embedded dimensions is not always clear. In PCA, the 1223 | principal components have a clear meaning. 1224 | * Manifold learning methods scale as O(N^2) or O(N^3). For PCA, there 1225 | are randomised approaches that are much faster. 1226 | 1227 | The advantage of manifold learning methods over PCA is that they can 1228 | preserve nonlinear relationships in the data, but PCA is often a better 1229 | first choice. 1230 | 1231 | For toy problems, LLE and variants perform well. For high dimensional 1232 | data from real world sources, IsoMap tends to give more meaningful 1233 | embeddings than LLE. For highly clustered data, t-distributed stochastic 1234 | neighbor embedding can work well, but can be slow. 1235 | 1236 | ## k-means clustering 1237 | 1238 | Clustering algorithms seek to learn from the data an optimal division or 1239 | discrete labelling of the data. 1240 | 1241 | Many clustering algorithms, but k-means is simple to understand. 1242 | 1243 | ### Introducing k-means 1244 | 1245 | Looks for a pre-determined number of clusters within an unlabelled 1246 | multidimensional dataset. Optimal clustering as considered here is 1247 | simple: 1248 | 1249 | * The "cluster centre" is the arithmetic mean of all the points 1250 | belonging to the cluster. 1251 | * Each point is closer to its own centre than to others. 1252 | 1253 | These assumptions are the basis of the k-means model. 1254 | 1255 | Finds clusters quickly, even though the number of possible combinations 1256 | of cluster assignments could be large. However, doesn't use an 1257 | exhaustive search, but expectation-maximisation, an iterative approach. 1258 | 1259 | ### k-means algorithm: expectation-maximisation (E-M) 1260 | 1261 | E-M is a powerful algorithm. k-means is a simple application of it; the 1262 | E-M approach here consists of: 1263 | 1264 | 1. Guess some cluster centres. 1265 | 2. Repeat until converged: 1266 | A. E-step: assign points to the nearest cluster centre 1267 | B. 
M-step: set the cluster centres to the mean of their assigned points 1268 | 1269 | The E-step is called that because we update our expectation of which 1270 | cluster each point belongs to. The M-step is named such because it 1271 | involves maximising some fitness function that defines the location of 1272 | the cluster centres, here by taking the mean of the data in each 1273 | cluster. 1274 | 1275 | Usually, repeating the E-step and M-step will result in a better 1276 | estimate of the cluster characteristics. 1277 | 1278 | #### Caveats of E-M and k-means 1279 | 1280 | * Although repeated E-M improves the clusters, the globally optimal 1281 | result may not be achieved. Usually run it for multiple starting 1282 | guesses. 1283 | * Need to tell it how many clusters are expected: this can't be learned 1284 | from the data. Whether a clustering result is meaningful is difficult to 1285 | answer: can use silhouette analysis. Otherwise, can use other 1286 | clustering algorithms that do have a measure of fitness per number of 1287 | clusters, e.g. Gaussian mixture models, or which can choose a suitable 1288 | number of clusters (DBSCAN, mean-shift or affinity propagation). 1289 | * k-means is limited to linear cluster boundaries; it assumes points are 1290 | closer to their cluster centre than to others. Can fail if clusters have 1291 | complicated geometries. Can work around this by using kernelised 1292 | k-means, e.g. `SpectralClustering` in scikit-learn: transforms the data 1293 | into higher dimensions and then assigns labels using k-means, much 1294 | like the linear regression example above, where the data are transformed so that more 1295 | complex curves can be used in fitting. 1296 | * Slow for large numbers of samples. Every iteration of k-means accesses 1297 | every data point. Can solve by using a subset of the data to update 1298 | the cluster centres, via batch-based k-means algorithms, e.g. 1299 | `MiniBatchKMeans` in scikit-learn. 1300 | 1301 | ### Other uses of k-means 1302 | 1303 | Aside from clustering for classification, k-means also has novel uses, e.g. 1304 | colour compression: compressing the colour values in pixels to reduce 1305 | the data by replacing a group of pixels with a single cluster centre. 1306 | 1307 | ## Gaussian mixture models (GMMs) 1308 | 1309 | k-means is simple and easy to understand, but its simplicity leads to 1310 | practical challenges for its application. In particular, the 1311 | non-probabilistic nature of k-means and its use of a simple 1312 | distance from the cluster centre to assign cluster membership leads to poor 1313 | performance for many real world situations. 1314 | 1315 | ### Weaknesses of k-means 1316 | 1317 | #### No probability measure 1318 | 1319 | k-means has no intrinsic measure of probability or uncertainty of 1320 | cluster assignments (though it may be possible to use a bootstrap approach 1321 | to estimate the uncertainty), e.g. in regions where clusters overlap. 1322 | 1323 | #### Inflexible cluster shape 1324 | 1325 | Can consider k-means as placing a circle (or, in higher dimensions, a 1326 | hypersphere) at the centre of each cluster, with a radius defined by the 1327 | most distant point in the cluster. This radius acts as a hard cutoff for 1328 | cluster assignment within the training set: any point outside the circle 1329 | is not a member of the cluster. 1330 | 1331 | This circular structure means that k-means has no way of accounting for 1332 | oblong or elliptical clusters; a circular cluster may be a poor fit, but 1333 | k-means will still try to fit the data to circular clusters.
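For reference, a minimal k-means sketch, continuing the circular-cluster picture above; the blob data from `make_blobs` is a hypothetical example, and note that the assignments are hard labels with no measure of uncertainty.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical example data: four roughly circular blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

kmeans = KMeans(n_clusters=4, random_state=0)
labels = kmeans.fit_predict(X)      # hard cluster assignment for each point
centres = kmeans.cluster_centers_   # learned cluster centres
```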
Can get a 1334 | mixing of cluster assignments where these circles overlap. 1335 | 1336 | ### Generalising E-M: Gaussian mixture models 1337 | 1338 | Might imagine addressing these two weaknesses by generalising the 1339 | k-means model, e.g. measure uncertainty in cluster assignment by 1340 | comparing the distances of each point to all cluster centres, not just 1341 | the closest. Might also consider making the cluster boundaries 1342 | elliptical to account for non-circular clusters. These are the essential 1343 | components of an alternative clustering model: Gaussian mixture models. 1344 | 1345 | A GMM attempts to find a mixture of multidimensional Gaussian 1346 | probability distributions that best model any input dataset. 1347 | 1348 | In the simplest case, GMMs can be used to find clusters, just as k-means 1349 | does. But, as it contains a probabilistic model, it can find 1350 | probabilistic cluster assignments (`predict_proba` method in 1351 | scikit-learn), which give the probability of a point belonging to the 1352 | given cluster. 1353 | 1354 | Like k-means, uses expectation-maximisation approach which does the 1355 | following: 1356 | 1357 | 1. Choose starting guesses for the location and shape. 1358 | 2. Repeat until converged: 1359 | A. E-step: for each point, find weights encoding the probability 1360 | of membership in each cluster. 1361 | B. M-step: for each cluster, update its location, normalisation 1362 | and shape based on all data points, making use of the 1363 | weights. 1364 | 1365 | The result is that each cluster is associated with a smooth Gaussian 1366 | model, not a hard-edged sphere. Just as in the k-means E-M approach, 1367 | GMMs can sometimes miss the globally optimal solution, and multiple 1368 | random initialisations are typically used. 1369 | 1370 | #### Choosing the covariance type 1371 | 1372 | Covariance type is a hyperparameter that controls the degrees of freedom 1373 | in the shape of each cluster. This is an important setting. The 1374 | scikit-learn default is `covariance_type='diag'`, which means that the 1375 | size of the cluster along each dimension can be set independently, with 1376 | the resulting ellipse constrained to align with the axes. 1377 | 1378 | A slightly simpler and faster model is `spherical`; this model 1379 | constrains the shape of the cluster such that all dimensions are equal. 1380 | The resulting clustering will have similar characteristics to k-means, 1381 | although it is not entirely equivalent. 1382 | 1383 | A more complicated and more expensive model (especially as the number of 1384 | dimensions increases) is `full` which allows each cluster to be modelled 1385 | as an ellipse with arbitrary orientation. 1386 | 1387 | ### GMM as density estimation 1388 | 1389 | GMM is often thought of as clustering algorithm, but is fundamentally an 1390 | algorithm for density estimation. The result of a GMM fit to data is 1391 | technically not a clustering model, but a generative probabilistic model 1392 | describing the data distribution. 1393 | 1394 | For data not in "nice" clusters, e.g. a distribution shaped in 1395 | two interleaved crescents, using two component GMM gives a poor fit: 1396 | where the distribution is clustered into two halves, ignoring the 1397 | crescents. 1398 | 1399 | Using many more components and ignoring cluster labels, however, get a 1400 | fit closer to the input data. 
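A sketch of this density-estimation use of a GMM; the crescent data from `make_moons` and the choice of 16 components are assumptions for illustration, and the individual components are not meaningful clusters here.

```python
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

# Hypothetical example data: two interleaved crescents
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Many components with full covariance: model the overall density, not clusters
gmm = GaussianMixture(n_components=16, covariance_type='full', random_state=0)
gmm.fit(X)

X_new, _ = gmm.sample(400)   # draw new points distributed like the input data
```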
Models the distribution of the input data: a 1401 | generative model that we can use to generate new random data distributed 1402 | similarly to the input. 1403 | 1404 | #### How many components? 1405 | 1406 | GMM being a generative model gives a natural way of determining the 1407 | optimal number of components for a given dataset. A generative model is 1408 | inherently a probability distribution for the dataset, so we can 1409 | evaluate the *likelihood* of the data under the model, using 1410 | cross-validation to avoid overfitting (likelihood being roughly: a 1411 | measure of the extent to which the data supports the model). 1412 | 1413 | Can also correct for overfitting by adjusting the model likelihoods 1414 | using some criterion like the Akaike information criterion (AIC) or the 1415 | Bayesian information criterion (BIC). scikit-learn's GMM model includes 1416 | methods to compute these. The optimal number of clusters minimises the AIC 1417 | or BIC. This choice of number of components measures GMM's performance 1418 | as a density estimator, not as a clustering algorithm. Better to think 1419 | of GMM as a density estimator, and use it for clustering only when 1420 | warranted by simple datasets. 1421 | 1422 | ### GMM for generating new data 1423 | 1424 | Can use GMM, e.g. with the digits data, to synthesise new examples of 1425 | digits. Note that for high dimensional spaces, GMMs can have trouble 1426 | converging, so using PCA and preserving most of the variance (e.g. 99% 1427 | in the cited example) is a useful first processing step to reduce the 1428 | number of dimensions with minimal information loss. 1429 | 1430 | In scikit-learn, use the `sample` method for new examples, and then do the 1431 | inverse transform of the PCA to get back data in a form that's useful 1432 | (i.e. to display them). 1433 | 1434 | ## Kernel density estimation (KDE) 1435 | 1436 | A density estimator is an algorithm which takes a D-dimensional dataset 1437 | and produces an estimate of the D-dimensional probability distribution 1438 | that the data is drawn from. 1439 | 1440 | The GMM algorithm does this by representing the density as a weighted 1441 | sum of Gaussian distributions. KDE goes further and uses a single 1442 | Gaussian component per point, making it essentially a non-parametric 1443 | estimator of density. 1444 | 1445 | ### Histograms 1446 | 1447 | A density estimator models the probability distribution that generated a 1448 | dataset. For 1D data, a histogram is a familiar simple density 1449 | estimator. A histogram divides data into discrete bins, counts the 1450 | points that fall in each bin and visualises the results. 1451 | 1452 | Can create a normalised histogram where the height of the bins reflects 1453 | probability density. However, an issue with using a histogram as a density 1454 | estimator is that the choice of bin size and location can lead to 1455 | representations with different features: e.g. a small shift in bin 1456 | locations can nudge values into neighbouring bins and modify the 1457 | apparent distribution. 1458 | 1459 | Can also think of a histogram as a stack of blocks, where one block is 1460 | stacked within each bin on top of each point in the dataset. Instead of 1461 | stacking the blocks aligned with the bins, can stack the blocks aligned 1462 | with the points they represent. The blocks from different points won't 1463 | be aligned, but can add their contributions at each location along the 1464 | x-axis to find the result.
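A sketch of this block-stacking view using scikit-learn's `KernelDensity`; the 1D observations in `x` and the bandwidth are assumed for illustration. A `'tophat'` kernel stacks a flat block on each point, while `'gaussian'` gives the smooth version discussed next.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

x = np.array([1.0, 1.2, 2.5, 3.1, 3.2, 4.8])   # hypothetical 1D observations

# 'tophat' places a flat block of width ~2*bandwidth centred on each point
kde = KernelDensity(bandwidth=0.3, kernel='tophat').fit(x[:, None])

grid = np.linspace(0, 6, 200)[:, None]
density = np.exp(kde.score_samples(grid))   # score_samples returns log-density
```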
This stacking of blocks gives a more robust reflection of the 1465 | data characteristics than the standard histogram, although it gives rough 1466 | edges that don't reflect true properties of the data. Smoothing these 1467 | out by using a function like a Gaussian at each point gives a much more 1468 | accurate idea of the shape of the distribution, with much less variance 1469 | (it changes much less in response to sampling differences). 1470 | 1471 | ### Kernel density estimation in practice 1472 | 1473 | The free KDE parameters are the kernel, specifying the distribution 1474 | shape placed at each point, and the kernel bandwidth, which controls the 1475 | size of the kernel at each point. There are many kernels you might use 1476 | for a KDE. 1477 | 1478 | #### Selecting bandwidth via cross-validation 1479 | 1480 | The bandwidth of the KDE is important for finding a suitable density estimate, 1481 | and controls the bias-variance trade-off: too narrow a bandwidth leads 1482 | to a high-variance estimate (overfitting), where the presence or absence 1483 | of a single point makes a large difference; too wide a bandwidth leads 1484 | to a high-bias estimate (underfitting) where the structure in the data 1485 | is washed out by the wide kernel. 1486 | 1487 | Can find this via cross-validation. 1488 | 1489 | ### KDE uses 1490 | 1491 | #### Visualisation 1492 | 1493 | Instead of plotting individual points representing individual 1494 | observations on a map, may plot the kernel density to give a clearer 1495 | picture. 1496 | 1497 | #### Bayesian generative classification 1498 | 1499 | Can do Bayesian classification while removing the "naive" part, by 1500 | using a more sophisticated generative model for each class. 1501 | 1502 | The approach is generally: 1503 | 1504 | 1. Split the training data by label. 1505 | 2. For each set, fit a KDE to give a model of the data. This allows you, 1506 | for an observation x and label y, to compute a likelihood P(x|y). 1507 | 3. From the examples of each class in the training set, compute the 1508 | class prior P(y). 1509 | 4. For an unknown point x, the posterior probability for each class is 1510 | P(y|x) ∝ P(x|y)P(y); the class maximising this posterior is the label 1511 | assigned to the point. 1512 | -------------------------------------------------------------------------------- /LICENSE-TEXT: -------------------------------------------------------------------------------- 1 | Creative Commons Legal Code 2 | 3 | Attribution-NonCommercial-NoDerivs 3.0 Unported 4 | 5 | CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE 6 | LEGAL SERVICES. DISTRIBUTION OF THIS LICENSE DOES NOT CREATE AN 7 | ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS 8 | INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES 9 | REGARDING THE INFORMATION PROVIDED, AND DISCLAIMS LIABILITY FOR 10 | DAMAGES RESULTING FROM ITS USE. 11 | 12 | License 13 | 14 | THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE 15 | COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). THE WORK IS PROTECTED BY 16 | COPYRIGHT AND/OR OTHER APPLICABLE LAW. ANY USE OF THE WORK OTHER THAN AS 17 | AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT LAW IS PROHIBITED. 18 | 19 | BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT AND AGREE 20 | TO BE BOUND BY THE TERMS OF THIS LICENSE. TO THE EXTENT THIS LICENSE MAY 21 | BE CONSIDERED TO BE A CONTRACT, THE LICENSOR GRANTS YOU THE RIGHTS 22 | CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH TERMS AND 23 | CONDITIONS.
24 | 25 | 1. Definitions 26 | 27 | a. "Adaptation" means a work based upon the Work, or upon the Work and 28 | other pre-existing works, such as a translation, adaptation, 29 | derivative work, arrangement of music or other alterations of a 30 | literary or artistic work, or phonogram or performance and includes 31 | cinematographic adaptations or any other form in which the Work may be 32 | recast, transformed, or adapted including in any form recognizably 33 | derived from the original, except that a work that constitutes a 34 | Collection will not be considered an Adaptation for the purpose of 35 | this License. For the avoidance of doubt, where the Work is a musical 36 | work, performance or phonogram, the synchronization of the Work in 37 | timed-relation with a moving image ("synching") will be considered an 38 | Adaptation for the purpose of this License. 39 | b. "Collection" means a collection of literary or artistic works, such as 40 | encyclopedias and anthologies, or performances, phonograms or 41 | broadcasts, or other works or subject matter other than works listed 42 | in Section 1(f) below, which, by reason of the selection and 43 | arrangement of their contents, constitute intellectual creations, in 44 | which the Work is included in its entirety in unmodified form along 45 | with one or more other contributions, each constituting separate and 46 | independent works in themselves, which together are assembled into a 47 | collective whole. A work that constitutes a Collection will not be 48 | considered an Adaptation (as defined above) for the purposes of this 49 | License. 50 | c. "Distribute" means to make available to the public the original and 51 | copies of the Work through sale or other transfer of ownership. 52 | d. "Licensor" means the individual, individuals, entity or entities that 53 | offer(s) the Work under the terms of this License. 54 | e. "Original Author" means, in the case of a literary or artistic work, 55 | the individual, individuals, entity or entities who created the Work 56 | or if no individual or entity can be identified, the publisher; and in 57 | addition (i) in the case of a performance the actors, singers, 58 | musicians, dancers, and other persons who act, sing, deliver, declaim, 59 | play in, interpret or otherwise perform literary or artistic works or 60 | expressions of folklore; (ii) in the case of a phonogram the producer 61 | being the person or legal entity who first fixes the sounds of a 62 | performance or other sounds; and, (iii) in the case of broadcasts, the 63 | organization that transmits the broadcast. 64 | f. 
"Work" means the literary and/or artistic work offered under the terms 65 | of this License including without limitation any production in the 66 | literary, scientific and artistic domain, whatever may be the mode or 67 | form of its expression including digital form, such as a book, 68 | pamphlet and other writing; a lecture, address, sermon or other work 69 | of the same nature; a dramatic or dramatico-musical work; a 70 | choreographic work or entertainment in dumb show; a musical 71 | composition with or without words; a cinematographic work to which are 72 | assimilated works expressed by a process analogous to cinematography; 73 | a work of drawing, painting, architecture, sculpture, engraving or 74 | lithography; a photographic work to which are assimilated works 75 | expressed by a process analogous to photography; a work of applied 76 | art; an illustration, map, plan, sketch or three-dimensional work 77 | relative to geography, topography, architecture or science; a 78 | performance; a broadcast; a phonogram; a compilation of data to the 79 | extent it is protected as a copyrightable work; or a work performed by 80 | a variety or circus performer to the extent it is not otherwise 81 | considered a literary or artistic work. 82 | g. "You" means an individual or entity exercising rights under this 83 | License who has not previously violated the terms of this License with 84 | respect to the Work, or who has received express permission from the 85 | Licensor to exercise rights under this License despite a previous 86 | violation. 87 | h. "Publicly Perform" means to perform public recitations of the Work and 88 | to communicate to the public those public recitations, by any means or 89 | process, including by wire or wireless means or public digital 90 | performances; to make available to the public Works in such a way that 91 | members of the public may access these Works from a place and at a 92 | place individually chosen by them; to perform the Work to the public 93 | by any means or process and the communication to the public of the 94 | performances of the Work, including by public digital performance; to 95 | broadcast and rebroadcast the Work by any means including signs, 96 | sounds or images. 97 | i. "Reproduce" means to make copies of the Work by any means including 98 | without limitation by sound or visual recordings and the right of 99 | fixation and reproducing fixations of the Work, including storage of a 100 | protected performance or phonogram in digital form or other electronic 101 | medium. 102 | 103 | 2. Fair Dealing Rights. Nothing in this License is intended to reduce, 104 | limit, or restrict any uses free from copyright or rights arising from 105 | limitations or exceptions that are provided for in connection with the 106 | copyright protection under copyright law or other applicable laws. 107 | 108 | 3. License Grant. Subject to the terms and conditions of this License, 109 | Licensor hereby grants You a worldwide, royalty-free, non-exclusive, 110 | perpetual (for the duration of the applicable copyright) license to 111 | exercise the rights in the Work as stated below: 112 | 113 | a. to Reproduce the Work, to incorporate the Work into one or more 114 | Collections, and to Reproduce the Work as incorporated in the 115 | Collections; and, 116 | b. to Distribute and Publicly Perform the Work including as incorporated 117 | in Collections. 118 | 119 | The above rights may be exercised in all media and formats whether now 120 | known or hereafter devised. 
The above rights include the right to make 121 | such modifications as are technically necessary to exercise the rights in 122 | other media and formats, but otherwise you have no rights to make 123 | Adaptations. Subject to 8(f), all rights not expressly granted by Licensor 124 | are hereby reserved, including but not limited to the rights set forth in 125 | Section 4(d). 126 | 127 | 4. Restrictions. The license granted in Section 3 above is expressly made 128 | subject to and limited by the following restrictions: 129 | 130 | a. You may Distribute or Publicly Perform the Work only under the terms 131 | of this License. You must include a copy of, or the Uniform Resource 132 | Identifier (URI) for, this License with every copy of the Work You 133 | Distribute or Publicly Perform. You may not offer or impose any terms 134 | on the Work that restrict the terms of this License or the ability of 135 | the recipient of the Work to exercise the rights granted to that 136 | recipient under the terms of the License. You may not sublicense the 137 | Work. You must keep intact all notices that refer to this License and 138 | to the disclaimer of warranties with every copy of the Work You 139 | Distribute or Publicly Perform. When You Distribute or Publicly 140 | Perform the Work, You may not impose any effective technological 141 | measures on the Work that restrict the ability of a recipient of the 142 | Work from You to exercise the rights granted to that recipient under 143 | the terms of the License. This Section 4(a) applies to the Work as 144 | incorporated in a Collection, but this does not require the Collection 145 | apart from the Work itself to be made subject to the terms of this 146 | License. If You create a Collection, upon notice from any Licensor You 147 | must, to the extent practicable, remove from the Collection any credit 148 | as required by Section 4(c), as requested. 149 | b. You may not exercise any of the rights granted to You in Section 3 150 | above in any manner that is primarily intended for or directed toward 151 | commercial advantage or private monetary compensation. The exchange of 152 | the Work for other copyrighted works by means of digital file-sharing 153 | or otherwise shall not be considered to be intended for or directed 154 | toward commercial advantage or private monetary compensation, provided 155 | there is no payment of any monetary compensation in connection with 156 | the exchange of copyrighted works. 157 | c. If You Distribute, or Publicly Perform the Work or Collections, You 158 | must, unless a request has been made pursuant to Section 4(a), keep 159 | intact all copyright notices for the Work and provide, reasonable to 160 | the medium or means You are utilizing: (i) the name of the Original 161 | Author (or pseudonym, if applicable) if supplied, and/or if the 162 | Original Author and/or Licensor designate another party or parties 163 | (e.g., a sponsor institute, publishing entity, journal) for 164 | attribution ("Attribution Parties") in Licensor's copyright notice, 165 | terms of service or by other reasonable means, the name of such party 166 | or parties; (ii) the title of the Work if supplied; (iii) to the 167 | extent reasonably practicable, the URI, if any, that Licensor 168 | specifies to be associated with the Work, unless such URI does not 169 | refer to the copyright notice or licensing information for the Work. 
170 | The credit required by this Section 4(c) may be implemented in any 171 | reasonable manner; provided, however, that in the case of a 172 | Collection, at a minimum such credit will appear, if a credit for all 173 | contributing authors of Collection appears, then as part of these 174 | credits and in a manner at least as prominent as the credits for the 175 | other contributing authors. For the avoidance of doubt, You may only 176 | use the credit required by this Section for the purpose of attribution 177 | in the manner set out above and, by exercising Your rights under this 178 | License, You may not implicitly or explicitly assert or imply any 179 | connection with, sponsorship or endorsement by the Original Author, 180 | Licensor and/or Attribution Parties, as appropriate, of You or Your 181 | use of the Work, without the separate, express prior written 182 | permission of the Original Author, Licensor and/or Attribution 183 | Parties. 184 | d. For the avoidance of doubt: 185 | 186 | i. Non-waivable Compulsory License Schemes. In those jurisdictions in 187 | which the right to collect royalties through any statutory or 188 | compulsory licensing scheme cannot be waived, the Licensor 189 | reserves the exclusive right to collect such royalties for any 190 | exercise by You of the rights granted under this License; 191 | ii. Waivable Compulsory License Schemes. In those jurisdictions in 192 | which the right to collect royalties through any statutory or 193 | compulsory licensing scheme can be waived, the Licensor reserves 194 | the exclusive right to collect such royalties for any exercise by 195 | You of the rights granted under this License if Your exercise of 196 | such rights is for a purpose or use which is otherwise than 197 | noncommercial as permitted under Section 4(b) and otherwise waives 198 | the right to collect royalties through any statutory or compulsory 199 | licensing scheme; and, 200 | iii. Voluntary License Schemes. The Licensor reserves the right to 201 | collect royalties, whether individually or, in the event that the 202 | Licensor is a member of a collecting society that administers 203 | voluntary licensing schemes, via that society, from any exercise 204 | by You of the rights granted under this License that is for a 205 | purpose or use which is otherwise than noncommercial as permitted 206 | under Section 4(b). 207 | e. Except as otherwise agreed in writing by the Licensor or as may be 208 | otherwise permitted by applicable law, if You Reproduce, Distribute or 209 | Publicly Perform the Work either by itself or as part of any 210 | Collections, You must not distort, mutilate, modify or take other 211 | derogatory action in relation to the Work which would be prejudicial 212 | to the Original Author's honor or reputation. 213 | 214 | 5. Representations, Warranties and Disclaimer 215 | 216 | UNLESS OTHERWISE MUTUALLY AGREED BY THE PARTIES IN WRITING, LICENSOR 217 | OFFERS THE WORK AS-IS AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY 218 | KIND CONCERNING THE WORK, EXPRESS, IMPLIED, STATUTORY OR OTHERWISE, 219 | INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTIBILITY, 220 | FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF 221 | LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OF ABSENCE OF ERRORS, 222 | WHETHER OR NOT DISCOVERABLE. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION 223 | OF IMPLIED WARRANTIES, SO SUCH EXCLUSION MAY NOT APPLY TO YOU. 224 | 225 | 6. Limitation on Liability. 
EXCEPT TO THE EXTENT REQUIRED BY APPLICABLE 226 | LAW, IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY FOR 227 | ANY SPECIAL, INCIDENTAL, CONSEQUENTIAL, PUNITIVE OR EXEMPLARY DAMAGES 228 | ARISING OUT OF THIS LICENSE OR THE USE OF THE WORK, EVEN IF LICENSOR HAS 229 | BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 230 | 231 | 7. Termination 232 | 233 | a. This License and the rights granted hereunder will terminate 234 | automatically upon any breach by You of the terms of this License. 235 | Individuals or entities who have received Collections from You under 236 | this License, however, will not have their licenses terminated 237 | provided such individuals or entities remain in full compliance with 238 | those licenses. Sections 1, 2, 5, 6, 7, and 8 will survive any 239 | termination of this License. 240 | b. Subject to the above terms and conditions, the license granted here is 241 | perpetual (for the duration of the applicable copyright in the Work). 242 | Notwithstanding the above, Licensor reserves the right to release the 243 | Work under different license terms or to stop distributing the Work at 244 | any time; provided, however that any such election will not serve to 245 | withdraw this License (or any other license that has been, or is 246 | required to be, granted under the terms of this License), and this 247 | License will continue in full force and effect unless terminated as 248 | stated above. 249 | 250 | 8. Miscellaneous 251 | 252 | a. Each time You Distribute or Publicly Perform the Work or a Collection, 253 | the Licensor offers to the recipient a license to the Work on the same 254 | terms and conditions as the license granted to You under this License. 255 | b. If any provision of this License is invalid or unenforceable under 256 | applicable law, it shall not affect the validity or enforceability of 257 | the remainder of the terms of this License, and without further action 258 | by the parties to this agreement, such provision shall be reformed to 259 | the minimum extent necessary to make such provision valid and 260 | enforceable. 261 | c. No term or provision of this License shall be deemed waived and no 262 | breach consented to unless such waiver or consent shall be in writing 263 | and signed by the party to be charged with such waiver or consent. 264 | d. This License constitutes the entire agreement between the parties with 265 | respect to the Work licensed here. There are no understandings, 266 | agreements or representations with respect to the Work not specified 267 | here. Licensor shall not be bound by any additional provisions that 268 | may appear in any communication from You. This License may not be 269 | modified without the mutual written agreement of the Licensor and You. 270 | e. The rights granted under, and the subject matter referenced, in this 271 | License were drafted utilizing the terminology of the Berne Convention 272 | for the Protection of Literary and Artistic Works (as amended on 273 | September 28, 1979), the Rome Convention of 1961, the WIPO Copyright 274 | Treaty of 1996, the WIPO Performances and Phonograms Treaty of 1996 275 | and the Universal Copyright Convention (as revised on July 24, 1971). 276 | These rights and subject matter take effect in the relevant 277 | jurisdiction in which the License terms are sought to be enforced 278 | according to the corresponding provisions of the implementation of 279 | those treaty provisions in the applicable national law. 
If the 280 | standard suite of rights granted under applicable copyright law 281 | includes additional rights not granted under this License, such 282 | additional rights are deemed to be included in the License; this 283 | License is not intended to restrict the license of any rights under 284 | applicable law. 285 | 286 | 287 | Creative Commons Notice 288 | 289 | Creative Commons is not a party to this License, and makes no warranty 290 | whatsoever in connection with the Work. Creative Commons will not be 291 | liable to You or any party on any legal theory for any damages 292 | whatsoever, including without limitation any general, special, 293 | incidental or consequential damages arising in connection to this 294 | license. Notwithstanding the foregoing two (2) sentences, if Creative 295 | Commons has expressly identified itself as the Licensor hereunder, it 296 | shall have all rights and obligations of Licensor. 297 | 298 | Except for the limited purpose of indicating to the public that the 299 | Work is licensed under the CCPL, Creative Commons does not authorize 300 | the use by either party of the trademark "Creative Commons" or any 301 | related trademark or logo of Creative Commons without the prior 302 | written consent of Creative Commons. Any permitted use will be in 303 | compliance with Creative Commons' then-current trademark usage 304 | guidelines, as may be published on its website or otherwise made 305 | available upon request from time to time. For the avoidance of doubt, 306 | this trademark restriction does not form part of this License. 307 | 308 | Creative Commons may be contacted at https://creativecommons.org/. 309 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Python Data Science Handbook notes 2 | 3 | ## Details 4 | 5 | The book by Jake VanderPlas is freely available at 6 | [GitHub](https://github.com/jakevdp/PythonDataScienceHandbook). 7 | 8 | These are my notes made while reading, either abbreviated or with parts 9 | taken verbatim. 10 | 11 | Therefore, these notes are made available under the same 12 | [CC-BY-NC-ND 3.0 license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode). 13 | 14 | ## The chapter notes 15 | 16 | [Chapter 1: IPython](Chapter_1_IPython.md) 17 | [Chapter 2: NumPy](Chapter_2_NumPy.md) 18 | [Chapter 3: pandas](Chapter_3_pandas.md) 19 | [Chapter 4: Matplotlib](Chapter_4_Matplotlib.md) 20 | [Chapter 5: Machine Learning](Chapter_5_Machine_Learning.md) 21 | --------------------------------------------------------------------------------