├── data ├── sample.hdf5 └── iris.csv ├── images ├── star.png ├── blaze_med.png ├── conversions.png └── star.dot ├── README.md ├── 06-Blaze-with-into.ipynb ├── 03-into-Design.ipynb ├── 05-Blaze-with-SQL.ipynb ├── 04-Blaze-Introduction.ipynb ├── 01-into-Introduction.ipynb ├── 00-Motivation.ipynb └── 02-into-Datatypes.ipynb /data/sample.hdf5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blaze/blaze-tutorial/HEAD/data/sample.hdf5 -------------------------------------------------------------------------------- /images/star.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blaze/blaze-tutorial/HEAD/images/star.png -------------------------------------------------------------------------------- /images/blaze_med.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blaze/blaze-tutorial/HEAD/images/blaze_med.png -------------------------------------------------------------------------------- /images/conversions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blaze/blaze-tutorial/HEAD/images/conversions.png -------------------------------------------------------------------------------- /images/star.dot: -------------------------------------------------------------------------------- 1 | graph { 2 | overlap=False; 3 | splines=True; 4 | A -- Central 5 | B -- Central 6 | C -- Central 7 | D -- Central 8 | E -- Central 9 | F -- Central 10 | G -- Central 11 | H -- Central 12 | I -- Central 13 | J -- Central 14 | K -- Central 15 | } 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Blaze Tutorial 2 | ============== 3 | 4 | This repository contains notebooks and data for a 
tutorial for 5 | [Blaze](http://blaze.pydata.org), a library to compute on foreign data from 6 | within Python. 7 | 8 | This is a work in progress. 9 | 10 | Outline 11 | ------- 12 | 13 | 0. [Motivation](00-Motivation.ipynb) - ([nbviewer](http://nbviewer.ipython.org/github/mrocklin/blaze-tutorial/blob/master/00-Motivation.ipynb)) 14 | 15 | ### Into 16 | 17 | We present most Blaze fundamentals while discussing the simpler topic of data 18 | migration using the `into` project. 19 | 20 | 1. [Basics](01-into-Introduction.ipynb) - ([nbviewer](http://nbviewer.ipython.org/github/mrocklin/blaze-tutorial/blob/master/01-into-Introduction.ipynb)) 21 | 2. [Datatypes](02-into-Datatypes.ipynb) - ([nbviewer](http://nbviewer.ipython.org/github/mrocklin/blaze-tutorial/blob/master/02-into-Datatypes.ipynb)) 22 | 3. [Internal Design](03-into-Design.ipynb) - ([nbviewer](http://nbviewer.ipython.org/github/mrocklin/blaze-tutorial/blob/master/03-into-Design.ipynb)) 23 | 24 | 25 | ### Blaze Queries 26 | 27 | 1. [Basics](04-Blaze-Introduction.ipynb) - ([nbviewer](http://nbviewer.ipython.org/github/mrocklin/blaze-tutorial/blob/master/04-Blaze-Introduction.ipynb)) 28 | 2. [Databases](05-Blaze-with-SQL.ipynb) - ([nbviewer](http://nbviewer.ipython.org/github/mrocklin/blaze-tutorial/blob/master/05-Blaze-with-SQL.ipynb)) 29 | 3. 
[Storing Results](06-Blaze-with-into.ipynb) - ([nbviewer](http://nbviewer.ipython.org/github/mrocklin/blaze-tutorial/blob/master/06-Blaze-with-into.ipynb)) 30 | 31 | 32 | Downloads 33 | --------- 34 | 35 | This tutorial depends on the 36 | [Lahman Baseball statistics database](https://github.com/jknecht/baseball-archive-sqlite/raw/master/lahman2013.sqlite) 37 | -------------------------------------------------------------------------------- /06-Blaze-with-into.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:96746f6bc03e6fec9c102fd49c4dffc99ad3edefff7266618321c39d3d020abb" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "\"Blaze\n", 18 | "\n", 19 | "# Getting Started with Blaze\n", 20 | "\n", 21 | "* Full tutorial available at http://github.com/ContinuumIO/blaze-tutorial\n", 22 | "* Install software with `conda install -c blaze blaze`" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## 3. Storing results with `into`\n", 30 | "\n", 31 | "We just played with some interesting queries on baseball statistics" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "collapsed": false, 37 | "input": [ 38 | "from blaze import Data, into, by, join\n", 39 | "db = Data('sqlite:///data/lahman2013.sqlite')\n", 40 | "joined = join(db.Salaries, db.Teams)\n", 41 | "result = by(joined[['name', 'yearID']], avg=joined.salary.mean())\n", 42 | "result" 43 | ], 44 | "language": "python", 45 | "metadata": {}, 46 | "outputs": [] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "How do we now store this result or use it with other libraries?\n", 53 | "\n", 54 | "The result itself is a Blaze expression, not terribly useful if we're not using Blaze." 
55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "collapsed": false, 60 | "input": [ 61 | "type(result)" 62 | ], 63 | "language": "python", 64 | "metadata": {}, 65 | "outputs": [] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "### Use normal Python collections like `list` or `np.array`\n", 72 | "\n", 73 | "Blaze follows normal conventions and so can be converted by standard constructors" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "collapsed": false, 79 | "input": [ 80 | "list(result)" 81 | ], 82 | "language": "python", 83 | "metadata": {}, 84 | "outputs": [] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "collapsed": false, 89 | "input": [ 90 | "import numpy as np\n", 91 | "np.array(result)" 92 | ], 93 | "language": "python", 94 | "metadata": {}, 95 | "outputs": [] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "### Use `into`\n", 102 | "\n", 103 | "Alternatively, Blaze has registered itself with the `into` project and so can migrate its results to any of those formats." 
104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "collapsed": false, 109 | "input": [ 110 | "into('salaries.csv', result)" 111 | ], 112 | "language": "python", 113 | "metadata": {}, 114 | "outputs": [] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "collapsed": false, 119 | "input": [ 120 | "!head salaries.csv" 121 | ], 122 | "language": "python", 123 | "metadata": {}, 124 | "outputs": [] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "#### Exercise\n", 131 | "\n", 132 | "Dump `result` into the following formats" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "collapsed": false, 138 | "input": [ 139 | "# Dump result into a Python set\n" 140 | ], 141 | "language": "python", 142 | "metadata": {}, 143 | "outputs": [] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "collapsed": false, 148 | "input": [ 149 | "# Dump result into a Pandas DataFrame\n" 150 | ], 151 | "language": "python", 152 | "metadata": {}, 153 | "outputs": [] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "collapsed": false, 158 | "input": [ 159 | "# Dump result into a JSON file, inspect the file to make sure that it came out ok\n" 160 | ], 161 | "language": "python", 162 | "metadata": {}, 163 | "outputs": [] 164 | } 165 | ], 166 | "metadata": {} 167 | } 168 | ] 169 | } -------------------------------------------------------------------------------- /data/iris.csv: -------------------------------------------------------------------------------- 1 | sepal_length,sepal_width,petal_length,petal_width,species 2 | 5.1,3.5,1.4,0.2,Iris-setosa 3 | 4.9,3.0,1.4,0.2,Iris-setosa 4 | 4.7,3.2,1.3,0.2,Iris-setosa 5 | 4.6,3.1,1.5,0.2,Iris-setosa 6 | 5.0,3.6,1.4,0.2,Iris-setosa 7 | 5.4,3.9,1.7,0.4,Iris-setosa 8 | 4.6,3.4,1.4,0.3,Iris-setosa 9 | 5.0,3.4,1.5,0.2,Iris-setosa 10 | 4.4,2.9,1.4,0.2,Iris-setosa 11 | 4.9,3.1,1.5,0.1,Iris-setosa 12 | 5.4,3.7,1.5,0.2,Iris-setosa 13 | 4.8,3.4,1.6,0.2,Iris-setosa 14 | 4.8,3.0,1.4,0.1,Iris-setosa 15 | 
4.3,3.0,1.1,0.1,Iris-setosa 16 | 5.8,4.0,1.2,0.2,Iris-setosa 17 | 5.7,4.4,1.5,0.4,Iris-setosa 18 | 5.4,3.9,1.3,0.4,Iris-setosa 19 | 5.1,3.5,1.4,0.3,Iris-setosa 20 | 5.7,3.8,1.7,0.3,Iris-setosa 21 | 5.1,3.8,1.5,0.3,Iris-setosa 22 | 5.4,3.4,1.7,0.2,Iris-setosa 23 | 5.1,3.7,1.5,0.4,Iris-setosa 24 | 4.6,3.6,1.0,0.2,Iris-setosa 25 | 5.1,3.3,1.7,0.5,Iris-setosa 26 | 4.8,3.4,1.9,0.2,Iris-setosa 27 | 5.0,3.0,1.6,0.2,Iris-setosa 28 | 5.0,3.4,1.6,0.4,Iris-setosa 29 | 5.2,3.5,1.5,0.2,Iris-setosa 30 | 5.2,3.4,1.4,0.2,Iris-setosa 31 | 4.7,3.2,1.6,0.2,Iris-setosa 32 | 4.8,3.1,1.6,0.2,Iris-setosa 33 | 5.4,3.4,1.5,0.4,Iris-setosa 34 | 5.2,4.1,1.5,0.1,Iris-setosa 35 | 5.5,4.2,1.4,0.2,Iris-setosa 36 | 4.9,3.1,1.5,0.2,Iris-setosa 37 | 5.0,3.2,1.2,0.2,Iris-setosa 38 | 5.5,3.5,1.3,0.2,Iris-setosa 39 | 4.9,3.6,1.4,0.1,Iris-setosa 40 | 4.4,3.0,1.3,0.2,Iris-setosa 41 | 5.1,3.4,1.5,0.2,Iris-setosa 42 | 5.0,3.5,1.3,0.3,Iris-setosa 43 | 4.5,2.3,1.3,0.3,Iris-setosa 44 | 4.4,3.2,1.3,0.2,Iris-setosa 45 | 5.0,3.5,1.6,0.6,Iris-setosa 46 | 5.1,3.8,1.9,0.4,Iris-setosa 47 | 4.8,3.0,1.4,0.3,Iris-setosa 48 | 5.1,3.8,1.6,0.2,Iris-setosa 49 | 4.6,3.2,1.4,0.2,Iris-setosa 50 | 5.3,3.7,1.5,0.2,Iris-setosa 51 | 5.0,3.3,1.4,0.2,Iris-setosa 52 | 7.0,3.2,4.7,1.4,Iris-versicolor 53 | 6.4,3.2,4.5,1.5,Iris-versicolor 54 | 6.9,3.1,4.9,1.5,Iris-versicolor 55 | 5.5,2.3,4.0,1.3,Iris-versicolor 56 | 6.5,2.8,4.6,1.5,Iris-versicolor 57 | 5.7,2.8,4.5,1.3,Iris-versicolor 58 | 6.3,3.3,4.7,1.6,Iris-versicolor 59 | 4.9,2.4,3.3,1.0,Iris-versicolor 60 | 6.6,2.9,4.6,1.3,Iris-versicolor 61 | 5.2,2.7,3.9,1.4,Iris-versicolor 62 | 5.0,2.0,3.5,1.0,Iris-versicolor 63 | 5.9,3.0,4.2,1.5,Iris-versicolor 64 | 6.0,2.2,4.0,1.0,Iris-versicolor 65 | 6.1,2.9,4.7,1.4,Iris-versicolor 66 | 5.6,2.9,3.6,1.3,Iris-versicolor 67 | 6.7,3.1,4.4,1.4,Iris-versicolor 68 | 5.6,3.0,4.5,1.5,Iris-versicolor 69 | 5.8,2.7,4.1,1.0,Iris-versicolor 70 | 6.2,2.2,4.5,1.5,Iris-versicolor 71 | 5.6,2.5,3.9,1.1,Iris-versicolor 72 | 5.9,3.2,4.8,1.8,Iris-versicolor 73 | 
6.1,2.8,4.0,1.3,Iris-versicolor 74 | 6.3,2.5,4.9,1.5,Iris-versicolor 75 | 6.1,2.8,4.7,1.2,Iris-versicolor 76 | 6.4,2.9,4.3,1.3,Iris-versicolor 77 | 6.6,3.0,4.4,1.4,Iris-versicolor 78 | 6.8,2.8,4.8,1.4,Iris-versicolor 79 | 6.7,3.0,5.0,1.7,Iris-versicolor 80 | 6.0,2.9,4.5,1.5,Iris-versicolor 81 | 5.7,2.6,3.5,1.0,Iris-versicolor 82 | 5.5,2.4,3.8,1.1,Iris-versicolor 83 | 5.5,2.4,3.7,1.0,Iris-versicolor 84 | 5.8,2.7,3.9,1.2,Iris-versicolor 85 | 6.0,2.7,5.1,1.6,Iris-versicolor 86 | 5.4,3.0,4.5,1.5,Iris-versicolor 87 | 6.0,3.4,4.5,1.6,Iris-versicolor 88 | 6.7,3.1,4.7,1.5,Iris-versicolor 89 | 6.3,2.3,4.4,1.3,Iris-versicolor 90 | 5.6,3.0,4.1,1.3,Iris-versicolor 91 | 5.5,2.5,4.0,1.3,Iris-versicolor 92 | 5.5,2.6,4.4,1.2,Iris-versicolor 93 | 6.1,3.0,4.6,1.4,Iris-versicolor 94 | 5.8,2.6,4.0,1.2,Iris-versicolor 95 | 5.0,2.3,3.3,1.0,Iris-versicolor 96 | 5.6,2.7,4.2,1.3,Iris-versicolor 97 | 5.7,3.0,4.2,1.2,Iris-versicolor 98 | 5.7,2.9,4.2,1.3,Iris-versicolor 99 | 6.2,2.9,4.3,1.3,Iris-versicolor 100 | 5.1,2.5,3.0,1.1,Iris-versicolor 101 | 5.7,2.8,4.1,1.3,Iris-versicolor 102 | 6.3,3.3,6.0,2.5,Iris-virginica 103 | 5.8,2.7,5.1,1.9,Iris-virginica 104 | 7.1,3.0,5.9,2.1,Iris-virginica 105 | 6.3,2.9,5.6,1.8,Iris-virginica 106 | 6.5,3.0,5.8,2.2,Iris-virginica 107 | 7.6,3.0,6.6,2.1,Iris-virginica 108 | 4.9,2.5,4.5,1.7,Iris-virginica 109 | 7.3,2.9,6.3,1.8,Iris-virginica 110 | 6.7,2.5,5.8,1.8,Iris-virginica 111 | 7.2,3.6,6.1,2.5,Iris-virginica 112 | 6.5,3.2,5.1,2.0,Iris-virginica 113 | 6.4,2.7,5.3,1.9,Iris-virginica 114 | 6.8,3.0,5.5,2.1,Iris-virginica 115 | 5.7,2.5,5.0,2.0,Iris-virginica 116 | 5.8,2.8,5.1,2.4,Iris-virginica 117 | 6.4,3.2,5.3,2.3,Iris-virginica 118 | 6.5,3.0,5.5,1.8,Iris-virginica 119 | 7.7,3.8,6.7,2.2,Iris-virginica 120 | 7.7,2.6,6.9,2.3,Iris-virginica 121 | 6.0,2.2,5.0,1.5,Iris-virginica 122 | 6.9,3.2,5.7,2.3,Iris-virginica 123 | 5.6,2.8,4.9,2.0,Iris-virginica 124 | 7.7,2.8,6.7,2.0,Iris-virginica 125 | 6.3,2.7,4.9,1.8,Iris-virginica 126 | 6.7,3.3,5.7,2.1,Iris-virginica 127 
| 7.2,3.2,6.0,1.8,Iris-virginica 128 | 6.2,2.8,4.8,1.8,Iris-virginica 129 | 6.1,3.0,4.9,1.8,Iris-virginica 130 | 6.4,2.8,5.6,2.1,Iris-virginica 131 | 7.2,3.0,5.8,1.6,Iris-virginica 132 | 7.4,2.8,6.1,1.9,Iris-virginica 133 | 7.9,3.8,6.4,2.0,Iris-virginica 134 | 6.4,2.8,5.6,2.2,Iris-virginica 135 | 6.3,2.8,5.1,1.5,Iris-virginica 136 | 6.1,2.6,5.6,1.4,Iris-virginica 137 | 7.7,3.0,6.1,2.3,Iris-virginica 138 | 6.3,3.4,5.6,2.4,Iris-virginica 139 | 6.4,3.1,5.5,1.8,Iris-virginica 140 | 6.0,3.0,4.8,1.8,Iris-virginica 141 | 6.9,3.1,5.4,2.1,Iris-virginica 142 | 6.7,3.1,5.6,2.4,Iris-virginica 143 | 6.9,3.1,5.1,2.3,Iris-virginica 144 | 5.8,2.7,5.1,1.9,Iris-virginica 145 | 6.8,3.2,5.9,2.3,Iris-virginica 146 | 6.7,3.3,5.7,2.5,Iris-virginica 147 | 6.7,3.0,5.2,2.3,Iris-virginica 148 | 6.3,2.5,5.0,1.9,Iris-virginica 149 | 6.5,3.0,5.2,2.0,Iris-virginica 150 | 6.2,3.4,5.4,2.3,Iris-virginica 151 | 5.9,3.0,5.1,1.8,Iris-virginica 152 | -------------------------------------------------------------------------------- /03-into-Design.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:ecd1889f30dd086fa191899741e3d2e3f75ef8a3fb15f463c88c0def7e579e3f" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "\"Blaze\n", 18 | "\n", 19 | "# Getting started with `into`\n", 20 | "\n", 21 | "* Full tutorial available at http://github.com/ContinuumIO/blaze-tutorial\n", 22 | "* Install software with `conda install -c blaze blaze`" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## 3 Design\n", 30 | "\n", 31 | "Into is a network of pair-wise conversions\n", 32 | "\n", 33 | "\"into\n" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "Nodes represent data formats. 
Arrows represent functions that migrate data from one format to another. Red nodes are possibly larger than memory.\n", 41 | "\n", 42 | "This differs from most data migration systems, which always migrate data through a common format.\n", 43 | "\n", 44 | "![](images/star.png)" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "This common strategy is simpler in design and easier to get right, so why does `into` use a more complex design?" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "### Some edges are very fast\n", 59 | "\n", 60 | "For example, transforming an `np.ndarray` to a `pd.Series` or a `list` to an `Iterator` only requires us to manipulate metadata, which can be done quickly; the bytes of data remain untouched. In many cases, transferring data through a common format can be much slower.\n", 61 | "\n", 62 | "For example, consider `CSV -> SQL` migration. Using Python iterators as a common central format, we're bound by SQLAlchemy's insertion code, which maxes out at around 2000 records per second. Using CSV loaders native to the database (e.g. PostgreSQL Copy), we achieve more than 50000 records per second, turning an overnight task into 20 minutes.\n", 63 | "\n", 64 | "Efficient data migration *is intrinsically messy* in practice. The `into` graph reflects this mess." 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "### How does `into` use this graph?\n", 72 | "\n", 73 | "When you convert one dataset into another, Into walks the graph above and finds the minimum-cost path. Each edge corresponds to a single function, and each edge is weighted according to relative cost. For example, if we transform a CSV file into a `set`, we can see the stages through which our data moves." 
74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "collapsed": false, 79 | "input": [ 80 | "from into import into, convert, append, CSV\n", 81 | "path = convert.path(CSV, set)\n", 82 | "\n", 83 | "for source, target, func in path:\n", 84 | " print '%25s -> %-25s via %s()' % (source.__name__, target.__name__, func.__name__)" 85 | ], 86 | "language": "python", 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "output_type": "stream", 91 | "stream": "stdout", 92 | "text": [ 93 | " CSV -> Chunks_pandas_DataFrame via CSV_to_chunks_of_dataframes()\n", 94 | " Chunks_pandas_DataFrame -> Iterator via numpy_chunks_to_iterator()\n", 95 | " Iterator -> list via iterator_to_list()\n", 96 | " list -> set via iterable_to_set()\n" 97 | ] 98 | } 99 | ], 100 | "prompt_number": 1 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "### Chunks of in-memory data\n", 107 | "\n", 108 | "The red nodes in our graph represent data that might be larger than memory. When we migrate between two red nodes, we restrict ourselves to the subgraph of red nodes so that we never exhaust memory.\n", 109 | "\n", 110 | "Yet we still want to use NumPy and Pandas in these migrations (they're very helpful intermediaries), so we partition our data into a sequence of chunks such that each chunk fits in memory. We describe this data with parametrized types like `chunks(np.ndarray)` or `chunks(pd.DataFrame)`." 
111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "### `into` = `convert` + `append`\n", 118 | "\n", 119 | "Recall the two modes of `into`" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "collapsed": false, 125 | "input": [ 126 | "# Given type: Convert source to new object\n", 127 | "into(list, (1, 2, 3))" 128 | ], 129 | "language": "python", 130 | "metadata": {}, 131 | "outputs": [ 132 | { 133 | "metadata": {}, 134 | "output_type": "pyout", 135 | "prompt_number": 2, 136 | "text": [ 137 | "[1, 2, 3]" 138 | ] 139 | } 140 | ], 141 | "prompt_number": 2 142 | }, 143 | { 144 | "cell_type": "code", 145 | "collapsed": false, 146 | "input": [ 147 | "# Given object: Append source to that object\n", 148 | "L = [1, 2, 3]\n", 149 | "into(L, (4, 5, 6))\n", 150 | "L" 151 | ], 152 | "language": "python", 153 | "metadata": {}, 154 | "outputs": [ 155 | { 156 | "metadata": {}, 157 | "output_type": "pyout", 158 | "prompt_number": 3, 159 | "text": [ 160 | "[1, 2, 3, 4, 5, 6]" 161 | ] 162 | } 163 | ], 164 | "prompt_number": 3 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "These two modes are actually separate functions, `convert` and `append`. Into uses both depending on the situation. A single `into` call may engage both functions.\n", 171 | "\n", 172 | "You should use `into` by default; it performs some additional checks. For the sake of explicitness, however, here are examples using `convert` and `append` directly." 
173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "collapsed": false, 178 | "input": [ 179 | "from into import convert, append\n", 180 | "\n", 181 | "convert(list, (1, 2, 3))" 182 | ], 183 | "language": "python", 184 | "metadata": {}, 185 | "outputs": [ 186 | { 187 | "metadata": {}, 188 | "output_type": "pyout", 189 | "prompt_number": 1, 190 | "text": [ 191 | "[1, 2, 3]" 192 | ] 193 | } 194 | ], 195 | "prompt_number": 1 196 | }, 197 | { 198 | "cell_type": "code", 199 | "collapsed": false, 200 | "input": [ 201 | "L = [1, 2, 3]\n", 202 | "append(L, (4, 5, 6))" 203 | ], 204 | "language": "python", 205 | "metadata": {}, 206 | "outputs": [ 207 | { 208 | "metadata": {}, 209 | "output_type": "pyout", 210 | "prompt_number": 3, 211 | "text": [ 212 | "[1, 2, 3, 4, 5, 6]" 213 | ] 214 | } 215 | ], 216 | "prompt_number": 3 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "### How to extend `into`\n", 223 | "\n", 224 | "When we encounter new data formats we may wish to connect them to the `into` graph. 
We do this by implementing new versions of `discover`, `convert`, and `append` (if we support appending).\n", 225 | "\n", 226 | "We register new implementations of an operation like convert by creating a new Python function and decorating it with types and a cost.\n", 227 | "\n", 228 | "#### Example\n", 229 | "\n", 230 | "Here we define how to convert from a DataFrame to a NumPy array" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "collapsed": false, 236 | "input": [ 237 | "import numpy as np\n", 238 | "import pandas as pd\n", 239 | "\n", 240 | "# target type, source data, cost\n", 241 | "@convert.register(np.ndarray, pd.DataFrame, cost=1.0)\n", 242 | "def dataframe_to_numpy(df, **kwargs):\n", 243 | " return df.to_records(index=False)" 244 | ], 245 | "language": "python", 246 | "metadata": {}, 247 | "outputs": [], 248 | "prompt_number": 4 249 | } 250 | ], 251 | "metadata": {} 252 | } 253 | ] 254 | } -------------------------------------------------------------------------------- /05-Blaze-with-SQL.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:d66bda4ad596e7d1038472946b691f727760aa36c6a5bcb7a9e63f8067624882" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "\"Blaze\n", 18 | "\n", 19 | "# Getting Started with Blaze\n", 20 | "\n", 21 | "* Full tutorial available at http://github.com/ContinuumIO/blaze-tutorial\n", 22 | "* Install software with `conda install -c blaze blaze`" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## 2. Using Blaze with Foreign Systems\n", 30 | "\n", 31 | "Blaze gives us an interactive experience over data living in foreign systems, like a SQL database." 
32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "collapsed": false, 37 | "input": [ 38 | "from blaze import Data, join, by, transform\n", 39 | "\n", 40 | "db = Data('sqlite:///data/lahman2013.sqlite')\n", 41 | "db.Teams" 42 | ], 43 | "language": "python", 44 | "metadata": {}, 45 | "outputs": [ 46 | { 47 | "metadata": {}, 48 | "output_type": "pyout", 49 | "prompt_number": 2, 50 | "text": [ 51 | " yearID lgID teamID franchID divID Rank G Ghome W L ... \\\n", 52 | "0 1871 NA BS1 BNA None 3 31 NaN 20 10 ... \n", 53 | "1 1871 NA CH1 CNA None 2 28 NaN 19 9 ... \n", 54 | "2 1871 NA CL1 CFC None 8 29 NaN 10 19 ... \n", 55 | "3 1871 NA FW1 KEK None 7 19 NaN 7 12 ... \n", 56 | "4 1871 NA NY2 NNA None 5 33 NaN 16 17 ... \n", 57 | "5 1871 NA PH1 PNA None 1 28 NaN 21 7 ... \n", 58 | "6 1871 NA RC1 ROK None 9 25 NaN 4 21 ... \n", 59 | "7 1871 NA TRO TRO None 6 29 NaN 13 15 ... \n", 60 | "8 1871 NA WS3 OLY None 4 32 NaN 15 15 ... \n", 61 | "9 1872 NA BL1 BLC None 2 58 NaN 35 19 ... \n", 62 | "10 1872 NA BR1 ECK None 9 29 NaN 3 26 ... 
\n", 63 | "\n", 64 | " DP FP name park \\\n", 65 | "0 NaN 0.83 Boston Red Stockings South End Grounds I \n", 66 | "1 NaN 0.82 Chicago White Stockings Union Base-Ball Grounds \n", 67 | "2 NaN 0.81 Cleveland Forest Citys National Association Grounds \n", 68 | "3 NaN 0.80 Fort Wayne Kekiongas Hamilton Field \n", 69 | "4 NaN 0.83 New York Mutuals Union Grounds (Brooklyn) \n", 70 | "5 NaN 0.84 Philadelphia Athletics Jefferson Street Grounds \n", 71 | "6 NaN 0.82 Rockford Forest Citys Agricultural Society Fair Grounds \n", 72 | "7 NaN 0.84 Troy Haymakers Haymakers' Grounds \n", 73 | "8 NaN 0.85 Washington Olympics Olympics Grounds \n", 74 | "9 NaN 0.82 Baltimore Canaries Newington Park \n", 75 | "10 NaN 0.79 Brooklyn Eckfords Union Grounds \n", 76 | "\n", 77 | " attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro \n", 78 | "0 NaN 103 98 BOS BS1 BS1 \n", 79 | "1 NaN 104 102 CHI CH1 CH1 \n", 80 | "2 NaN 96 100 CLE CL1 CL1 \n", 81 | "3 NaN 101 107 KEK FW1 FW1 \n", 82 | "4 NaN 90 88 NYU NY2 NY2 \n", 83 | "5 NaN 102 98 ATH PH1 PH1 \n", 84 | "6 NaN 97 99 ROK RC1 RC1 \n", 85 | "7 NaN 101 100 TRO TRO TRO \n", 86 | "8 NaN 94 98 OLY WS3 WS3 \n", 87 | "9 NaN 106 102 BAL BL1 BL1 \n", 88 | "10 NaN 87 96 ECK BR1 BR1 \n", 89 | "\n", 90 | "..." 
91 | ] 92 | } 93 | ], 94 | "prompt_number": 2 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "We can use all of our standard queries" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "collapsed": false, 106 | "input": [ 107 | "db.Teams[db.Teams.name == 'Chicago White Stockings'][['name', 'yearID', 'park']]" 108 | ], 109 | "language": "python", 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "metadata": {}, 114 | "output_type": "pyout", 115 | "prompt_number": 5, 116 | "text": [ 117 | " name yearID park\n", 118 | "0 Chicago White Stockings 1871 Union Base-Ball Grounds\n", 119 | "1 Chicago White Stockings 1874 23rd Street Grounds\n", 120 | "2 Chicago White Stockings 1875 23rd Street Grounds\n", 121 | "3 Chicago White Stockings 1876 23rd Street Grounds\n", 122 | "4 Chicago White Stockings 1877 23rd Street Grounds\n", 123 | "5 Chicago White Stockings 1878 Lake Front Park I\n", 124 | "6 Chicago White Stockings 1879 Lake Front Park I\n", 125 | "7 Chicago White Stockings 1880 Lake Front Park I\n", 126 | "8 Chicago White Stockings 1881 Lake Front Park I\n", 127 | "9 Chicago White Stockings 1882 Lake Front Park I/Lake Front Park II\n", 128 | "..." 
129 | ] 130 | } 131 | ], 132 | "prompt_number": 5 133 | }, 134 | { 135 | "cell_type": "code", 136 | "collapsed": false, 137 | "input": [ 138 | "by(db.Teams.name, start_year=db.Teams.yearID.min(), \n", 139 | " end_year=db.Teams.yearID.max()).sort('end_year', ascending=False)" 140 | ], 141 | "language": "python", 142 | "metadata": {}, 143 | "outputs": [ 144 | { 145 | "metadata": {}, 146 | "output_type": "pyout", 147 | "prompt_number": 6, 148 | "text": [ 149 | " name end_year start_year\n", 150 | "0 Arizona Diamondbacks 2013 1998\n", 151 | "1 Atlanta Braves 2013 1966\n", 152 | "2 Baltimore Orioles 2013 1882\n", 153 | "3 Boston Red Sox 2013 1908\n", 154 | "4 Chicago Cubs 2013 1903\n", 155 | "5 Chicago White Sox 2013 1901\n", 156 | "6 Cincinnati Reds 2013 1876\n", 157 | "7 Cleveland Indians 2013 1915\n", 158 | "8 Colorado Rockies 2013 1993\n", 159 | "9 Detroit Tigers 2013 1901\n", 160 | "..." 161 | ] 162 | } 163 | ], 164 | "prompt_number": 6 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "### These queries happen in the database\n", 171 | "\n", 172 | "Note that the Blaze software doesn't perform computations; the database does. Blaze is just a friendly intermediary between you and your database.\n", 173 | "\n", 174 | "Here we see what the generated SQL looks like. You generally don't need to do this at home." 
175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "collapsed": false, 180 | "input": [ 181 | "from blaze import compute # compute is internal API, not intended for users\n", 182 | "\n", 183 | "expr = db.Teams[db.Teams.name == 'Chicago White Stockings'][['name', 'yearID', 'park']]\n", 184 | "print compute(expr, post_compute=False)" 185 | ], 186 | "language": "python", 187 | "metadata": {}, 188 | "outputs": [ 189 | { 190 | "output_type": "stream", 191 | "stream": "stdout", 192 | "text": [ 193 | "SELECT \"Teams\".name, \"Teams\".\"yearID\", \"Teams\".park \n", 194 | "FROM \"Teams\" \n", 195 | "WHERE \"Teams\".name = ?\n" 196 | ] 197 | } 198 | ], 199 | "prompt_number": 7 200 | }, 201 | { 202 | "cell_type": "code", 203 | "collapsed": false, 204 | "input": [ 205 | "expr = by(db.Teams.name, start_year=db.Teams.yearID.min(), \n", 206 | " end_year=db.Teams.yearID.max()).sort('end_year', ascending=False)\n", 207 | "print compute(expr, post_compute=False)" 208 | ], 209 | "language": "python", 210 | "metadata": {}, 211 | "outputs": [ 212 | { 213 | "output_type": "stream", 214 | "stream": "stdout", 215 | "text": [ 216 | "SELECT \"Teams\".name, max(\"Teams\".\"yearID\") AS end_year, min(\"Teams\".\"yearID\") AS start_year \n", 217 | "FROM \"Teams\" GROUP BY \"Teams\".name ORDER BY end_year DESC\n" 218 | ] 219 | } 220 | ], 221 | "prompt_number": 8 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "### Exercises\n", 228 | "\n", 229 | "First, look at the `db.dshape` to get an idea of what we have in the database. This will tell you what tables are in the database as well as what columns we have in each table." 
230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "collapsed": false, 235 | "input": [ 236 | "db.dshape" 237 | ], 238 | "language": "python", 239 | "metadata": {}, 240 | "outputs": [] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "Play for a while with these tables, inspecting data and writing a few queries that might interest you.\n", 247 | "\n", 248 | "If you're having trouble finding interesting queries, you might try the following:\n", 249 | "\n", 250 | "* How many unique pitchers were there in each year?\n", 251 | "* In what years did Hank Aaron play? When was his last year? (Hank has the playerID `aaronha01`)\n", 252 | "* What was the average salary per year? Has it increased or decreased over time?" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "collapsed": false, 258 | "input": [], 259 | "language": "python", 260 | "metadata": {}, 261 | "outputs": [] 262 | } 263 | ], 264 | "metadata": {} 265 | } 266 | ] 267 | } -------------------------------------------------------------------------------- /04-Blaze-Introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:d825a277965c84a256612bb6ff63f606b5bba263c90ac5407506f08370ba339b" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "\"Blaze\n", 18 | "\n", 19 | "# Getting Started with Blaze\n", 20 | "\n", 21 | "* Full tutorial available at http://github.com/ContinuumIO/blaze-tutorial\n", 22 | "* Install software with `conda install -c blaze blaze`" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## 1. Basic Queries\n", 30 | "\n", 31 | "For basic tabular queries, Blaze shares the same syntax as Pandas." 
32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "collapsed": false, 37 | "input": [ 38 | "from blaze import Data, by, join, transform" 39 | ], 40 | "language": "python", 41 | "metadata": {}, 42 | "outputs": [], 43 | "prompt_number": 1 44 | }, 45 | { 46 | "cell_type": "code", 47 | "collapsed": false, 48 | "input": [ 49 | "bank = Data([[1, 'Alice', 100],\n", 50 | " [2, 'Bob', -200],\n", 51 | " [3, 'Charlie', 300],\n", 52 | " [4, 'Dennis', 400],\n", 53 | " [5, 'Edith', -500]], columns=['id', 'name', 'amount'])" 54 | ], 55 | "language": "python", 56 | "metadata": {}, 57 | "outputs": [], 58 | "prompt_number": 2 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "### Arithmetic and Reductions" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "collapsed": false, 70 | "input": [ 71 | "bank.amount" 72 | ], 73 | "language": "python", 74 | "metadata": {}, 75 | "outputs": [ 76 | { 77 | "metadata": {}, 78 | "output_type": "pyout", 79 | "prompt_number": 3, 80 | "text": [ 81 | " amount\n", 82 | "0 100\n", 83 | "1 -200\n", 84 | "2 300\n", 85 | "3 400\n", 86 | "4 -500" 87 | ] 88 | } 89 | ], 90 | "prompt_number": 3 91 | }, 92 | { 93 | "cell_type": "code", 94 | "collapsed": false, 95 | "input": [ 96 | "bank.amount / 100" 97 | ], 98 | "language": "python", 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "metadata": {}, 103 | "output_type": "pyout", 104 | "prompt_number": 4, 105 | "text": [ 106 | " amount\n", 107 | "0 1\n", 108 | "1 -2\n", 109 | "2 3\n", 110 | "3 4\n", 111 | "4 -5" 112 | ] 113 | } 114 | ], 115 | "prompt_number": 4 116 | }, 117 | { 118 | "cell_type": "code", 119 | "collapsed": false, 120 | "input": [ 121 | "(bank.amount / 100).mean()" 122 | ], 123 | "language": "python", 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "metadata": {}, 128 | "output_type": "pyout", 129 | "prompt_number": 5, 130 | "text": [ 131 | "0.2" 132 | ] 133 | } 134 | ], 135 | "prompt_number": 5 136 | }, 137 | { 138 | "cell_type": "markdown", 
139 | "metadata": {}, 140 | "source": [ 141 | "### Multiple columns and sorting" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "collapsed": false, 147 | "input": [ 148 | "bank[['name', 'amount']].sort('amount')" 149 | ], 150 | "language": "python", 151 | "metadata": {}, 152 | "outputs": [] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "### Selections\n", 159 | "\n", 160 | "We select subsets of data by indexing one expression with another" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "collapsed": false, 166 | "input": [ 167 | "bank[bank.amount < 0]" 168 | ], 169 | "language": "python", 170 | "metadata": {}, 171 | "outputs": [ 172 | { 173 | "metadata": {}, 174 | "output_type": "pyout", 175 | "prompt_number": 6, 176 | "text": [ 177 | " id name amount\n", 178 | "0 2 Bob -200\n", 179 | "1 5 Edith -500" 180 | ] 181 | } 182 | ], 183 | "prompt_number": 6 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "### Combining Operations\n", 190 | "\n", 191 | "We can combine these sorts of operations with each other" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "collapsed": false, 197 | "input": [ 198 | "bank[bank.amount < 0].amount / 100" 199 | ], 200 | "language": "python", 201 | "metadata": {}, 202 | "outputs": [ 203 | { 204 | "metadata": {}, 205 | "output_type": "pyout", 206 | "prompt_number": 7, 207 | "text": [ 208 | " amount\n", 209 | "0 -2\n", 210 | "1 -5" 211 | ] 212 | } 213 | ], 214 | "prompt_number": 7 215 | }, 216 | { 217 | "cell_type": "code", 218 | "collapsed": false, 219 | "input": [ 220 | "bank[bank.amount < 0].name" 221 | ], 222 | "language": "python", 223 | "metadata": {}, 224 | "outputs": [ 225 | { 226 | "metadata": {}, 227 | "output_type": "pyout", 228 | "prompt_number": 8, 229 | "text": [ 230 | " name\n", 231 | "0 Bob\n", 232 | "1 Edith" 233 | ] 234 | } 235 | ], 236 | "prompt_number": 8 237 | }, 238 | { 239 | "cell_type": "markdown", 
240 | "metadata": {}, 241 | "source": [ 242 | "#### Exercises\n", 243 | "\n", 244 | "Write expressions to answer the following questions" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "collapsed": false, 250 | "input": [ 251 | "# What are the IDs of everyone with a positive amount?\n" 252 | ], 253 | "language": "python", 254 | "metadata": {}, 255 | "outputs": [], 256 | "prompt_number": 9 257 | }, 258 | { 259 | "cell_type": "code", 260 | "collapsed": false, 261 | "input": [ 262 | "# What is the name of the person with amount 400?\n" 263 | ], 264 | "language": "python", 265 | "metadata": {}, 266 | "outputs": [], 267 | "prompt_number": 10 268 | }, 269 | { 270 | "cell_type": "code", 271 | "collapsed": false, 272 | "input": [ 273 | "# What is the difference between the minimum and maximum amounts?\n" 274 | ], 275 | "language": "python", 276 | "metadata": {}, 277 | "outputs": [], 278 | "prompt_number": 11 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": {}, 283 | "source": [ 284 | "## 2. More complex queries\n", 285 | "\n" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "First, we need a more interesting dataset. We open the standard *iris* dataset, a table of 150 measurements of flowers in the iris genus. We find this dataset in a CSV file in the `data/` directory. 
" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "collapsed": false, 298 | "input": [ 299 | "iris = Data('data/iris.csv')\n", 300 | "iris" 301 | ], 302 | "language": "python", 303 | "metadata": {}, 304 | "outputs": [ 305 | { 306 | "html": [ 307 | "\n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | "
" 409 | ], 410 | "metadata": {}, 411 | "output_type": "pyout", 412 | "prompt_number": 12, 413 | "text": [ 414 | " sepal_length sepal_width petal_length petal_width species\n", 415 | "0 5.1 3.5 1.4 0.2 Iris-setosa\n", 416 | "1 4.9 3.0 1.4 0.2 Iris-setosa\n", 417 | "2 4.7 3.2 1.3 0.2 Iris-setosa\n", 418 | "3 4.6 3.1 1.5 0.2 Iris-setosa\n", 419 | "4 5.0 3.6 1.4 0.2 Iris-setosa\n", 420 | "5 5.4 3.9 1.7 0.4 Iris-setosa\n", 421 | "6 4.6 3.4 1.4 0.3 Iris-setosa\n", 422 | "7 5.0 3.4 1.5 0.2 Iris-setosa\n", 423 | "8 4.4 2.9 1.4 0.2 Iris-setosa\n", 424 | "9 4.9 3.1 1.5 0.1 Iris-setosa\n", 425 | "..." 426 | ] 427 | } 428 | ], 429 | "prompt_number": 12 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": {}, 434 | "source": [ 435 | "The `blaze.Data` function operates on all of the file types that we saw in the previous sections on `into`. Blaze expressions use functions like `discover` to get datashapes that help them interact with you." 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "collapsed": false, 441 | "input": [ 442 | "iris.dshape" 443 | ], 444 | "language": "python", 445 | "metadata": {}, 446 | "outputs": [ 447 | { 448 | "metadata": {}, 449 | "output_type": "pyout", 450 | "prompt_number": 13, 451 | "text": [ 452 | "dshape(\"\"\"var * {\n", 453 | " sepal_length: float64,\n", 454 | " sepal_width: float64,\n", 455 | " petal_length: float64,\n", 456 | " petal_width: float64,\n", 457 | " species: string\n", 458 | " }\"\"\")" 459 | ] 460 | } 461 | ], 462 | "prompt_number": 13 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "metadata": {}, 467 | "source": [ 468 | "### Distinct\n", 469 | "\n", 470 | "Now some more queries. 
Distinct finds unique entries" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "collapsed": false, 476 | "input": [ 477 | "iris.species.distinct()" 478 | ], 479 | "language": "python", 480 | "metadata": {}, 481 | "outputs": [ 482 | { 483 | "metadata": {}, 484 | "output_type": "pyout", 485 | "prompt_number": 14, 486 | "text": [ 487 | " species\n", 488 | "0 Iris-setosa\n", 489 | "1 Iris-versicolor\n", 490 | "2 Iris-virginica" 491 | ] 492 | } 493 | ], 494 | "prompt_number": 14 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "metadata": {}, 499 | "source": [ 500 | "Or count the number of distinct entries" 501 | ] 502 | }, 503 | { 504 | "cell_type": "code", 505 | "collapsed": false, 506 | "input": [ 507 | "iris.species.nunique()" 508 | ], 509 | "language": "python", 510 | "metadata": {}, 511 | "outputs": [ 512 | { 513 | "metadata": {}, 514 | "output_type": "pyout", 515 | "prompt_number": 15, 516 | "text": [ 517 | "3" 518 | ] 519 | } 520 | ], 521 | "prompt_number": 15 522 | }, 523 | { 524 | "cell_type": "code", 525 | "collapsed": false, 526 | "input": [ 527 | "iris.sepal_length.nunique()" 528 | ], 529 | "language": "python", 530 | "metadata": {}, 531 | "outputs": [ 532 | { 533 | "metadata": {}, 534 | "output_type": "pyout", 535 | "prompt_number": 16, 536 | "text": [ 537 | "35" 538 | ] 539 | } 540 | ], 541 | "prompt_number": 16 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": {}, 546 | "source": [ 547 | "### Transform\n", 548 | "\n", 549 | "Transform adds new columns based on old ones" 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "collapsed": false, 555 | "input": [ 556 | "transform(iris, sepal_ratio=iris.sepal_length / iris.sepal_width,\n", 557 | " petal_ratio=iris.petal_length / iris.petal_width)" 558 | ], 559 | "language": "python", 560 | "metadata": {}, 561 | "outputs": [ 562 | { 563 | "metadata": {}, 564 | "output_type": "pyout", 565 | "prompt_number": 17, 566 | "text": [ 567 | " sepal_length sepal_width petal_length 
petal_width species \\\n", 568 | "0 5.1 3.5 1.4 0.2 Iris-setosa \n", 569 | "1 4.9 3.0 1.4 0.2 Iris-setosa \n", 570 | "2 4.7 3.2 1.3 0.2 Iris-setosa \n", 571 | "3 4.6 3.1 1.5 0.2 Iris-setosa \n", 572 | "4 5.0 3.6 1.4 0.2 Iris-setosa \n", 573 | "5 5.4 3.9 1.7 0.4 Iris-setosa \n", 574 | "6 4.6 3.4 1.4 0.3 Iris-setosa \n", 575 | "7 5.0 3.4 1.5 0.2 Iris-setosa \n", 576 | "8 4.4 2.9 1.4 0.2 Iris-setosa \n", 577 | "9 4.9 3.1 1.5 0.1 Iris-setosa \n", 578 | "10 5.4 3.7 1.5 0.2 Iris-setosa \n", 579 | "\n", 580 | " sepal_ratio petal_ratio \n", 581 | "0 1.457143 7.000000 \n", 582 | "1 1.633333 7.000000 \n", 583 | "2 1.468750 6.500000 \n", 584 | "3 1.483871 7.500000 \n", 585 | "4 1.388889 7.000000 \n", 586 | "5 1.384615 4.250000 \n", 587 | "6 1.352941 4.666667 \n", 588 | "7 1.470588 7.500000 \n", 589 | "8 1.517241 7.000000 \n", 590 | "9 1.580645 15.000000 \n", 591 | "..." 592 | ] 593 | } 594 | ], 595 | "prompt_number": 17 596 | }, 597 | { 598 | "cell_type": "markdown", 599 | "metadata": {}, 600 | "source": [ 601 | "### Split-apply-combine -- `by`\n", 602 | "\n", 603 | "Split-apply-combine queries, also known as Group-By, split the table into many groups and then do a reduction on each group. We express these queries in blaze with the `by` operator\n", 604 | "\n", 605 | " by(column-on-which-to-split, result_name=reduction_on_group())" 606 | ] 607 | }, 608 | { 609 | "cell_type": "markdown", 610 | "metadata": {}, 611 | "source": [ 612 | "How many measurements do we have per species?" 
613 | ] 614 | }, 615 | { 616 | "cell_type": "code", 617 | "collapsed": false, 618 | "input": [ 619 | "by(iris.species, count=iris.species.count())" 620 | ], 621 | "language": "python", 622 | "metadata": {}, 623 | "outputs": [ 624 | { 625 | "metadata": {}, 626 | "output_type": "pyout", 627 | "prompt_number": 18, 628 | "text": [ 629 | " species count\n", 630 | "0 Iris-setosa 50\n", 631 | "1 Iris-versicolor 50\n", 632 | "2 Iris-virginica 50" 633 | ] 634 | } 635 | ], 636 | "prompt_number": 18 637 | }, 638 | { 639 | "cell_type": "markdown", 640 | "metadata": {}, 641 | "source": [ 642 | "How many measurements do we have per species and what is the longest petal length per species?" 643 | ] 644 | }, 645 | { 646 | "cell_type": "code", 647 | "collapsed": false, 648 | "input": [ 649 | "by(iris.species, count=iris.species.count(), \n", 650 | " longest_petal=iris.petal_length.max())" 651 | ], 652 | "language": "python", 653 | "metadata": {}, 654 | "outputs": [ 655 | { 656 | "metadata": {}, 657 | "output_type": "pyout", 658 | "prompt_number": 19, 659 | "text": [ 660 | " species count longest_petal\n", 661 | "0 Iris-setosa 50 1.9\n", 662 | "1 Iris-versicolor 50 5.1\n", 663 | "2 Iris-virginica 50 6.9" 664 | ] 665 | } 666 | ], 667 | "prompt_number": 19 668 | }, 669 | { 670 | "cell_type": "markdown", 671 | "metadata": {}, 672 | "source": [ 673 | "#### Exercise\n", 674 | "\n", 675 | "Write queries to answer the following questions" 676 | ] 677 | }, 678 | { 679 | "cell_type": "code", 680 | "collapsed": false, 681 | "input": [ 682 | "# What are the longest and shortest sepal_lengths per species?\n" 683 | ], 684 | "language": "python", 685 | "metadata": {}, 686 | "outputs": [], 687 | "prompt_number": 20 688 | }, 689 | { 690 | "cell_type": "code", 691 | "collapsed": false, 692 | "input": [ 693 | "# What is the difference of longest to shortest sepal length per species\n" 694 | ], 695 | "language": "python", 696 | "metadata": {}, 697 | "outputs": [], 698 | "prompt_number": 21 699 | }, 
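The two exercises above can be cross-checked with plain pandas, which (as noted below) Blaze itself uses for small CSV files. This is only a hedged sketch: it uses a tiny inline stand-in table with illustrative values rather than the real `data/iris.csv`, so it runs on its own.

```python
import pandas as pd

# Tiny stand-in for the iris table (values are illustrative, not the real dataset)
df = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'sepal_length': [4.3, 5.8, 4.9, 7.9],
})

# Longest and shortest sepal_length per species
agg = df.groupby('species').sepal_length.agg(['max', 'min'])

# Difference between the longest and shortest per species
agg['range'] = agg['max'] - agg['min']
print(agg)
```

The analogous Blaze query would follow the `by` pattern shown above, e.g. `by(iris.species, longest=iris.sepal_length.max(), shortest=iris.sepal_length.min())`.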
700 | { 701 | "cell_type": "markdown", 702 | "metadata": {}, 703 | "source": [ 704 | "### This is similar to how we solve these problems in Pandas\n", 705 | "\n", 706 | "So far, everything we've seen is similar to solving problems in Pandas" 707 | ] 708 | }, 709 | { 710 | "cell_type": "code", 711 | "collapsed": false, 712 | "input": [ 713 | "import pandas as pd\n", 714 | "df = pd.read_csv('data/iris.csv')\n", 715 | "df.groupby(df.species).sepal_length.min()" 716 | ], 717 | "language": "python", 718 | "metadata": {}, 719 | "outputs": [ 720 | { 721 | "metadata": {}, 722 | "output_type": "pyout", 723 | "prompt_number": 22, 724 | "text": [ 725 | "species\n", 726 | "Iris-setosa 4.3\n", 727 | "Iris-versicolor 4.9\n", 728 | "Iris-virginica 4.9\n", 729 | "Name: sepal_length, dtype: float64" 730 | ] 731 | } 732 | ], 733 | "prompt_number": 22 734 | }, 735 | { 736 | "cell_type": "markdown", 737 | "metadata": {}, 738 | "source": [ 739 | "In fact, for small CSV files like this, Blaze *uses Pandas*, so one might consider just using Pandas directly.\n", 740 | "\n", 741 | "Blaze becomes more useful when we interact with data stored in different systems like SQL databases in the next section." 
742 | ] 743 | } 744 | ], 745 | "metadata": {} 746 | } 747 | ] 748 | } -------------------------------------------------------------------------------- /01-into-Introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:05ae0cb667432cc3cca0cfc359be4b95108d55951367bf9229e536a93c4eba5c" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "\"Blaze\n", 18 | "\n", 19 | "# Getting started with `into`\n", 20 | "\n", 21 | "* Full tutorial available at http://github.com/ContinuumIO/blaze-tutorial\n", 22 | "* Install software with `conda install -c blaze blaze`" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "# 0. Introduction\n", 30 | "\n", 31 | "## Describing Data\n", 32 | "\n", 33 | "We describe \"data\" with a few attributes\n", 34 | "\n", 35 | "1. Where that data lives (in a file on your laptop, in the cloud, etc..)\n", 36 | "2. What data-types we use (we represent names as `strings` and balances as `float32`s)\n", 37 | "3. What storage format we use (we store in CSV, in a PostgreSQL database, in JSON)\n", 38 | "4. The semantic values that those bits represent (Barack Obama, or the number 3)\n", 39 | "\n", 40 | "As analysts we care only about point 4, the values that our data represent. Points 1-3 are incidental to how we use computers; these points only get in the way of analysis.\n", 41 | "\n", 42 | "As computationalists though we care very deeply about points 1-3. The choice of format, location, and datatype *strongly* impact the efficiency and correctness of our computations. 
Good choices here can mean the difference between waiting overnight and using our data *interactively*.\n", 43 | "\n", 44 | "Unfortunately points 1-3 encompass a lot of complexity and change more quickly than most analysts care to manage.\n", 45 | "\n", 46 | "The `into` project alleviates the pain of dealing with the first three points by providing intuitive description and transfer between data formats and storage systems. This allows analysts to quickly reason about and migrate their data to efficient, correct, and resilient formats." 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "### Motivating Example\n", 54 | "\n", 55 | "Before we start small with the tutorial we give a more comprehensive example.\n", 56 | "\n", 57 | "We have a small CSV file holding the iris data" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "collapsed": false, 63 | "input": [ 64 | "!head -5 data/iris.csv" 65 | ], 66 | "language": "python", 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "output_type": "stream", 71 | "stream": "stdout", 72 | "text": [ 73 | "sepal_length,sepal_width,petal_length,petal_width,species\r\n", 74 | "5.1,3.5,1.4,0.2,Iris-setosa\r\n", 75 | "4.9,3.0,1.4,0.2,Iris-setosa\r\n", 76 | "4.7,3.2,1.3,0.2,Iris-setosa\r\n", 77 | "4.6,3.1,1.5,0.2,Iris-setosa\r\n" 78 | ] 79 | } 80 | ], 81 | "prompt_number": 1 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "We put this data into a list, a NumPy array, and a SQLite database. We move data to three very different technologies with the same abstraction." 
88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "collapsed": false, 93 | "input": [ 94 | "from into import into\n", 95 | "into(list, 'data/iris.csv')[:5]" 96 | ], 97 | "language": "python", 98 | "metadata": {}, 99 | "outputs": [ 100 | { 101 | "metadata": {}, 102 | "output_type": "pyout", 103 | "prompt_number": 2, 104 | "text": [ 105 | "[(5.1, 3.5, 1.4, 0.2, u'Iris-setosa'),\n", 106 | " (4.9, 3.0, 1.4, 0.2, u'Iris-setosa'),\n", 107 | " (4.7, 3.2, 1.3, 0.2, u'Iris-setosa'),\n", 108 | " (4.6, 3.1, 1.5, 0.2, u'Iris-setosa'),\n", 109 | " (5.0, 3.6, 1.4, 0.2, u'Iris-setosa')]" 110 | ] 111 | } 112 | ], 113 | "prompt_number": 2 114 | }, 115 | { 116 | "cell_type": "code", 117 | "collapsed": false, 118 | "input": [ 119 | "import numpy as np\n", 120 | "into(np.ndarray, 'data/iris.csv')[:5]" 121 | ], 122 | "language": "python", 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "metadata": {}, 127 | "output_type": "pyout", 128 | "prompt_number": 3, 129 | "text": [ 130 | "rec.array([(5.1, 3.5, 1.4, 0.2, u'Iris-setosa'),\n", 131 | " (4.9, 3.0, 1.4, 0.2, u'Iris-setosa'),\n", 132 | " (4.7, 3.2, 1.3, 0.2, u'Iris-setosa'),\n", 133 | " (4.6, 3.1, 1.5, 0.2, u'Iris-setosa'),\n", 134 | " (5.0, 3.6, 1.4, 0.2, u'Iris-setosa')], \n", 135 | " dtype=[('sepal_length', ', nullable=False), Column('sepal_width', FLOAT(), table=, nullable=False), Column('petal_length', FLOAT(), table=, nullable=False), Column('petal_width', FLOAT(), table=, nullable=False), Column('species', TEXT(), table=, nullable=False), schema=None)" 156 | ] 157 | } 158 | ], 159 | "prompt_number": 4 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": {}, 164 | "source": [ 165 | "### Outline\n", 166 | "\n", 167 | "Into moves data between formats intuitively.\n", 168 | "\n", 169 | "We structure this tutorial as follows:\n", 170 | "\n", 171 | "1. **Basic How-to**: on how to use this library effectively\n", 172 | "2. **Datatypes**: to enhance performance\n", 173 | "3. 
**Internals**" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": {}, 179 | "source": [ 180 | "`into` is available on conda\n", 181 | "\n", 182 | " conda install into\n", 183 | " or \n", 184 | " conda install into -c blaze # Up-to-date version\n", 185 | " \n", 186 | "or on PyPI\n", 187 | "\n", 188 | " pip install into\n", 189 | " or \n", 190 | " pip install git+http://github.com/ContinuumIO/into.git # Up-to-date version\n", 191 | " " 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "# 1. Basic Usage" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "## 1.1 `into`" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "collapsed": false, 211 | "input": [ 212 | "from into import into" 213 | ], 214 | "language": "python", 215 | "metadata": {}, 216 | "outputs": [], 217 | "prompt_number": 5 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "Into takes two arguments, a target and a source\n", 224 | "\n", 225 | " into(target, source)\n", 226 | " \n", 227 | "And it turns the source into something like the target" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "### The `target` can be a type\n", 235 | "\n", 236 | "In which case it makes a new object of that type" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "collapsed": false, 242 | "input": [ 243 | "import numpy as np\n", 244 | "into(np.ndarray, [1, 2, 3])" 245 | ], 246 | "language": "python", 247 | "metadata": {}, 248 | "outputs": [ 249 | { 250 | "metadata": {}, 251 | "output_type": "pyout", 252 | "prompt_number": 6, 253 | "text": [ 254 | "array([1, 2, 3])" 255 | ] 256 | } 257 | ], 258 | "prompt_number": 6 259 | }, 260 | { 261 | "cell_type": "code", 262 | "collapsed": false, 263 | "input": [ 264 | "into(set, [1, 2, 3])" 265 | ], 266 | "language": "python", 267 | "metadata": {}, 268 
| "outputs": [ 269 | { 270 | "metadata": {}, 271 | "output_type": "pyout", 272 | "prompt_number": 7, 273 | "text": [ 274 | "{1, 2, 3}" 275 | ] 276 | } 277 | ], 278 | "prompt_number": 7 279 | }, 280 | { 281 | "cell_type": "code", 282 | "collapsed": false, 283 | "input": [ 284 | "import pandas as pd\n", 285 | "into(pd.Series, (10, 20, 30))" 286 | ], 287 | "language": "python", 288 | "metadata": {}, 289 | "outputs": [ 290 | { 291 | "metadata": {}, 292 | "output_type": "pyout", 293 | "prompt_number": 8, 294 | "text": [ 295 | "0 10\n", 296 | "1 20\n", 297 | "2 30\n", 298 | "dtype: int64" 299 | ] 300 | } 301 | ], 302 | "prompt_number": 8 303 | }, 304 | { 305 | "cell_type": "markdown", 306 | "metadata": {}, 307 | "source": [ 308 | "#### Exercise\n", 309 | "\n", 310 | "Use into to turn the following DataFrame into an `np.ndarray` and a `list`" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "collapsed": false, 316 | "input": [ 317 | "df = pd.DataFrame([['Alice', 100],\n", 318 | " ['Bob', 200],\n", 319 | " ['Charlie', 300]], columns=['name', 'balance'])" 320 | ], 321 | "language": "python", 322 | "metadata": {}, 323 | "outputs": [], 324 | "prompt_number": 9 325 | }, 326 | { 327 | "cell_type": "code", 328 | "collapsed": false, 329 | "input": [ 330 | "# Turn df into an np.ndarray\n", 331 | "# into(..., ...)" 332 | ], 333 | "language": "python", 334 | "metadata": {}, 335 | "outputs": [], 336 | "prompt_number": 10 337 | }, 338 | { 339 | "cell_type": "code", 340 | "collapsed": false, 341 | "input": [ 342 | "# Turn df into a list\n", 343 | "# into(..., ...)" 344 | ], 345 | "language": "python", 346 | "metadata": {}, 347 | "outputs": [], 348 | "prompt_number": 11 349 | }, 350 | { 351 | "cell_type": "code", 352 | "collapsed": false, 353 | "input": [], 354 | "language": "python", 355 | "metadata": {}, 356 | "outputs": [], 357 | "prompt_number": 11 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "### The `target` can be an 
object\n", 364 | "\n", 365 | "In which case we append the source to that object." 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "collapsed": false, 371 | "input": [ 372 | "target = []\n", 373 | "into(target, (1, 2, 3))\n", 374 | "into(target, (1, 2, 3))\n", 375 | "into(target, (1, 2, 3))\n", 376 | "target" 377 | ], 378 | "language": "python", 379 | "metadata": {}, 380 | "outputs": [ 381 | { 382 | "metadata": {}, 383 | "output_type": "pyout", 384 | "prompt_number": 12, 385 | "text": [ 386 | "[1, 2, 3, 1, 2, 3, 1, 2, 3]" 387 | ] 388 | } 389 | ], 390 | "prompt_number": 12 391 | }, 392 | { 393 | "cell_type": "markdown", 394 | "metadata": {}, 395 | "source": [ 396 | "#### Exercise\n", 397 | "\n", 398 | "Use `into` to make a set holding all the data in the following list of DataFrames." 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "collapsed": false, 404 | "input": [ 405 | "L = [pd.DataFrame({'name': ['Alice', 'Bob'], 'balance': [100, 200]}),\n", 406 | " pd.DataFrame({'name': ['Charlie', 'Dan'], 'balance': [300, 400]}),\n", 407 | " pd.DataFrame({'name': ['Edith', 'Frank'], 'balance': [500, 600]})]" 408 | ], 409 | "language": "python", 410 | "metadata": {}, 411 | "outputs": [], 412 | "prompt_number": 13 413 | }, 414 | { 415 | "cell_type": "code", 416 | "collapsed": false, 417 | "input": [ 418 | "s = set()\n", 419 | "# Use into and some kind of for loop to put all of the data in L into the set s\n" 420 | ], 421 | "language": "python", 422 | "metadata": {}, 423 | "outputs": [], 424 | "prompt_number": 14 425 | }, 426 | { 427 | "cell_type": "code", 428 | "collapsed": false, 429 | "input": [], 430 | "language": "python", 431 | "metadata": {}, 432 | "outputs": [], 433 | "prompt_number": 14 434 | }, 435 | { 436 | "cell_type": "markdown", 437 | "metadata": {}, 438 | "source": [ 439 | "#### Exercise\n", 440 | "\n", 441 | "Repeat the last exercise but append all of the data onto a `tuple`. What do you expect to happen?" 
442 | ] 443 | }, 444 | { 445 | "cell_type": "code", 446 | "collapsed": false, 447 | "input": [ 448 | "t = tuple()\n" 449 | ], 450 | "language": "python", 451 | "metadata": {}, 452 | "outputs": [], 453 | "prompt_number": 15 454 | }, 455 | { 456 | "cell_type": "code", 457 | "collapsed": false, 458 | "input": [], 459 | "language": "python", 460 | "metadata": {}, 461 | "outputs": [], 462 | "prompt_number": 15 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "metadata": {}, 467 | "source": [ 468 | "It's important to know that `into` has limitations. " 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "metadata": {}, 474 | "source": [ 475 | "### The target can be a string\n", 476 | "\n", 477 | "Many data sources external to Python (like a CSV file) don't have a Python object that we can put as the source or target. In these cases we use string URIs. Examples of strings include\n", 478 | "\n", 479 | " myfile.csv\n", 480 | " myfile.json\n", 481 | " myfile.hdf5\n", 482 | " myfile.hdf5::/data\n", 483 | " sqlite:///myfile.db::table-name\n", 484 | " postgresql://user:password@host:port/database::table-name\n", 485 | " ...\n", 486 | " \n", 487 | "These can go either in the source or target inputs.\n", 488 | "\n", 489 | "Here we write our dataframe to a CSV file" 490 | ] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "collapsed": false, 495 | "input": [ 496 | "# Write DataFrame to CSV file\n", 497 | "into('accounts.csv', df)" 498 | ], 499 | "language": "python", 500 | "metadata": {}, 501 | "outputs": [ 502 | { 503 | "metadata": {}, 504 | "output_type": "pyout", 505 | "prompt_number": 16, 506 | "text": [ 507 | "" 508 | ] 509 | } 510 | ], 511 | "prompt_number": 16 512 | }, 513 | { 514 | "cell_type": "code", 515 | "collapsed": false, 516 | "input": [ 517 | "# print out text in accounts.csv\n", 518 | "!head accounts.csv" 519 | ], 520 | "language": "python", 521 | "metadata": {}, 522 | "outputs": [ 523 | { 524 | "output_type": "stream", 525 | "stream": "stdout", 526 | 
"text": [ 527 | "name,balance\r\n", 528 | "Alice,100\r\n", 529 | "Bob,200\r\n", 530 | "Charlie,300\r\n" 531 | ] 532 | } 533 | ], 534 | "prompt_number": 17 535 | }, 536 | { 537 | "cell_type": "code", 538 | "collapsed": false, 539 | "input": [ 540 | "# Read CSV file into memory as list\n", 541 | "into(list, 'accounts.csv')" 542 | ], 543 | "language": "python", 544 | "metadata": {}, 545 | "outputs": [ 546 | { 547 | "metadata": {}, 548 | "output_type": "pyout", 549 | "prompt_number": 18, 550 | "text": [ 551 | "[(u'Alice', 100), (u'Bob', 200), (u'Charlie', 300)]" 552 | ] 553 | } 554 | ], 555 | "prompt_number": 18 556 | }, 557 | { 558 | "cell_type": "markdown", 559 | "metadata": {}, 560 | "source": [ 561 | "#### Exercise\n", 562 | "\n", 563 | "Read the contents of the file `'data/iris.csv'` into a `pd.DataFrame`. Then write that dataframe to `'data/iris.json'`. Inspect the JSON data to ensure that it came out correctly." 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "collapsed": false, 569 | "input": [], 570 | "language": "python", 571 | "metadata": {}, 572 | "outputs": [], 573 | "prompt_number": 18 574 | }, 575 | { 576 | "cell_type": "code", 577 | "collapsed": false, 578 | "input": [], 579 | "language": "python", 580 | "metadata": {}, 581 | "outputs": [], 582 | "prompt_number": 18 583 | }, 584 | { 585 | "cell_type": "markdown", 586 | "metadata": {}, 587 | "source": [ 588 | "#### Exercise\n", 589 | "\n", 590 | "Write the contents of your JSON file to a SQLite database using the following URI as the target\n", 591 | "\n", 592 | " sqlite:///data/my.db::iris\n", 593 | " \n", 594 | "Then read data from that SQLite database into Python to make sure that it arrived safely." 
595 | ] 596 | }, 597 | { 598 | "cell_type": "code", 599 | "collapsed": false, 600 | "input": [], 601 | "language": "python", 602 | "metadata": {}, 603 | "outputs": [], 604 | "prompt_number": 18 605 | }, 606 | { 607 | "cell_type": "code", 608 | "collapsed": false, 609 | "input": [], 610 | "language": "python", 611 | "metadata": {}, 612 | "outputs": [], 613 | "prompt_number": 18 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 | "## 1.2 `resource`\n", 620 | "\n", 621 | "Much interesting data lives *outside* of Python. As we just saw, we often use URIs to specify this kind of data.\n", 622 | "\n", 623 | "Here we load a bit of a SQL database on baseball statistics ([download here](https://github.com/jknecht/baseball-archive-sqlite/raw/master/lahman2013.sqlite)) into memory as a list" 624 | ] 625 | }, 626 | { 627 | "cell_type": "code", 628 | "collapsed": false, 629 | "input": [ 630 | "into(list, 'sqlite:///data/lahman2013.sqlite::BattingPost')[:5]" 631 | ], 632 | "language": "python", 633 | "metadata": {}, 634 | "outputs": [ 635 | { 636 | "metadata": {}, 637 | "output_type": "pyout", 638 | "prompt_number": 19, 639 | "text": [ 640 | "[(1884, u'WS', u'becanbu01', u'NY4', u'AA', 1, 2, 0, 1, 0, 0, 0, 0, 0, None, 0, 0, 0, None, None, None, None),\n", 641 | " (1884, u'WS', u'bradyst01', u'NY4', u'AA', 3, 10, 1, 0, 0, 0, 0, 0, 0, None, 0, 1, 0, None, None, None, None),\n", 642 | " (1884, u'WS', u'carrocl01', u'PRO', u'NL', 3, 10, 2, 1, 0, 0, 0, 1, 0, None, 1, 1, 0, None, None, None, None),\n", 643 | " (1884, u'WS', u'dennyje01', u'PRO', u'NL', 3, 9, 3, 4, 0, 1, 1, 2, 0, None, 0, 3, 0, None, None, None, None),\n", 644 | " (1884, u'WS', u'esterdu01', u'NY4', u'AA', 3, 10, 0, 3, 1, 0, 0, 0, 1, None, 0, 3, 0, None, None, None, None)]" 645 | ] 646 | } 647 | ], 648 | "prompt_number": 19 649 | }, 650 | { 651 | "cell_type": "markdown", 652 | "metadata": {}, 653 | "source": [ 654 | "Now we learn how these strings work so that we can specify 
many types of external data.\n", 655 | "\n", 656 | "Internally `into` uses the function `resource` to turn a string into a Python proxy object. Usually these objects don't hold the data themselves. They just serve as useful pointers to where the data lives. In most cases we use other Python projects for proxy objects.\n", 657 | "\n", 658 | "In the case of SQL tables, resource returns a `sqlalchemy.Table` object." 659 | ] 660 | }, 661 | { 662 | "cell_type": "code", 663 | "collapsed": false, 664 | "input": [ 665 | "from into import resource\n", 666 | "t = resource('sqlite:///data/lahman2013.sqlite::BattingPost')" 667 | ], 668 | "language": "python", 669 | "metadata": {}, 670 | "outputs": [], 671 | "prompt_number": 20 672 | }, 673 | { 674 | "cell_type": "code", 675 | "collapsed": false, 676 | "input": [ 677 | "type(t)" 678 | ], 679 | "language": "python", 680 | "metadata": {}, 681 | "outputs": [ 682 | { 683 | "metadata": {}, 684 | "output_type": "pyout", 685 | "prompt_number": 21, 686 | "text": [ 687 | "sqlalchemy.sql.schema.Table" 688 | ] 689 | } 690 | ], 691 | "prompt_number": 21 692 | }, 693 | { 694 | "cell_type": "code", 695 | "collapsed": false, 696 | "input": [ 697 | "t" 698 | ], 699 | "language": "python", 700 | "metadata": {}, 701 | "outputs": [ 702 | { 703 | "metadata": {}, 704 | "output_type": "pyout", 705 | "prompt_number": 22, 706 | "text": [ 707 | "Table('BattingPost', MetaData(bind=Engine(sqlite:///data/lahman2013.sqlite)), Column('yearID', INTEGER(), table=), Column('round', TEXT(), table=), Column('playerID', TEXT(), table=), Column('teamID', TEXT(), table=), Column('lgID', TEXT(), table=), Column('G', INTEGER(), table=), Column('AB', INTEGER(), table=), Column('R', INTEGER(), table=), Column('H', INTEGER(), table=), Column('2B', INTEGER(), table=), Column('3B', INTEGER(), table=), Column('HR', INTEGER(), table=), Column('RBI', INTEGER(), table=), Column('SB', INTEGER(), table=), Column('CS', INTEGER(), table=), Column('BB', INTEGER(), table=), 
Column('SO', INTEGER(), table=), Column('IBB', INTEGER(), table=), Column('HBP', INTEGER(), table=), Column('SH', INTEGER(), table=), Column('SF', INTEGER(), table=), Column('GIDP', INTEGER(), table=), schema=None)" 708 | ] 709 | } 710 | ], 711 | "prompt_number": 22 712 | }, 713 | { 714 | "cell_type": "markdown", 715 | "metadata": {}, 716 | "source": [ 717 | "We use *this* object as the `into` source." 718 | ] 719 | }, 720 | { 721 | "cell_type": "code", 722 | "collapsed": false, 723 | "input": [ 724 | "into(list, t)[:5]" 725 | ], 726 | "language": "python", 727 | "metadata": {}, 728 | "outputs": [ 729 | { 730 | "metadata": {}, 731 | "output_type": "pyout", 732 | "prompt_number": 23, 733 | "text": [ 734 | "[(1884, u'WS', u'becanbu01', u'NY4', u'AA', 1, 2, 0, 1, 0, 0, 0, 0, 0, None, 0, 0, 0, None, None, None, None),\n", 735 | " (1884, u'WS', u'bradyst01', u'NY4', u'AA', 3, 10, 1, 0, 0, 0, 0, 0, 0, None, 0, 1, 0, None, None, None, None),\n", 736 | " (1884, u'WS', u'carrocl01', u'PRO', u'NL', 3, 10, 2, 1, 0, 0, 0, 1, 0, None, 1, 1, 0, None, None, None, None),\n", 737 | " (1884, u'WS', u'dennyje01', u'PRO', u'NL', 3, 9, 3, 4, 0, 1, 1, 2, 0, None, 0, 3, 0, None, None, None, None),\n", 738 | " (1884, u'WS', u'esterdu01', u'NY4', u'AA', 3, 10, 0, 3, 1, 0, 0, 0, 1, None, 0, 3, 0, None, None, None, None)]" 739 | ] 740 | } 741 | ], 742 | "prompt_number": 23 743 | }, 744 | { 745 | "cell_type": "markdown", 746 | "metadata": {}, 747 | "source": [ 748 | "So when you write\n", 749 | "\n", 750 | "```python\n", 751 | "into(list, 'some-string')\n", 752 | "```\n", 753 | "\n", 754 | "It is actually just shorthand for\n", 755 | "\n", 756 | "```python\n", 757 | "into(list, resource('some-string'))\n", 758 | "```" 759 | ] 760 | }, 761 | { 762 | "cell_type": "markdown", 763 | "metadata": {}, 764 | "source": [ 765 | "#### Exercise\n", 766 | "\n", 767 | "We have some data in the `data/` directory. 
Use `resource` on each of the following strings to see what it returns.\n", 768 | "\n", 769 | "    sqlite:///data/lahman2013.sqlite::BattingPost\n", 770 | "    sqlite:///data/lahman2013.sqlite::Salaries\n", 771 | "    sqlite:///data/lahman2013.sqlite\n", 772 | "    data/sample.hdf5::/points\n", 773 | "    data/sample.hdf5\n", 774 | "    data/iris.csv" 775 | ] 776 | }, 777 | { 778 | "cell_type": "code", 779 | "collapsed": false, 780 | "input": [], 781 | "language": "python", 782 | "metadata": {}, 783 | "outputs": [] 784 | }, 785 | { 786 | "cell_type": "code", 787 | "collapsed": false, 788 | "input": [], 789 | "language": "python", 790 | "metadata": {}, 791 | "outputs": [] 792 | }, 793 | { 794 | "cell_type": "code", 795 | "collapsed": false, 796 | "input": [], 797 | "language": "python", 798 | "metadata": {}, 799 | "outputs": [] 800 | }, 801 | { 802 | "cell_type": "code", 803 | "collapsed": false, 804 | "input": [], 805 | "language": "python", 806 | "metadata": {}, 807 | "outputs": [] 808 | }, 809 | { 810 | "cell_type": "code", 811 | "collapsed": false, 812 | "input": [], 813 | "language": "python", 814 | "metadata": {}, 815 | "outputs": [] 816 | }, 817 | { 818 | "cell_type": "code", 819 | "collapsed": false, 820 | "input": [], 821 | "language": "python", 822 | "metadata": {}, 823 | "outputs": [] 824 | }, 825 | { 826 | "cell_type": "markdown", 827 | "metadata": {}, 828 | "source": [ 829 | "### The `::` separator\n", 830 | "\n", 831 | "In the last exercise we saw the following URI for a table in a SQLite database\n", 832 | "\n", 833 | "    sqlite:///data/my.db::iris\n", 834 | "    \n", 835 | "We deconstruct this URI to make it clearer. First we split the URI on `::` to separate the database from the table name\n", 836 | "\n", 837 | "    Database:   sqlite:///data/my.db\n", 838 | "    Table name: iris\n", 839 | "    \n", 840 | "We use the `::` separator whenever datasets live within some nested structure like a database or HDF5 file."
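The split on `::` described above can be sketched in plain Python. This is a hypothetical helper for illustration only, not the actual parsing logic inside `into`:

```python
def split_uri(uri):
    """Split an into-style URI on '::' into (resource, subpath).

    Toy illustration of the convention: everything before the last '::'
    names the container (database, HDF5 file), everything after names the
    dataset inside it. The real `into` parser handles more cases.
    """
    if '::' in uri:
        database, table = uri.rsplit('::', 1)
        return database, table
    return uri, None

print(split_uri('sqlite:///data/my.db::iris'))    # ('sqlite:///data/my.db', 'iris')
print(split_uri('data/sample.hdf5::/points'))     # ('data/sample.hdf5', '/points')
print(split_uri('data/iris.csv'))                 # ('data/iris.csv', None)
```

Splitting on the *last* `::` keeps container paths that themselves contain colons intact.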
841 | ] 842 | }, 843 | { 844 | "cell_type": "markdown", 845 | "metadata": {}, 846 | "source": [ 847 | "### Specifying protocols with `://`\n", 848 | "\n", 849 | "The database string `sqlite:///data/my.db` is specific to SQLAlchemy, but follows a common format, notably\n", 850 | "\n", 851 | "    Protocol:  sqlite://\n", 852 | "    Filename:  data/my.db\n", 853 | "    \n", 854 | "Into also uses protocols in many cases to give extra hints on how to handle your data. For example Python has a few different libraries to handle HDF5 files (`h5py`, `pytables`, `pandas.HDFStore`). By default when we see a URI like `myfile.hdf5` we currently use `h5py`. To override this behavior you can add a protocol string like `hdfstore://myfile.hdf5` to indicate that you want the special `pandas.HDFStore` format.\n", 855 | "\n", 856 | "*Note:* SQLAlchemy strings are a little odd in that they use three slashes by default (e.g. `sqlite:///my.db`) and *four* slashes when using absolute paths (e.g. `sqlite:////Users/Alice/data/my.db`)." 857 | ] 858 | }, 859 | { 860 | "cell_type": "markdown", 861 | "metadata": {}, 862 | "source": [ 863 | "### Exercise\n", 864 | "\n", 865 | "People use the `.json` extension in two ways.\n", 866 | "\n", 867 | "1. The entire file is one JSON blob, often a list of dictionaries. (Traditional JSON)\n", 868 | "2. Each line of the file is one JSON blob. (Line-delimited JSON, or JSON-lines)\n", 869 | "\n", 870 | "Parsers have a hard time figuring out which case is which. When reading an existing file `into` can usually figure out if the file is line-delimited or not. When creating a file, however, we don't know what your intention is. You can specify your intention by adding either a `json://` or `jsonlines://` protocol." 871 | ] 872 | }, 873 | { 874 | "cell_type": "markdown", 875 | "metadata": {}, 876 | "source": [ 877 | "Here we write our DataFrame to a JSON file."
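The two `.json` conventions can be illustrated with the standard library alone. This sketch writes both layouts by hand rather than through `into`; the file names and temp directory are just for the example:

```python
import json
import os
import tempfile

records = [{"name": "Alice", "balance": 100},
           {"name": "Bob", "balance": 200}]

workdir = tempfile.mkdtemp()

# Traditional JSON: the whole file is a single blob.
with open(os.path.join(workdir, "accounts.json"), "w") as f:
    json.dump(records, f)

# Line-delimited JSON: one complete JSON blob per line.
jsonl_path = os.path.join(workdir, "accounts.jsonl")
with open(jsonl_path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Reading line-delimited JSON back means parsing each line separately.
with open(jsonl_path) as f:
    roundtrip = [json.loads(line) for line in f]

assert roundtrip == records
```

Line-delimited files can be appended to and streamed one record at a time, which is why large datasets often prefer that layout.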
878 | ] 879 | }, 880 | { 881 | "cell_type": "code", 882 | "collapsed": true, 883 | "input": [ 884 | "into('accounts.json', df)\n", 885 | "!head accounts.json" 886 | ], 887 | "language": "python", 888 | "metadata": {}, 889 | "outputs": [ 890 | { 891 | "output_type": "stream", 892 | "stream": "stdout", 893 | "text": [ 894 | "[{\"balance\": 100, \"name\": \"Alice\"}, {\"balance\": 200, \"name\": \"Bob\"}, {\"balance\": 300, \"name\": \"Charlie\"}]" 895 | ] 896 | } 897 | ], 898 | "prompt_number": 24 899 | }, 900 | { 901 | "cell_type": "code", 902 | "collapsed": false, 903 | "input": [ 904 | "!rm accounts.json # Remove old file" 905 | ], 906 | "language": "python", 907 | "metadata": {}, 908 | "outputs": [], 909 | "prompt_number": 25 910 | }, 911 | { 912 | "cell_type": "markdown", 913 | "metadata": {}, 914 | "source": [ 915 | "This is the traditional single-JSON-blob-per-file format. \n", 916 | "\n", 917 | "Instead write our DataFrame in the line-delimited format by adding a `jsonlines://` protocol to the target string. Inspect the result to make sure that each line is a separate valid JSON blob." 918 | ] 919 | }, 920 | { 921 | "cell_type": "code", 922 | "collapsed": false, 923 | "input": [], 924 | "language": "python", 925 | "metadata": {}, 926 | "outputs": [] 927 | }, 928 | { 929 | "cell_type": "code", 930 | "collapsed": false, 931 | "input": [], 932 | "language": "python", 933 | "metadata": {}, 934 | "outputs": [] 935 | } 936 | ], 937 | "metadata": {} 938 | } 939 | ] 940 | } -------------------------------------------------------------------------------- /00-Motivation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:f41f038d7ff89a4de13d4540c9f9b364d8886a157a3b10a3162b144b3437a86c" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "# 0. 
Motivation\n", 16 | "\n", 17 | "Where we store data and how we interact with it are unfortunately related." 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "## Access SQL Database\n", 25 | "\n", 26 | "For example, we often access SQL databases using SQL query strings" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "collapsed": false, 32 | "input": [ 33 | "import sqlalchemy\n", 34 | "engine = sqlalchemy.create_engine('sqlite:////home/mrocklin/workspace/blaze/blaze/examples/data/iris.db')\n", 35 | "engine" 36 | ], 37 | "language": "python", 38 | "metadata": {}, 39 | "outputs": [ 40 | { 41 | "metadata": {}, 42 | "output_type": "pyout", 43 | "prompt_number": 1, 44 | "text": [ 45 | "Engine(sqlite:////home/mrocklin/workspace/blaze/blaze/examples/data/iris.db)" 46 | ] 47 | } 48 | ], 49 | "prompt_number": 1 50 | }, 51 | { 52 | "cell_type": "code", 53 | "collapsed": false, 54 | "input": [ 55 | "conn = engine.connect()\n", 56 | "list(conn.execute('''SELECT petal_length, petal_width, sepal_length, sepal_width, species\n", 57 | " FROM iris\n", 58 | " LIMIT 10'''))" 59 | ], 60 | "language": "python", 61 | "metadata": {}, 62 | "outputs": [ 63 | { 64 | "metadata": {}, 65 | "output_type": "pyout", 66 | "prompt_number": 2, 67 | "text": [ 68 | "[(1.4, 0.2, 5.1, 3.5, u'Iris-setosa'),\n", 69 | " (1.4, 0.2, 4.9, 3.0, u'Iris-setosa'),\n", 70 | " (1.3, 0.2, 4.7, 3.2, u'Iris-setosa'),\n", 71 | " (1.5, 0.2, 4.6, 3.1, u'Iris-setosa'),\n", 72 | " (1.4, 0.2, 5.0, 3.6, u'Iris-setosa'),\n", 73 | " (1.7, 0.4, 5.4, 3.9, u'Iris-setosa'),\n", 74 | " (1.4, 0.3, 4.6, 3.4, u'Iris-setosa'),\n", 75 | " (1.5, 0.2, 5.0, 3.4, u'Iris-setosa'),\n", 76 | " (1.4, 0.2, 4.4, 2.9, u'Iris-setosa'),\n", 77 | " (1.5, 0.1, 4.9, 3.1, u'Iris-setosa')]" 78 | ] 79 | } 80 | ], 81 | "prompt_number": 2 82 | }, 83 | { 84 | "cell_type": "code", 85 | "collapsed": false, 86 | "input": [ 87 | "list(conn.execute('''SELECT avg(petal_length), max(petal_width), species\n", 88 | " 
FROM iris\n", 89 | " GROUP BY species'''))" 90 | ], 91 | "language": "python", 92 | "metadata": {}, 93 | "outputs": [ 94 | { 95 | "metadata": {}, 96 | "output_type": "pyout", 97 | "prompt_number": 3, 98 | "text": [ 99 | "[(1.4620000000000002, 0.6, u'Iris-setosa'),\n", 100 | " (4.26, 1.8, u'Iris-versicolor'),\n", 101 | " (5.552, 2.5, u'Iris-virginica')]" 102 | ] 103 | } 104 | ], 105 | "prompt_number": 3 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "## Load database into memory, interact with Pandas\n", 112 | "\n", 113 | "Many Python users prefer the interactive Pandas DataFrame. We can use Pandas on this data by copying the entire table (actually quite small in this case) into memory and manipulate it directly with Pandas syntax and algorithms." 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "collapsed": false, 119 | "input": [ 120 | "import pandas as pd\n", 121 | "df = pd.read_sql('SELECT * FROM iris', engine)" 122 | ], 123 | "language": "python", 124 | "metadata": {}, 125 | "outputs": [], 126 | "prompt_number": 4 127 | }, 128 | { 129 | "cell_type": "code", 130 | "collapsed": false, 131 | "input": [ 132 | "df" 133 | ], 134 | "language": "python", 135 | "metadata": {}, 136 | "outputs": [ 137 | { 138 | "html": [ 139 | "
\n", 140 | "\n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | 
" \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 
| " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 
601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa
11 4.8 3.4 1.6 0.2 Iris-setosa
12 4.8 3.0 1.4 0.1 Iris-setosa
13 4.3 3.0 1.1 0.1 Iris-setosa
14 5.8 4.0 1.2 0.2 Iris-setosa
15 5.7 4.4 1.5 0.4 Iris-setosa
16 5.4 3.9 1.3 0.4 Iris-setosa
17 5.1 3.5 1.4 0.3 Iris-setosa
18 5.7 3.8 1.7 0.3 Iris-setosa
19 5.1 3.8 1.5 0.3 Iris-setosa
20 5.4 3.4 1.7 0.2 Iris-setosa
21 5.1 3.7 1.5 0.4 Iris-setosa
22 4.6 3.6 1.0 0.2 Iris-setosa
23 5.1 3.3 1.7 0.5 Iris-setosa
24 4.8 3.4 1.9 0.2 Iris-setosa
25 5.0 3.0 1.6 0.2 Iris-setosa
26 5.0 3.4 1.6 0.4 Iris-setosa
27 5.2 3.5 1.5 0.2 Iris-setosa
28 5.2 3.4 1.4 0.2 Iris-setosa
29 4.7 3.2 1.6 0.2 Iris-setosa
..................
120 6.9 3.2 5.7 2.3 Iris-virginica
121 5.6 2.8 4.9 2.0 Iris-virginica
122 7.7 2.8 6.7 2.0 Iris-virginica
123 6.3 2.7 4.9 1.8 Iris-virginica
124 6.7 3.3 5.7 2.1 Iris-virginica
125 7.2 3.2 6.0 1.8 Iris-virginica
126 6.2 2.8 4.8 1.8 Iris-virginica
127 6.1 3.0 4.9 1.8 Iris-virginica
128 6.4 2.8 5.6 2.1 Iris-virginica
129 7.2 3.0 5.8 1.6 Iris-virginica
130 7.4 2.8 6.1 1.9 Iris-virginica
131 7.9 3.8 6.4 2.0 Iris-virginica
132 6.4 2.8 5.6 2.2 Iris-virginica
133 6.3 2.8 5.1 1.5 Iris-virginica
134 6.1 2.6 5.6 1.4 Iris-virginica
135 7.7 3.0 6.1 2.3 Iris-virginica
136 6.3 3.4 5.6 2.4 Iris-virginica
137 6.4 3.1 5.5 1.8 Iris-virginica
138 6.0 3.0 4.8 1.8 Iris-virginica
139 6.9 3.1 5.4 2.1 Iris-virginica
140 6.7 3.1 5.6 2.4 Iris-virginica
141 6.9 3.1 5.1 2.3 Iris-virginica
142 5.8 2.7 5.1 1.9 Iris-virginica
143 6.8 3.2 5.9 2.3 Iris-virginica
144 6.7 3.3 5.7 2.5 Iris-virginica
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
\n", 642 | "

150 rows \u00d7 5 columns

\n", 643 | "
" 644 | ], 645 | "metadata": {}, 646 | "output_type": "pyout", 647 | "prompt_number": 5, 648 | "text": [ 649 | " sepal_length sepal_width petal_length petal_width species\n", 650 | "0 5.1 3.5 1.4 0.2 Iris-setosa\n", 651 | "1 4.9 3.0 1.4 0.2 Iris-setosa\n", 652 | "2 4.7 3.2 1.3 0.2 Iris-setosa\n", 653 | "3 4.6 3.1 1.5 0.2 Iris-setosa\n", 654 | "4 5.0 3.6 1.4 0.2 Iris-setosa\n", 655 | "5 5.4 3.9 1.7 0.4 Iris-setosa\n", 656 | "6 4.6 3.4 1.4 0.3 Iris-setosa\n", 657 | "7 5.0 3.4 1.5 0.2 Iris-setosa\n", 658 | "8 4.4 2.9 1.4 0.2 Iris-setosa\n", 659 | "9 4.9 3.1 1.5 0.1 Iris-setosa\n", 660 | "10 5.4 3.7 1.5 0.2 Iris-setosa\n", 661 | "11 4.8 3.4 1.6 0.2 Iris-setosa\n", 662 | "12 4.8 3.0 1.4 0.1 Iris-setosa\n", 663 | "13 4.3 3.0 1.1 0.1 Iris-setosa\n", 664 | "14 5.8 4.0 1.2 0.2 Iris-setosa\n", 665 | "15 5.7 4.4 1.5 0.4 Iris-setosa\n", 666 | "16 5.4 3.9 1.3 0.4 Iris-setosa\n", 667 | "17 5.1 3.5 1.4 0.3 Iris-setosa\n", 668 | "18 5.7 3.8 1.7 0.3 Iris-setosa\n", 669 | "19 5.1 3.8 1.5 0.3 Iris-setosa\n", 670 | "20 5.4 3.4 1.7 0.2 Iris-setosa\n", 671 | "21 5.1 3.7 1.5 0.4 Iris-setosa\n", 672 | "22 4.6 3.6 1.0 0.2 Iris-setosa\n", 673 | "23 5.1 3.3 1.7 0.5 Iris-setosa\n", 674 | "24 4.8 3.4 1.9 0.2 Iris-setosa\n", 675 | "25 5.0 3.0 1.6 0.2 Iris-setosa\n", 676 | "26 5.0 3.4 1.6 0.4 Iris-setosa\n", 677 | "27 5.2 3.5 1.5 0.2 Iris-setosa\n", 678 | "28 5.2 3.4 1.4 0.2 Iris-setosa\n", 679 | "29 4.7 3.2 1.6 0.2 Iris-setosa\n", 680 | ".. ... ... ... ... 
...\n", 681 | "120 6.9 3.2 5.7 2.3 Iris-virginica\n", 682 | "121 5.6 2.8 4.9 2.0 Iris-virginica\n", 683 | "122 7.7 2.8 6.7 2.0 Iris-virginica\n", 684 | "123 6.3 2.7 4.9 1.8 Iris-virginica\n", 685 | "124 6.7 3.3 5.7 2.1 Iris-virginica\n", 686 | "125 7.2 3.2 6.0 1.8 Iris-virginica\n", 687 | "126 6.2 2.8 4.8 1.8 Iris-virginica\n", 688 | "127 6.1 3.0 4.9 1.8 Iris-virginica\n", 689 | "128 6.4 2.8 5.6 2.1 Iris-virginica\n", 690 | "129 7.2 3.0 5.8 1.6 Iris-virginica\n", 691 | "130 7.4 2.8 6.1 1.9 Iris-virginica\n", 692 | "131 7.9 3.8 6.4 2.0 Iris-virginica\n", 693 | "132 6.4 2.8 5.6 2.2 Iris-virginica\n", 694 | "133 6.3 2.8 5.1 1.5 Iris-virginica\n", 695 | "134 6.1 2.6 5.6 1.4 Iris-virginica\n", 696 | "135 7.7 3.0 6.1 2.3 Iris-virginica\n", 697 | "136 6.3 3.4 5.6 2.4 Iris-virginica\n", 698 | "137 6.4 3.1 5.5 1.8 Iris-virginica\n", 699 | "138 6.0 3.0 4.8 1.8 Iris-virginica\n", 700 | "139 6.9 3.1 5.4 2.1 Iris-virginica\n", 701 | "140 6.7 3.1 5.6 2.4 Iris-virginica\n", 702 | "141 6.9 3.1 5.1 2.3 Iris-virginica\n", 703 | "142 5.8 2.7 5.1 1.9 Iris-virginica\n", 704 | "143 6.8 3.2 5.9 2.3 Iris-virginica\n", 705 | "144 6.7 3.3 5.7 2.5 Iris-virginica\n", 706 | "145 6.7 3.0 5.2 2.3 Iris-virginica\n", 707 | "146 6.3 2.5 5.0 1.9 Iris-virginica\n", 708 | "147 6.5 3.0 5.2 2.0 Iris-virginica\n", 709 | "148 6.2 3.4 5.4 2.3 Iris-virginica\n", 710 | "149 5.9 3.0 5.1 1.8 Iris-virginica\n", 711 | "\n", 712 | "[150 rows x 5 columns]" 713 | ] 714 | } 715 | ], 716 | "prompt_number": 5 717 | }, 718 | { 719 | "cell_type": "code", 720 | "collapsed": false, 721 | "input": [], 722 | "language": "python", 723 | "metadata": {}, 724 | "outputs": [], 725 | "prompt_number": 5 726 | }, 727 | { 728 | "cell_type": "code", 729 | "collapsed": false, 730 | "input": [], 731 | "language": "python", 732 | "metadata": {}, 733 | "outputs": [], 734 | "prompt_number": 5 735 | }, 736 | { 737 | "cell_type": "markdown", 738 | "metadata": {}, 739 | "source": [ 740 | "## Larger databases\n", 741 | "\n", 742 | "Pandas 
provides both a great user experience *and* fast in-memory algorithms. When those algorithms become obsolete (e.g. when datasets grow large) then we're forced to throw away the great user experience and switch back to using SQL.\n", 743 | "\n", 744 | "Here we connect to Hive, a database backed by Hadoop MapReduce." 745 | ] 746 | }, 747 | { 748 | "cell_type": "code", 749 | "collapsed": false, 750 | "input": [ 751 | "engine = sqlalchemy.create_engine('hive://hdfs@54.91.57.226:10000/default')\n", 752 | "conn = engine.connect()\n", 753 | "list(conn.execute('''SELECT * \n", 754 | " FROM iris \n", 755 | " LIMIT 10''')) # Imagine that this was big" 756 | ], 757 | "language": "python", 758 | "metadata": {}, 759 | "outputs": [ 760 | { 761 | "metadata": {}, 762 | "output_type": "pyout", 763 | "prompt_number": 6, 764 | "text": [ 765 | "[(5.099999904632568, 3.5, 1.399999976158142, 0.20000000298023224, u'Iris-setosa'),\n", 766 | " (4.900000095367432, 3.0, 1.399999976158142, 0.20000000298023224, u'Iris-setosa'),\n", 767 | " (4.699999809265137, 3.200000047683716, 1.2999999523162842, 0.20000000298023224, u'Iris-setosa'),\n", 768 | " (4.599999904632568, 3.0999999046325684, 1.5, 0.20000000298023224, u'Iris-setosa'),\n", 769 | " (5.0, 3.5999999046325684, 1.399999976158142, 0.20000000298023224, u'Iris-setosa'),\n", 770 | " (5.400000095367432, 3.9000000953674316, 1.7000000476837158, 0.4000000059604645, u'Iris-setosa'),\n", 771 | " (4.599999904632568, 3.4000000953674316, 1.399999976158142, 0.30000001192092896, u'Iris-setosa'),\n", 772 | " (5.0, 3.4000000953674316, 1.5, 0.20000000298023224, u'Iris-setosa'),\n", 773 | " (4.400000095367432, 2.9000000953674316, 1.399999976158142, 0.20000000298023224, u'Iris-setosa'),\n", 774 | " (4.900000095367432, 3.0999999046325684, 1.5, 0.10000000149011612, u'Iris-setosa')]" 775 | ] 776 | } 777 | ], 778 | "prompt_number": 6 779 | }, 780 | { 781 | "cell_type": "markdown", 782 | "metadata": {}, 783 | "source": [ 784 | "### Blaze provides a Pandas-like 
experience over foreign data\n", 785 | "\n", 786 | "We get the best of both worlds\n", 787 | "\n", 788 | "1. The scalable computation of Hive\n", 789 | "2. A Pandas-like interactive feel" 790 | ] 791 | }, 792 | { 793 | "cell_type": "code", 794 | "collapsed": false, 795 | "input": [ 796 | "from blaze import Data, by\n", 797 | "d = Data('hive://hdfs@54.91.57.226:10000/default')\n", 798 | "d.iris" 799 | ], 800 | "language": "python", 801 | "metadata": {}, 802 | "outputs": [ 803 | { 804 | "html": [ 805 | "\n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | " \n", 858 | " \n", 859 | " \n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa
" 907 | ], 908 | "metadata": {}, 909 | "output_type": "pyout", 910 | "prompt_number": 7, 911 | "text": [ 912 | " sepal_length sepal_width petal_length petal_width species\n", 913 | "0 5.1 3.5 1.4 0.2 Iris-setosa\n", 914 | "1 4.9 3.0 1.4 0.2 Iris-setosa\n", 915 | "2 4.7 3.2 1.3 0.2 Iris-setosa\n", 916 | "3 4.6 3.1 1.5 0.2 Iris-setosa\n", 917 | "4 5.0 3.6 1.4 0.2 Iris-setosa\n", 918 | "5 5.4 3.9 1.7 0.4 Iris-setosa\n", 919 | "6 4.6 3.4 1.4 0.3 Iris-setosa\n", 920 | "7 5.0 3.4 1.5 0.2 Iris-setosa\n", 921 | "8 4.4 2.9 1.4 0.2 Iris-setosa\n", 922 | "9 4.9 3.1 1.5 0.1 Iris-setosa\n", 923 | "..." 924 | ] 925 | } 926 | ], 927 | "prompt_number": 7 928 | }, 929 | { 930 | "cell_type": "code", 931 | "collapsed": false, 932 | "input": [ 933 | "by(d.iris.species, largest=d.iris.sepal_length.max(),\n", 934 | " smallest=d.iris.sepal_length.min())" 935 | ], 936 | "language": "python", 937 | "metadata": {}, 938 | "outputs": [ 939 | { 940 | "html": [ 941 | "\n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | "
specieslargestsmallest
0 Iris-setosa 5.8 4.3
1 Iris-versicolor 7.0 4.9
2 Iris-virginica 7.9 4.9
" 971 | ], 972 | "metadata": {}, 973 | "output_type": "pyout", 974 | "prompt_number": 8, 975 | "text": [ 976 | " species largest smallest\n", 977 | "0 Iris-setosa 5.8 4.3\n", 978 | "1 Iris-versicolor 7.0 4.9\n", 979 | "2 Iris-virginica 7.9 4.9" 980 | ] 981 | } 982 | ], 983 | "prompt_number": 8 984 | }, 985 | { 986 | "cell_type": "markdown", 987 | "metadata": {}, 988 | "source": [ 989 | "### Blaze doesn't move the data back and forth, it moves your query back and forth\n", 990 | "\n", 991 | "Here we use the internal API to show the translated Blaze query that Blaze sends to the Hive database." 992 | ] 993 | }, 994 | { 995 | "cell_type": "code", 996 | "collapsed": false, 997 | "input": [ 998 | "from blaze import compute\n", 999 | "\n", 1000 | "query = by(d.iris.species, largest=d.iris.sepal_length.max(),\n", 1001 | " smallest=d.iris.sepal_length.min())\n", 1002 | "\n", 1003 | "print compute(query)" 1004 | ], 1005 | "language": "python", 1006 | "metadata": {}, 1007 | "outputs": [ 1008 | { 1009 | "output_type": "stream", 1010 | "stream": "stdout", 1011 | "text": [ 1012 | "SELECT iris.species, max(iris.sepal_length) AS largest, min(iris.sepal_length) AS smallest \n", 1013 | "FROM iris GROUP BY iris.species\n" 1014 | ] 1015 | } 1016 | ], 1017 | "prompt_number": 9 1018 | } 1019 | ], 1020 | "metadata": {} 1021 | } 1022 | ] 1023 | } -------------------------------------------------------------------------------- /02-into-Datatypes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:8f94d037b8b415d618faa18cc783a5bdf6c8d4830a454dd9ee2686367f9d24b9" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "\"Blaze\n", 18 | "\n", 19 | "# Getting started with `into`\n", 20 | "\n", 21 | "* Full tutorial available at http://github.com/ContinuumIO/blaze-tutorial\n", 22 
| "* Install software with `conda install -c blaze blaze`" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## 2. Datatypes for Performance tuning\n", 30 | "\n", 31 | "* *Do we store these as integers or floats? `int32` or `int64`? `int8`?*\n", 32 | "* *Do we store times as datetimes or as strings?*\n", 33 | "* *Do we store these strings as variable length or fixed length*?\n", 34 | "* *Do we know how large this array will be?*\n", 35 | "\n", 36 | "As we encode values as bits we make choices; those choices can affect performance. We encode how to convert values to bits and back as a *datatype*. You've seen data types before in many forms including C types like `long`, `double` and `double[100]`, numpy dtypes like `i4` and `f8` or Python types like `int`, and `float`. Other systems like SQL, HDF5, etc. have similar datatype systems with different names.\n", 37 | "\n", 38 | "To manage datatypes across different systems we use `datashape` a datatype system that maps cleanly on to all systems with which `into` interacts. This one system can translate into any of the others.\n", 39 | "\n", 40 | "In this section we'll talk about the following\n", 41 | "\n", 42 | "1. How to discover the datatype of your data, no matter how it is stored\n", 43 | "2. Minor tweaks you can do to that datatype to improve performance in certain storage systems\n", 44 | "3. How to create new datasets easily" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## 2.1 DataShape and `discover`\n", 52 | "\n", 53 | "We introduce datashape, an all-encompassing datatype language, and `discover`, a function that does all of the work for you.\n", 54 | "\n", 55 | "The discover function returns the datashape of an object. Lets look at a few examples." 
56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "collapsed": false, 61 | "input": [ 62 | "from into import discover, into, resource\n", 63 | "discover(1)" 64 | ], 65 | "language": "python", 66 | "metadata": {}, 67 | "outputs": [ 68 | { 69 | "metadata": {}, 70 | "output_type": "pyout", 71 | "prompt_number": 1, 72 | "text": [ 73 | "ctype(\"int64\")" 74 | ] 75 | } 76 | ], 77 | "prompt_number": 1 78 | }, 79 | { 80 | "cell_type": "code", 81 | "collapsed": false, 82 | "input": [ 83 | "discover([1, 2, 3])" 84 | ], 85 | "language": "python", 86 | "metadata": {}, 87 | "outputs": [ 88 | { 89 | "metadata": {}, 90 | "output_type": "pyout", 91 | "prompt_number": 2, 92 | "text": [ 93 | "dshape(\"3 * int64\")" 94 | ] 95 | } 96 | ], 97 | "prompt_number": 2 98 | }, 99 | { 100 | "cell_type": "code", 101 | "collapsed": false, 102 | "input": [ 103 | "discover([[1, 2, 3],\n", 104 | " [4, 5, 6]])" 105 | ], 106 | "language": "python", 107 | "metadata": {}, 108 | "outputs": [ 109 | { 110 | "metadata": {}, 111 | "output_type": "pyout", 112 | "prompt_number": 3, 113 | "text": [ 114 | "dshape(\"2 * 3 * int64\")" 115 | ] 116 | } 117 | ], 118 | "prompt_number": 3 119 | }, 120 | { 121 | "cell_type": "code", 122 | "collapsed": false, 123 | "input": [ 124 | "discover([{'x': 1, 'y': 1.0},\n", 125 | " {'x': 2, 'y': 2.0},\n", 126 | " {'x': 3, 'y': 3.0}])" 127 | ], 128 | "language": "python", 129 | "metadata": {}, 130 | "outputs": [ 131 | { 132 | "metadata": {}, 133 | "output_type": "pyout", 134 | "prompt_number": 4, 135 | "text": [ 136 | "dshape(\"3 * {x: int64, y: float64}\")" 137 | ] 138 | } 139 | ], 140 | "prompt_number": 4 141 | }, 142 | { 143 | "cell_type": "code", 144 | "collapsed": false, 145 | "input": [ 146 | "import pandas as pd\n", 147 | "df = pd.DataFrame([['Alice', 100],\n", 148 | " ['Bob', 200],\n", 149 | " ['Charlie', 300]], columns=['name', 'balance'])\n", 150 | "discover(df)" 151 | ], 152 | "language": "python", 153 | "metadata": {}, 154 | "outputs": [ 155 | { 156 | "metadata": 
{}, 157 | "output_type": "pyout", 158 | "prompt_number": 5, 159 | "text": [ 160 | "dshape(\"3 * {name: string, balance: int64}\")" 161 | ] 162 | } 163 | ], 164 | "prompt_number": 5 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "By looking closely at these examples we see the structure of datashape. Elements have types like `int64` or `string`. Records/structs/groups of elements have record dtypes like `{x: int64, y: float64}`. Lengths of collections are encoded by numbers like `3 * ` for \"three of\" or `2 * 3 * ` for \"a two-by-three grid of\"." 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "#### Exercise\n", 178 | "\n", 179 | "Construct data with the following datashapes. Use any container type (e.g. `list`, `pd.DataFrame`, `np.ndarray`).\n", 180 | "\n", 181 | " 2 * int64\n", 182 | " 2 * string\n", 183 | " {name: string, id: int}\n", 184 | " datetime\n", 185 | " {name: string, id: int, payments: 2 * datetime}\n", 186 | " 2 * {name: string, id: int, payments: 2 * datetime}\n", 187 | " 5 * 5 * 5 * float32\n", 188 | " \n", 189 | "Use `discover` to verify your answers." 
190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "collapsed": false, 195 | "input": [ 196 | "# Should be 2 * int64\n", 197 | "discover([1, 2]) " 198 | ], 199 | "language": "python", 200 | "metadata": {}, 201 | "outputs": [ 202 | { 203 | "metadata": {}, 204 | "output_type": "pyout", 205 | "prompt_number": 6, 206 | "text": [ 207 | "dshape(\"2 * int64\")" 208 | ] 209 | } 210 | ], 211 | "prompt_number": 6 212 | }, 213 | { 214 | "cell_type": "code", 215 | "collapsed": false, 216 | "input": [], 217 | "language": "python", 218 | "metadata": {}, 219 | "outputs": [], 220 | "prompt_number": 6 221 | }, 222 | { 223 | "cell_type": "code", 224 | "collapsed": false, 225 | "input": [], 226 | "language": "python", 227 | "metadata": {}, 228 | "outputs": [], 229 | "prompt_number": 6 230 | }, 231 | { 232 | "cell_type": "code", 233 | "collapsed": false, 234 | "input": [], 235 | "language": "python", 236 | "metadata": {}, 237 | "outputs": [], 238 | "prompt_number": 6 239 | }, 240 | { 241 | "cell_type": "code", 242 | "collapsed": false, 243 | "input": [], 244 | "language": "python", 245 | "metadata": {}, 246 | "outputs": [], 247 | "prompt_number": 6 248 | }, 249 | { 250 | "cell_type": "code", 251 | "collapsed": false, 252 | "input": [], 253 | "language": "python", 254 | "metadata": {}, 255 | "outputs": [], 256 | "prompt_number": 6 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "### Discover doesn't care about storage format\n", 263 | "\n", 264 | "The `discover` function doesn't care if your data lives in a Python list, Pandas DataFrame, NumPy Array, CSV file, PySpark RDD, or SQL database. \n", 265 | "\n", 266 | "In other words, using `into` preserves datashape." 
267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "collapsed": false, 272 | "input": [ 273 | "discover(df)" 274 | ], 275 | "language": "python", 276 | "metadata": {}, 277 | "outputs": [ 278 | { 279 | "metadata": {}, 280 | "output_type": "pyout", 281 | "prompt_number": 7, 282 | "text": [ 283 | "dshape(\"3 * {name: string, balance: int64}\")" 284 | ] 285 | } 286 | ], 287 | "prompt_number": 7 288 | }, 289 | { 290 | "cell_type": "code", 291 | "collapsed": false, 292 | "input": [ 293 | "import numpy as np\n", 294 | "x = into(np.ndarray, df)\n", 295 | "discover(x) # different container, same datashape" 296 | ], 297 | "language": "python", 298 | "metadata": {}, 299 | "outputs": [ 300 | { 301 | "metadata": {}, 302 | "output_type": "pyout", 303 | "prompt_number": 8, 304 | "text": [ 305 | "dshape(\"3 * {name: string, balance: int64}\")" 306 | ] 307 | } 308 | ], 309 | "prompt_number": 8 310 | }, 311 | { 312 | "cell_type": "code", 313 | "collapsed": false, 314 | "input": [ 315 | "t = into('sqlite:///:memory:::mydf', df)\n", 316 | "discover(t) # different container, mostly the same datashape" 317 | ], 318 | "language": "python", 319 | "metadata": {}, 320 | "outputs": [ 321 | { 322 | "metadata": {}, 323 | "output_type": "pyout", 324 | "prompt_number": 9, 325 | "text": [ 326 | "dshape(\"var * {name: string, balance: int64}\")" 327 | ] 328 | } 329 | ], 330 | "prompt_number": 9 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "### Discover works on nested structures\n", 337 | "\n", 338 | "Call discover on a single table of our baseball statistics database." 
339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "collapsed": false, 344 | "input": [ 345 | "salaries = resource('sqlite:///data/lahman2013.sqlite::Salaries')\n", 346 | "discover(salaries)" 347 | ], 348 | "language": "python", 349 | "metadata": {}, 350 | "outputs": [ 351 | { 352 | "metadata": {}, 353 | "output_type": "pyout", 354 | "prompt_number": 10, 355 | "text": [ 356 | "dshape(\"\"\"var * {\n", 357 | " yearID: ?int32,\n", 358 | " teamID: ?string,\n", 359 | " lgID: ?string,\n", 360 | " playerID: ?string,\n", 361 | " salary: ?float64\n", 362 | " }\"\"\")" 363 | ] 364 | } 365 | ], 366 | "prompt_number": 10 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": {}, 371 | "source": [ 372 | "And then call it on the entire database" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "collapsed": false, 378 | "input": [ 379 | "db = resource('sqlite:///data/lahman2013.sqlite')\n", 380 | "discover(db)" 381 | ], 382 | "language": "python", 383 | "metadata": {}, 384 | "outputs": [ 385 | { 386 | "metadata": {}, 387 | "output_type": "pyout", 388 | "prompt_number": 11, 389 | "text": [ 390 | "dshape(\"\"\"{\n", 391 | " AllstarFull: var * {\n", 392 | " playerID: ?string,\n", 393 | " yearID: ?int32,\n", 394 | " gameNum: ?int32,\n", 395 | " gameID: ?string,\n", 396 | " teamID: ?string,\n", 397 | " lgID: ?string,\n", 398 | " GP: ?int32,\n", 399 | " startingPos: ?int32\n", 400 | " },\n", 401 | " Appearances: var * {\n", 402 | " yearID: ?int32,\n", 403 | " teamID: ?string,\n", 404 | " lgID: ?string,\n", 405 | " playerID: ?string,\n", 406 | " G_all: ?int32,\n", 407 | " GS: ?int32,\n", 408 | " G_batting: ?int32,\n", 409 | " G_defense: ?int32,\n", 410 | " G_p: ?int32,\n", 411 | " G_c: ?int32,\n", 412 | " G_1b: ?int32,\n", 413 | " G_2b: ?int32,\n", 414 | " G_3b: ?int32,\n", 415 | " G_ss: ?int32,\n", 416 | " G_lf: ?int32,\n", 417 | " G_cf: ?int32,\n", 418 | " G_rf: ?int32,\n", 419 | " G_of: ?int32,\n", 420 | " G_dh: ?int32,\n", 421 | " G_ph: ?int32,\n", 422 | 
" G_pr: ?int32\n", 423 | " },\n", 424 | " AwardsManagers: var * {\n", 425 | " playerID: ?string,\n", 426 | " awardID: ?string,\n", 427 | " yearID: ?int32,\n", 428 | " lgID: ?string,\n", 429 | " tie: ?string,\n", 430 | " notes: ?string\n", 431 | " },\n", 432 | " AwardsPlayers: var * {\n", 433 | " playerID: ?string,\n", 434 | " awardID: ?string,\n", 435 | " yearID: ?int32,\n", 436 | " lgID: ?string,\n", 437 | " tie: ?string,\n", 438 | " notes: ?string\n", 439 | " },\n", 440 | " AwardsShareManagers: var * {\n", 441 | " awardID: ?string,\n", 442 | " yearID: ?int32,\n", 443 | " lgID: ?string,\n", 444 | " playerID: ?string,\n", 445 | " pointsWon: ?int32,\n", 446 | " pointsMax: ?int32,\n", 447 | " votesFirst: ?int32\n", 448 | " },\n", 449 | " AwardsSharePlayers: var * {\n", 450 | " awardID: ?string,\n", 451 | " yearID: ?int32,\n", 452 | " lgID: ?string,\n", 453 | " playerID: ?string,\n", 454 | " pointsWon: ?float64,\n", 455 | " pointsMax: ?int32,\n", 456 | " votesFirst: ?float64\n", 457 | " },\n", 458 | " Batting: var * {\n", 459 | " playerID: ?string,\n", 460 | " yearID: ?int32,\n", 461 | " stint: ?int32,\n", 462 | " teamID: ?string,\n", 463 | " lgID: ?string,\n", 464 | " G: ?int32,\n", 465 | " G_batting: ?int32,\n", 466 | " AB: ?int32,\n", 467 | " R: ?int32,\n", 468 | " H: ?int32,\n", 469 | " 2B: ?int32,\n", 470 | " 3B: ?int32,\n", 471 | " HR: ?int32,\n", 472 | " RBI: ?int32,\n", 473 | " SB: ?int32,\n", 474 | " CS: ?int32,\n", 475 | " BB: ?int32,\n", 476 | " SO: ?int32,\n", 477 | " IBB: ?int32,\n", 478 | " HBP: ?int32,\n", 479 | " SH: ?int32,\n", 480 | " SF: ?int32,\n", 481 | " GIDP: ?int32,\n", 482 | " G_old: ?int32\n", 483 | " },\n", 484 | " BattingPost: var * {\n", 485 | " yearID: ?int32,\n", 486 | " round: ?string,\n", 487 | " playerID: ?string,\n", 488 | " teamID: ?string,\n", 489 | " lgID: ?string,\n", 490 | " G: ?int32,\n", 491 | " AB: ?int32,\n", 492 | " R: ?int32,\n", 493 | " H: ?int32,\n", 494 | " 2B: ?int32,\n", 495 | " 3B: ?int32,\n", 496 | " HR: ?int32,\n", 
497 | " RBI: ?int32,\n", 498 | " SB: ?int32,\n", 499 | " CS: ?int32,\n", 500 | " BB: ?int32,\n", 501 | " SO: ?int32,\n", 502 | " IBB: ?int32,\n", 503 | " HBP: ?int32,\n", 504 | " SH: ?int32,\n", 505 | " SF: ?int32,\n", 506 | " GIDP: ?int32\n", 507 | " },\n", 508 | " Fielding: var * {\n", 509 | " playerID: ?string,\n", 510 | " yearID: ?int32,\n", 511 | " stint: ?int32,\n", 512 | " teamID: ?string,\n", 513 | " lgID: ?string,\n", 514 | " POS: ?string,\n", 515 | " G: ?int32,\n", 516 | " GS: ?int32,\n", 517 | " InnOuts: ?int32,\n", 518 | " PO: ?int32,\n", 519 | " A: ?int32,\n", 520 | " E: ?int32,\n", 521 | " DP: ?int32,\n", 522 | " PB: ?int32,\n", 523 | " WP: ?int32,\n", 524 | " SB: ?int32,\n", 525 | " CS: ?int32,\n", 526 | " ZR: ?float64\n", 527 | " },\n", 528 | " FieldingOF: var * {\n", 529 | " playerID: ?string,\n", 530 | " yearID: ?int32,\n", 531 | " stint: ?int32,\n", 532 | " Glf: ?int32,\n", 533 | " Gcf: ?int32,\n", 534 | " Grf: ?int32\n", 535 | " },\n", 536 | " FieldingPost: var * {\n", 537 | " playerID: ?string,\n", 538 | " yearID: ?int32,\n", 539 | " teamID: ?string,\n", 540 | " lgID: ?string,\n", 541 | " round: ?string,\n", 542 | " POS: ?string,\n", 543 | " G: ?int32,\n", 544 | " GS: ?int32,\n", 545 | " InnOuts: ?int32,\n", 546 | " PO: ?int32,\n", 547 | " A: ?int32,\n", 548 | " E: ?int32,\n", 549 | " DP: ?int32,\n", 550 | " TP: ?int32,\n", 551 | " PB: ?int32,\n", 552 | " SB: ?int32,\n", 553 | " CS: ?int32\n", 554 | " },\n", 555 | " HallOfFame: var * {\n", 556 | " playerID: ?string,\n", 557 | " yearid: ?int32,\n", 558 | " votedBy: ?string,\n", 559 | " ballots: ?int32,\n", 560 | " needed: ?int32,\n", 561 | " votes: ?int32,\n", 562 | " inducted: ?string,\n", 563 | " category: ?string,\n", 564 | " needed_note: ?string\n", 565 | " },\n", 566 | " Managers: var * {\n", 567 | " playerID: ?string,\n", 568 | " yearID: ?int32,\n", 569 | " teamID: ?string,\n", 570 | " lgID: ?string,\n", 571 | " inseason: ?int32,\n", 572 | " G: ?int32,\n", 573 | " W: ?int32,\n", 574 | " L: 
?int32,\n", 575 | " rank: ?int32,\n", 576 | " plyrMgr: ?string\n", 577 | " },\n", 578 | " ManagersHalf: var * {\n", 579 | " playerID: ?string,\n", 580 | " yearID: ?int32,\n", 581 | " teamID: ?string,\n", 582 | " lgID: ?string,\n", 583 | " inseason: ?int32,\n", 584 | " half: ?int32,\n", 585 | " G: ?int32,\n", 586 | " W: ?int32,\n", 587 | " L: ?int32,\n", 588 | " rank: ?int32\n", 589 | " },\n", 590 | " Master: var * {\n", 591 | " playerID: ?string,\n", 592 | " birthYear: ?int32,\n", 593 | " birthMonth: ?int32,\n", 594 | " birthDay: ?int32,\n", 595 | " birthCountry: ?string,\n", 596 | " birthState: ?string,\n", 597 | " birthCity: ?string,\n", 598 | " deathYear: ?int32,\n", 599 | " deathMonth: ?int32,\n", 600 | " deathDay: ?int32,\n", 601 | " deathCountry: ?string,\n", 602 | " deathState: ?string,\n", 603 | " deathCity: ?string,\n", 604 | " nameFirst: ?string,\n", 605 | " nameLast: ?string,\n", 606 | " nameGiven: ?string,\n", 607 | " weight: ?int32,\n", 608 | " height: ?float64,\n", 609 | " bats: ?string,\n", 610 | " throws: ?string,\n", 611 | " debut: ?float64,\n", 612 | " finalGame: ?float64,\n", 613 | " retroID: ?string,\n", 614 | " bbrefID: ?string\n", 615 | " },\n", 616 | " Pitching: var * {\n", 617 | " playerID: ?string,\n", 618 | " yearID: ?int32,\n", 619 | " stint: ?int32,\n", 620 | " teamID: ?string,\n", 621 | " lgID: ?string,\n", 622 | " W: ?int32,\n", 623 | " L: ?int32,\n", 624 | " G: ?int32,\n", 625 | " GS: ?int32,\n", 626 | " CG: ?int32,\n", 627 | " SHO: ?int32,\n", 628 | " SV: ?int32,\n", 629 | " IPouts: ?int32,\n", 630 | " H: ?int32,\n", 631 | " ER: ?int32,\n", 632 | " HR: ?int32,\n", 633 | " BB: ?int32,\n", 634 | " SO: ?int32,\n", 635 | " BAOpp: ?float64,\n", 636 | " ERA: ?float64,\n", 637 | " IBB: ?int32,\n", 638 | " WP: ?int32,\n", 639 | " HBP: ?int32,\n", 640 | " BK: ?int32,\n", 641 | " BFP: ?int32,\n", 642 | " GF: ?int32,\n", 643 | " R: ?int32,\n", 644 | " SH: ?int32,\n", 645 | " SF: ?int32,\n", 646 | " GIDP: ?int32\n", 647 | " },\n", 648 | " 
PitchingPost: var * {\n", 649 | " playerID: ?string,\n", 650 | " yearID: ?int32,\n", 651 | " round: ?string,\n", 652 | " teamID: ?string,\n", 653 | " lgID: ?string,\n", 654 | " W: ?int32,\n", 655 | " L: ?int32,\n", 656 | " G: ?int32,\n", 657 | " GS: ?int32,\n", 658 | " CG: ?int32,\n", 659 | " SHO: ?int32,\n", 660 | " SV: ?int32,\n", 661 | " IPouts: ?int32,\n", 662 | " H: ?int32,\n", 663 | " ER: ?int32,\n", 664 | " HR: ?int32,\n", 665 | " BB: ?int32,\n", 666 | " SO: ?int32,\n", 667 | " BAOpp: ?float64,\n", 668 | " ERA: ?float64,\n", 669 | " IBB: ?int32,\n", 670 | " WP: ?int32,\n", 671 | " HBP: ?int32,\n", 672 | " BK: ?int32,\n", 673 | " BFP: ?int32,\n", 674 | " GF: ?int32,\n", 675 | " R: ?int32,\n", 676 | " SH: ?int32,\n", 677 | " SF: ?int32,\n", 678 | " GIDP: ?int32\n", 679 | " },\n", 680 | " Salaries: var * {\n", 681 | " yearID: ?int32,\n", 682 | " teamID: ?string,\n", 683 | " lgID: ?string,\n", 684 | " playerID: ?string,\n", 685 | " salary: ?float64\n", 686 | " },\n", 687 | " Schools: var * {\n", 688 | " schoolID: ?string,\n", 689 | " schoolName: ?string,\n", 690 | " schoolCity: ?string,\n", 691 | " schoolState: ?string,\n", 692 | " schoolNick: ?string\n", 693 | " },\n", 694 | " SchoolsPlayers: var * {\n", 695 | " playerID: ?string,\n", 696 | " schoolID: ?string,\n", 697 | " yearMin: ?int32,\n", 698 | " yearMax: ?int32\n", 699 | " },\n", 700 | " SeriesPost: var * {\n", 701 | " yearID: ?int32,\n", 702 | " round: ?string,\n", 703 | " teamIDwinner: ?string,\n", 704 | " lgIDwinner: ?string,\n", 705 | " teamIDloser: ?string,\n", 706 | " lgIDloser: ?string,\n", 707 | " wins: ?int32,\n", 708 | " losses: ?int32,\n", 709 | " ties: ?int32\n", 710 | " },\n", 711 | " Teams: var * {\n", 712 | " yearID: ?int32,\n", 713 | " lgID: ?string,\n", 714 | " teamID: ?string,\n", 715 | " franchID: ?string,\n", 716 | " divID: ?string,\n", 717 | " Rank: ?int32,\n", 718 | " G: ?int32,\n", 719 | " Ghome: ?int32,\n", 720 | " W: ?int32,\n", 721 | " L: ?int32,\n", 722 | " DivWin: ?string,\n", 
723 | " WCWin: ?string,\n", 724 | " LgWin: ?string,\n", 725 | " WSWin: ?string,\n", 726 | " R: ?int32,\n", 727 | " AB: ?int32,\n", 728 | " H: ?int32,\n", 729 | " 2B: ?int32,\n", 730 | " 3B: ?int32,\n", 731 | " HR: ?int32,\n", 732 | " BB: ?int32,\n", 733 | " SO: ?int32,\n", 734 | " SB: ?int32,\n", 735 | " CS: ?int32,\n", 736 | " HBP: ?int32,\n", 737 | " SF: ?int32,\n", 738 | " RA: ?int32,\n", 739 | " ER: ?int32,\n", 740 | " ERA: ?float64,\n", 741 | " CG: ?int32,\n", 742 | " SHO: ?int32,\n", 743 | " SV: ?int32,\n", 744 | " IPouts: ?int32,\n", 745 | " HA: ?int32,\n", 746 | " HRA: ?int32,\n", 747 | " BBA: ?int32,\n", 748 | " SOA: ?int32,\n", 749 | " E: ?int32,\n", 750 | " DP: ?int32,\n", 751 | " FP: ?float64,\n", 752 | " name: ?string,\n", 753 | " park: ?string,\n", 754 | " attendance: ?int32,\n", 755 | " BPF: ?int32,\n", 756 | " PPF: ?int32,\n", 757 | " teamIDBR: ?string,\n", 758 | " teamIDlahman45: ?string,\n", 759 | " teamIDretro: ?string\n", 760 | " },\n", 761 | " TeamsFranchises: var * {\n", 762 | " franchID: ?string,\n", 763 | " franchName: ?string,\n", 764 | " active: ?string,\n", 765 | " NAassoc: ?string\n", 766 | " },\n", 767 | " TeamsHalf: var * {\n", 768 | " yearID: ?int32,\n", 769 | " lgID: ?string,\n", 770 | " teamID: ?string,\n", 771 | " Half: ?string,\n", 772 | " divID: ?string,\n", 773 | " DivWin: ?string,\n", 774 | " Rank: ?int32,\n", 775 | " G: ?int32,\n", 776 | " W: ?int32,\n", 777 | " L: ?int32\n", 778 | " },\n", 779 | " temp: var * {ID: ?int32, namefull: ?string, born: ?float64}\n", 780 | " }\"\"\")" 781 | ] 782 | } 783 | ], 784 | "prompt_number": 11 785 | }, 786 | { 787 | "cell_type": "markdown", 788 | "metadata": {}, 789 | "source": [ 790 | "## 2.2 Tweaking Datashapes for Performance\n", 791 | "\n", 792 | "Some storage systems don't cleanly support some datashapes. For example\n", 793 | "\n", 794 | "1. SQL doesn't support nested data\n", 795 | "2. HDF5 doesn't cleanly support datetimes\n", 796 | "3. 
Variable length strings are often costly in binary stores\n", 797 | "4. NumPy has poor missing value support\n", 798 | "\n", 799 | "Because of this we sometimes want to slightly change a datashape during migration." 800 | ] 801 | }, 802 | { 803 | "cell_type": "markdown", 804 | "metadata": {}, 805 | "source": [ 806 | "### Fixed Length Strings\n", 807 | "\n", 808 | "A common example is the use of strings with a known maximum length, called \"fixed length strings,\" and strings with particular character encodings, like ASCII or UTF-8.\n", 809 | "\n", 810 | "Consider moving the following data, including strings, into a numpy array." 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "collapsed": false, 816 | "input": [ 817 | "data = [{'name': 'Alice', 'balance': 100},\n", 818 | " {'name': 'Bob', 'balance': 200}]\n", 819 | "into(np.ndarray, data)" 820 | ], 821 | "language": "python", 822 | "metadata": {}, 823 | "outputs": [ 824 | { 825 | "metadata": {}, 826 | "output_type": "pyout", 827 | "prompt_number": 12, 828 | "text": [ 829 | "array([(100, 'Alice'), (200, 'Bob')], \n", 830 | " dtype=[('balance', '\n", 1005 | "\n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | " \n", 1018 | " \n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | " \n", 1030 | "
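The cost difference is easy to see in NumPy directly. A brief sketch, independent of `into`: variable-length Python strings force an `object` array holding pointers, while a fixed-length dtype such as `'S5'` packs the bytes inline, which is more compact and faster to scan:

```python
import numpy as np

# Variable-length strings: NumPy stores pointers to Python string objects.
varlen = np.array(['Alice', 'Bob'], dtype=object)

# Fixed-length bytestrings: exactly five bytes stored inline per element,
# so the data sits contiguously in memory.
fixed = np.array([b'Alice', b'Bob'], dtype='S5')
```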
01
0 33.1 -89.2
1 37.0-141.5
2 41.0-120.5
\n", 1031 | "" 1032 | ], 1033 | "metadata": {}, 1034 | "output_type": "pyout", 1035 | "prompt_number": 17, 1036 | "text": [ 1037 | " 0 1\n", 1038 | "0 33.1 -89.2\n", 1039 | "1 37.0 -141.5\n", 1040 | "2 41.0 -120.5" 1041 | ] 1042 | } 1043 | ], 1044 | "prompt_number": 17 1045 | }, 1046 | { 1047 | "cell_type": "markdown", 1048 | "metadata": {}, 1049 | "source": [ 1050 | "Note that, because our original data was stored in a format that didn't include the column names, the output lacks them as well. We complement our data with a datashape." 1051 | ] 1052 | }, 1053 | { 1054 | "cell_type": "code", 1055 | "collapsed": false, 1056 | "input": [ 1057 | "ds = dshape('var * {lat: float64, long: float64}')\n", 1058 | "into(pd.DataFrame, data, dshape=ds)" 1059 | ], 1060 | "language": "python", 1061 | "metadata": {}, 1062 | "outputs": [ 1063 | { 1064 | "html": [ 1065 | "
\n", 1066 | "\n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | " \n", 1074 | " \n", 1075 | " \n", 1076 | " \n", 1077 | " \n", 1078 | " \n", 1079 | " \n", 1080 | " \n", 1081 | " \n", 1082 | " \n", 1083 | " \n", 1084 | " \n", 1085 | " \n", 1086 | " \n", 1087 | " \n", 1088 | " \n", 1089 | " \n", 1090 | " \n", 1091 | "
latlong
0 33.1 -89.2
1 37.0-141.5
2 41.0-120.5
\n", 1092 | "
" 1093 | ], 1094 | "metadata": {}, 1095 | "output_type": "pyout", 1096 | "prompt_number": 18, 1097 | "text": [ 1098 | " lat long\n", 1099 | "0 33.1 -89.2\n", 1100 | "1 37.0 -141.5\n", 1101 | "2 41.0 -120.5" 1102 | ] 1103 | } 1104 | ], 1105 | "prompt_number": 18 1106 | }, 1107 | { 1108 | "cell_type": "markdown", 1109 | "metadata": {}, 1110 | "source": [ 1111 | "## Create new datasets with `resource` and datashape\n", 1112 | "\n", 1113 | "In the last section we used `resource` to acquire existing datasets from string URIs. \n", 1114 | "\n", 1115 | "We also use the `resource` function to create new datasets given a URI and a datashape." 1116 | ] 1117 | }, 1118 | { 1119 | "cell_type": "markdown", 1120 | "metadata": {}, 1121 | "source": [ 1122 | "#### Example\n", 1123 | "\n", 1124 | "We create a new HDF5 dataset to store 100 by 100 integers." 1125 | ] 1126 | }, 1127 | { 1128 | "cell_type": "code", 1129 | "collapsed": false, 1130 | "input": [ 1131 | "ds = dshape('100 * 100 * int64')\n", 1132 | "dset = resource('myfile.hdf5::/x', dshape=ds)\n", 1133 | "dset" 1134 | ], 1135 | "language": "python", 1136 | "metadata": {}, 1137 | "outputs": [ 1138 | { 1139 | "metadata": {}, 1140 | "output_type": "pyout", 1141 | "prompt_number": 19, 1142 | "text": [ 1143 | "" 1144 | ] 1145 | } 1146 | ], 1147 | "prompt_number": 19 1148 | }, 1149 | { 1150 | "cell_type": "markdown", 1151 | "metadata": {}, 1152 | "source": [ 1153 | "#### Exercise\n", 1154 | "\n", 1155 | "Create a new SQLite table named `transactions` in `'data/my.db'` with the following datashape\n", 1156 | "\n", 1157 | " var * {name: string, balance: int, timestamp: datetime}\n", 1158 | " \n", 1159 | "Here `var` stands for \"variable length\" or generally \"a dimension to which we don't know a fixed size.\"" 1160 | ] 1161 | }, 1162 | { 1163 | "cell_type": "code", 1164 | "collapsed": false, 1165 | "input": [], 1166 | "language": "python", 1167 | "metadata": {}, 1168 | "outputs": [], 1169 | "prompt_number": 19 1170 | }, 1171 | { 
1172 | "cell_type": "code", 1173 | "collapsed": false, 1174 | "input": [], 1175 | "language": "python", 1176 | "metadata": {}, 1177 | "outputs": [], 1178 | "prompt_number": 19 1179 | }, 1180 | { 1181 | "cell_type": "markdown", 1182 | "metadata": {}, 1183 | "source": [ 1184 | "Note that you could also have built this table using raw SQLAlchemy code. Knowing datashape lets you skip learning many libraries like SQLAlchemy and H5Py for simple tasks. `into` serves as a single interface over many useful libraries." 1185 | ] 1186 | }, 1187 | { 1188 | "cell_type": "code", 1189 | "collapsed": false, 1190 | "input": [ 1191 | "import sqlalchemy as sa\n", 1192 | "\n", 1193 | "engine = sa.create_engine('sqlite:///data/my.db')\n", 1194 | "metadata = sa.MetaData(engine)\n", 1195 | "transactions = sa.Table('transactions2', metadata,\n", 1196 | " sa.Column('name', sa.String),\n", 1197 | " sa.Column('balance', sa.Integer),\n", 1198 | " sa.Column('timestamp', sa.DateTime))\n", 1199 | "transactions.create()" 1200 | ], 1201 | "language": "python", 1202 | "metadata": {}, 1203 | "outputs": [], 1204 | "prompt_number": 20 1205 | }, 1206 | { 1207 | "cell_type": "code", 1208 | "collapsed": false, 1209 | "input": [], 1210 | "language": "python", 1211 | "metadata": {}, 1212 | "outputs": [], 1213 | "prompt_number": 20 1214 | } 1215 | ], 1216 | "metadata": {} 1217 | } 1218 | ] 1219 | } --------------------------------------------------------------------------------