├── data ├── sample.hdf5 └── iris.csv ├── images ├── star.png ├── blaze_med.png ├── conversions.png └── star.dot ├── README.md ├── 06-Blaze-with-into.ipynb ├── 03-into-Design.ipynb ├── 05-Blaze-with-SQL.ipynb ├── 04-Blaze-Introduction.ipynb ├── 01-into-Introduction.ipynb ├── 00-Motivation.ipynb └── 02-into-Datatypes.ipynb /data/sample.hdf5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blaze/blaze-tutorial/HEAD/data/sample.hdf5 -------------------------------------------------------------------------------- /images/star.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blaze/blaze-tutorial/HEAD/images/star.png -------------------------------------------------------------------------------- /images/blaze_med.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blaze/blaze-tutorial/HEAD/images/blaze_med.png -------------------------------------------------------------------------------- /images/conversions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/blaze/blaze-tutorial/HEAD/images/conversions.png -------------------------------------------------------------------------------- /images/star.dot: -------------------------------------------------------------------------------- 1 | graph { 2 | overlap=False; 3 | splines=True; 4 | A -- Central 5 | B -- Central 6 | C -- Central 7 | D -- Central 8 | E -- Central 9 | F -- Central 10 | G -- Central 11 | H -- Central 12 | I -- Central 13 | J -- Central 14 | K -- Central 15 | } 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Blaze Tutorial 2 | ============== 3 | 4 | This repository contains notebooks and data for a 
tutorial for 5 | [Blaze](http://blaze.pydata.org), a library to compute on foreign data from 6 | within Python. 7 | 8 | This is a work in progress. 9 | 10 | Outline 11 | ------- 12 | 13 | 0. [Motivation](00-Motivation.ipynb) - ([nbviewer](http://nbviewer.ipython.org/github/mrocklin/blaze-tutorial/blob/master/00-Motivation.ipynb)) 14 | 15 | ### Into 16 | 17 | We present most Blaze fundamentals while discussing the simpler topic of data 18 | migration using the `into` project. 19 | 20 | 1. [Basics](01-into-Introduction.ipynb) - ([nbviewer](http://nbviewer.ipython.org/github/mrocklin/blaze-tutorial/blob/master/01-into-Introduction.ipynb)) 21 | 2. [Datatypes](02-into-Datatypes.ipynb) - ([nbviewer](http://nbviewer.ipython.org/github/mrocklin/blaze-tutorial/blob/master/02-into-Datatypes.ipynb)) 22 | 3. [Internal Design](03-into-Design.ipynb) - ([nbviewer](http://nbviewer.ipython.org/github/mrocklin/blaze-tutorial/blob/master/03-into-Design.ipynb)) 23 | 24 | 25 | ### Blaze Queries 26 | 27 | 1. [Basics](04-Blaze-Introduction.ipynb) - ([nbviewer](http://nbviewer.ipython.org/github/mrocklin/blaze-tutorial/blob/master/04-Blaze-Introduction.ipynb)) 28 | 2. [Databases](05-Blaze-with-SQL.ipynb) - ([nbviewer](http://nbviewer.ipython.org/github/mrocklin/blaze-tutorial/blob/master/05-Blaze-with-SQL.ipynb)) 29 | 3. 
[Storing Results](06-Blaze-with-into.ipynb) - ([nbviewer](http://nbviewer.ipython.org/github/mrocklin/blaze-tutorial/blob/master/06-Blaze-with-into.ipynb)) 30 | 31 | 32 | Downloads 33 | --------- 34 | 35 | This tutorial depends on the 36 | [Lahman Baseball statistics database](https://github.com/jknecht/baseball-archive-sqlite/raw/master/lahman2013.sqlite) 37 | -------------------------------------------------------------------------------- /06-Blaze-with-into.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:96746f6bc03e6fec9c102fd49c4dffc99ad3edefff7266618321c39d3d020abb" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "\"Blaze\n", 18 | "\n", 19 | "# Getting Started with Blaze\n", 20 | "\n", 21 | "* Full tutorial available at http://github.com/ContinuumIO/blaze-tutorial\n", 22 | "* Install software with `conda install -c blaze blaze`" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## 3. Storing results with `into`\n", 30 | "\n", 31 | "We just played with some interesting queries on baseball statistics" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "collapsed": false, 37 | "input": [ 38 | "from blaze import Data, into, by, join\n", 39 | "db = Data('sqlite:///data/lahman2013.sqlite')\n", 40 | "joined = join(db.Salaries, db.Teams)\n", 41 | "result = by(joined[['name', 'yearID']], avg=joined.salary.mean())\n", 42 | "result" 43 | ], 44 | "language": "python", 45 | "metadata": {}, 46 | "outputs": [] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "How do we now store this result or use it with other libraries?\n", 53 | "\n", 54 | "The result itself is a Blaze expression, not terribly useful if we're not using Blaze." 
55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "collapsed": false, 60 | "input": [ 61 | "type(result)" 62 | ], 63 | "language": "python", 64 | "metadata": {}, 65 | "outputs": [] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "### Use normal Python collections like `list` or `np.array`\n", 72 | "\n", 73 | "Blaze follows normal conventions and so can be converted by standard constructors" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "collapsed": false, 79 | "input": [ 80 | "list(result)" 81 | ], 82 | "language": "python", 83 | "metadata": {}, 84 | "outputs": [] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "collapsed": false, 89 | "input": [ 90 | "import numpy as np\n", 91 | "np.array(result)" 92 | ], 93 | "language": "python", 94 | "metadata": {}, 95 | "outputs": [] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "### Use `into`\n", 102 | "\n", 103 | "Alternatively, Blaze has registered itself with the `into` project and so can migrate its results to any of those formats." 
104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "collapsed": false, 109 | "input": [ 110 | "into('salaries.csv', result)" 111 | ], 112 | "language": "python", 113 | "metadata": {}, 114 | "outputs": [] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "collapsed": false, 119 | "input": [ 120 | "!head salaries.csv" 121 | ], 122 | "language": "python", 123 | "metadata": {}, 124 | "outputs": [] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "#### Exercise\n", 131 | "\n", 132 | "Dump `result` into the following formats" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "collapsed": false, 138 | "input": [ 139 | "# Dump result into a Python set\n" 140 | ], 141 | "language": "python", 142 | "metadata": {}, 143 | "outputs": [] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "collapsed": false, 148 | "input": [ 149 | "# Dump result into a Pandas DataFrame\n" 150 | ], 151 | "language": "python", 152 | "metadata": {}, 153 | "outputs": [] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "collapsed": false, 158 | "input": [ 159 | "# Dump result into a JSON file, inspect the file to make sure that it came out ok\n" 160 | ], 161 | "language": "python", 162 | "metadata": {}, 163 | "outputs": [] 164 | } 165 | ], 166 | "metadata": {} 167 | } 168 | ] 169 | } -------------------------------------------------------------------------------- /data/iris.csv: -------------------------------------------------------------------------------- 1 | sepal_length,sepal_width,petal_length,petal_width,species 2 | 5.1,3.5,1.4,0.2,Iris-setosa 3 | 4.9,3.0,1.4,0.2,Iris-setosa 4 | 4.7,3.2,1.3,0.2,Iris-setosa 5 | 4.6,3.1,1.5,0.2,Iris-setosa 6 | 5.0,3.6,1.4,0.2,Iris-setosa 7 | 5.4,3.9,1.7,0.4,Iris-setosa 8 | 4.6,3.4,1.4,0.3,Iris-setosa 9 | 5.0,3.4,1.5,0.2,Iris-setosa 10 | 4.4,2.9,1.4,0.2,Iris-setosa 11 | 4.9,3.1,1.5,0.1,Iris-setosa 12 | 5.4,3.7,1.5,0.2,Iris-setosa 13 | 4.8,3.4,1.6,0.2,Iris-setosa 14 | 4.8,3.0,1.4,0.1,Iris-setosa 15 | 
4.3,3.0,1.1,0.1,Iris-setosa 16 | 5.8,4.0,1.2,0.2,Iris-setosa 17 | 5.7,4.4,1.5,0.4,Iris-setosa 18 | 5.4,3.9,1.3,0.4,Iris-setosa 19 | 5.1,3.5,1.4,0.3,Iris-setosa 20 | 5.7,3.8,1.7,0.3,Iris-setosa 21 | 5.1,3.8,1.5,0.3,Iris-setosa 22 | 5.4,3.4,1.7,0.2,Iris-setosa 23 | 5.1,3.7,1.5,0.4,Iris-setosa 24 | 4.6,3.6,1.0,0.2,Iris-setosa 25 | 5.1,3.3,1.7,0.5,Iris-setosa 26 | 4.8,3.4,1.9,0.2,Iris-setosa 27 | 5.0,3.0,1.6,0.2,Iris-setosa 28 | 5.0,3.4,1.6,0.4,Iris-setosa 29 | 5.2,3.5,1.5,0.2,Iris-setosa 30 | 5.2,3.4,1.4,0.2,Iris-setosa 31 | 4.7,3.2,1.6,0.2,Iris-setosa 32 | 4.8,3.1,1.6,0.2,Iris-setosa 33 | 5.4,3.4,1.5,0.4,Iris-setosa 34 | 5.2,4.1,1.5,0.1,Iris-setosa 35 | 5.5,4.2,1.4,0.2,Iris-setosa 36 | 4.9,3.1,1.5,0.2,Iris-setosa 37 | 5.0,3.2,1.2,0.2,Iris-setosa 38 | 5.5,3.5,1.3,0.2,Iris-setosa 39 | 4.9,3.6,1.4,0.1,Iris-setosa 40 | 4.4,3.0,1.3,0.2,Iris-setosa 41 | 5.1,3.4,1.5,0.2,Iris-setosa 42 | 5.0,3.5,1.3,0.3,Iris-setosa 43 | 4.5,2.3,1.3,0.3,Iris-setosa 44 | 4.4,3.2,1.3,0.2,Iris-setosa 45 | 5.0,3.5,1.6,0.6,Iris-setosa 46 | 5.1,3.8,1.9,0.4,Iris-setosa 47 | 4.8,3.0,1.4,0.3,Iris-setosa 48 | 5.1,3.8,1.6,0.2,Iris-setosa 49 | 4.6,3.2,1.4,0.2,Iris-setosa 50 | 5.3,3.7,1.5,0.2,Iris-setosa 51 | 5.0,3.3,1.4,0.2,Iris-setosa 52 | 7.0,3.2,4.7,1.4,Iris-versicolor 53 | 6.4,3.2,4.5,1.5,Iris-versicolor 54 | 6.9,3.1,4.9,1.5,Iris-versicolor 55 | 5.5,2.3,4.0,1.3,Iris-versicolor 56 | 6.5,2.8,4.6,1.5,Iris-versicolor 57 | 5.7,2.8,4.5,1.3,Iris-versicolor 58 | 6.3,3.3,4.7,1.6,Iris-versicolor 59 | 4.9,2.4,3.3,1.0,Iris-versicolor 60 | 6.6,2.9,4.6,1.3,Iris-versicolor 61 | 5.2,2.7,3.9,1.4,Iris-versicolor 62 | 5.0,2.0,3.5,1.0,Iris-versicolor 63 | 5.9,3.0,4.2,1.5,Iris-versicolor 64 | 6.0,2.2,4.0,1.0,Iris-versicolor 65 | 6.1,2.9,4.7,1.4,Iris-versicolor 66 | 5.6,2.9,3.6,1.3,Iris-versicolor 67 | 6.7,3.1,4.4,1.4,Iris-versicolor 68 | 5.6,3.0,4.5,1.5,Iris-versicolor 69 | 5.8,2.7,4.1,1.0,Iris-versicolor 70 | 6.2,2.2,4.5,1.5,Iris-versicolor 71 | 5.6,2.5,3.9,1.1,Iris-versicolor 72 | 5.9,3.2,4.8,1.8,Iris-versicolor 73 | 
6.1,2.8,4.0,1.3,Iris-versicolor 74 | 6.3,2.5,4.9,1.5,Iris-versicolor 75 | 6.1,2.8,4.7,1.2,Iris-versicolor 76 | 6.4,2.9,4.3,1.3,Iris-versicolor 77 | 6.6,3.0,4.4,1.4,Iris-versicolor 78 | 6.8,2.8,4.8,1.4,Iris-versicolor 79 | 6.7,3.0,5.0,1.7,Iris-versicolor 80 | 6.0,2.9,4.5,1.5,Iris-versicolor 81 | 5.7,2.6,3.5,1.0,Iris-versicolor 82 | 5.5,2.4,3.8,1.1,Iris-versicolor 83 | 5.5,2.4,3.7,1.0,Iris-versicolor 84 | 5.8,2.7,3.9,1.2,Iris-versicolor 85 | 6.0,2.7,5.1,1.6,Iris-versicolor 86 | 5.4,3.0,4.5,1.5,Iris-versicolor 87 | 6.0,3.4,4.5,1.6,Iris-versicolor 88 | 6.7,3.1,4.7,1.5,Iris-versicolor 89 | 6.3,2.3,4.4,1.3,Iris-versicolor 90 | 5.6,3.0,4.1,1.3,Iris-versicolor 91 | 5.5,2.5,4.0,1.3,Iris-versicolor 92 | 5.5,2.6,4.4,1.2,Iris-versicolor 93 | 6.1,3.0,4.6,1.4,Iris-versicolor 94 | 5.8,2.6,4.0,1.2,Iris-versicolor 95 | 5.0,2.3,3.3,1.0,Iris-versicolor 96 | 5.6,2.7,4.2,1.3,Iris-versicolor 97 | 5.7,3.0,4.2,1.2,Iris-versicolor 98 | 5.7,2.9,4.2,1.3,Iris-versicolor 99 | 6.2,2.9,4.3,1.3,Iris-versicolor 100 | 5.1,2.5,3.0,1.1,Iris-versicolor 101 | 5.7,2.8,4.1,1.3,Iris-versicolor 102 | 6.3,3.3,6.0,2.5,Iris-virginica 103 | 5.8,2.7,5.1,1.9,Iris-virginica 104 | 7.1,3.0,5.9,2.1,Iris-virginica 105 | 6.3,2.9,5.6,1.8,Iris-virginica 106 | 6.5,3.0,5.8,2.2,Iris-virginica 107 | 7.6,3.0,6.6,2.1,Iris-virginica 108 | 4.9,2.5,4.5,1.7,Iris-virginica 109 | 7.3,2.9,6.3,1.8,Iris-virginica 110 | 6.7,2.5,5.8,1.8,Iris-virginica 111 | 7.2,3.6,6.1,2.5,Iris-virginica 112 | 6.5,3.2,5.1,2.0,Iris-virginica 113 | 6.4,2.7,5.3,1.9,Iris-virginica 114 | 6.8,3.0,5.5,2.1,Iris-virginica 115 | 5.7,2.5,5.0,2.0,Iris-virginica 116 | 5.8,2.8,5.1,2.4,Iris-virginica 117 | 6.4,3.2,5.3,2.3,Iris-virginica 118 | 6.5,3.0,5.5,1.8,Iris-virginica 119 | 7.7,3.8,6.7,2.2,Iris-virginica 120 | 7.7,2.6,6.9,2.3,Iris-virginica 121 | 6.0,2.2,5.0,1.5,Iris-virginica 122 | 6.9,3.2,5.7,2.3,Iris-virginica 123 | 5.6,2.8,4.9,2.0,Iris-virginica 124 | 7.7,2.8,6.7,2.0,Iris-virginica 125 | 6.3,2.7,4.9,1.8,Iris-virginica 126 | 6.7,3.3,5.7,2.1,Iris-virginica 127 
| 7.2,3.2,6.0,1.8,Iris-virginica 128 | 6.2,2.8,4.8,1.8,Iris-virginica 129 | 6.1,3.0,4.9,1.8,Iris-virginica 130 | 6.4,2.8,5.6,2.1,Iris-virginica 131 | 7.2,3.0,5.8,1.6,Iris-virginica 132 | 7.4,2.8,6.1,1.9,Iris-virginica 133 | 7.9,3.8,6.4,2.0,Iris-virginica 134 | 6.4,2.8,5.6,2.2,Iris-virginica 135 | 6.3,2.8,5.1,1.5,Iris-virginica 136 | 6.1,2.6,5.6,1.4,Iris-virginica 137 | 7.7,3.0,6.1,2.3,Iris-virginica 138 | 6.3,3.4,5.6,2.4,Iris-virginica 139 | 6.4,3.1,5.5,1.8,Iris-virginica 140 | 6.0,3.0,4.8,1.8,Iris-virginica 141 | 6.9,3.1,5.4,2.1,Iris-virginica 142 | 6.7,3.1,5.6,2.4,Iris-virginica 143 | 6.9,3.1,5.1,2.3,Iris-virginica 144 | 5.8,2.7,5.1,1.9,Iris-virginica 145 | 6.8,3.2,5.9,2.3,Iris-virginica 146 | 6.7,3.3,5.7,2.5,Iris-virginica 147 | 6.7,3.0,5.2,2.3,Iris-virginica 148 | 6.3,2.5,5.0,1.9,Iris-virginica 149 | 6.5,3.0,5.2,2.0,Iris-virginica 150 | 6.2,3.4,5.4,2.3,Iris-virginica 151 | 5.9,3.0,5.1,1.8,Iris-virginica 152 | -------------------------------------------------------------------------------- /03-into-Design.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:ecd1889f30dd086fa191899741e3d2e3f75ef8a3fb15f463c88c0def7e579e3f" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "\"Blaze\n", 18 | "\n", 19 | "# Getting started with `into`\n", 20 | "\n", 21 | "* Full tutorial available at http://github.com/ContinuumIO/blaze-tutorial\n", 22 | "* Install software with `conda install -c blaze blaze`" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## 3 Design\n", 30 | "\n", 31 | "Into is a network of pair-wise conversions\n", 32 | "\n", 33 | "\"into\n" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "Nodes represent data formats. 
Arrows represent functions that migrate data from one format to another. Red nodes are possibly larger than memory.\n", 41 | "\n", 42 | "This differs from most data migration systems, which always migrate data through a common format.\n", 43 | "\n", 44 | "![](images/star.png)" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "This common strategy is simpler in design and easier to get right, so why does `into` use a more complex design?" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "### Some edges are very fast\n", 59 | "\n", 60 | "For example, transforming an `np.ndarray` to a `pd.Series` or a `list` to an `Iterator` only requires us to manipulate metadata, which can be done quickly; the bytes of data remain untouched. In many cases, transferring data through a common format can be much slower.\n", 61 | "\n", 62 | "For example, consider `CSV -> SQL` migration. Using Python iterators as a common central format, we're bound by SQLAlchemy's insertion code, which maxes out at around 2000 records per second. Using CSV loaders native to the database (e.g. PostgreSQL Copy), we achieve more than 50000 records per second, turning an overnight task into 20 minutes.\n", 63 | "\n", 64 | "Efficient data migration *is intrinsically messy* in practice. The `into` graph reflects this mess." 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "### How does `into` use this graph?\n", 72 | "\n", 73 | "When you convert one dataset into another, Into walks the graph above and finds the minimum-cost path. Each edge corresponds to a single function, and each edge is weighted according to relative cost. For example, if we transform a CSV file into a `set`, we can see the stages through which our data moves." 
74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "collapsed": false, 79 | "input": [ 80 | "from into import into, convert, append, CSV\n", 81 | "path = convert.path(CSV, set)\n", 82 | "\n", 83 | "for source, target, func in path:\n", 84 | " print '%25s -> %-25s via %s()' % (source.__name__, target.__name__, func.__name__)" 85 | ], 86 | "language": "python", 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "output_type": "stream", 91 | "stream": "stdout", 92 | "text": [ 93 | " CSV -> Chunks_pandas_DataFrame via CSV_to_chunks_of_dataframes()\n", 94 | " Chunks_pandas_DataFrame -> Iterator via numpy_chunks_to_iterator()\n", 95 | " Iterator -> list via iterator_to_list()\n", 96 | " list -> set via iterable_to_set()\n" 97 | ] 98 | } 99 | ], 100 | "prompt_number": 1 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "### Chunks of in-memory data\n", 107 | "\n", 108 | "The red nodes in our graph represent data that might be larger than memory. When we migrate between two red nodes, we restrict ourselves to the subgraph of red nodes so that we never exhaust memory.\n", 109 | "\n", 110 | "Yet we still want to use NumPy and Pandas in these migrations (they're very helpful intermediaries), so we partition our data into a sequence of chunks such that each chunk fits in memory. We describe this data with parametrized types like `chunks(np.ndarray)` or `chunks(pd.DataFrame)`." 
111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "### `into` = `convert` + `append`\n", 118 | "\n", 119 | "Recall the two modes of `into`" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "collapsed": false, 125 | "input": [ 126 | "# Given type: Convert source to new object\n", 127 | "into(list, (1, 2, 3))" 128 | ], 129 | "language": "python", 130 | "metadata": {}, 131 | "outputs": [ 132 | { 133 | "metadata": {}, 134 | "output_type": "pyout", 135 | "prompt_number": 2, 136 | "text": [ 137 | "[1, 2, 3]" 138 | ] 139 | } 140 | ], 141 | "prompt_number": 2 142 | }, 143 | { 144 | "cell_type": "code", 145 | "collapsed": false, 146 | "input": [ 147 | "# Given object: Append source to that object\n", 148 | "L = [1, 2, 3]\n", 149 | "into(L, (4, 5, 6))\n", 150 | "L" 151 | ], 152 | "language": "python", 153 | "metadata": {}, 154 | "outputs": [ 155 | { 156 | "metadata": {}, 157 | "output_type": "pyout", 158 | "prompt_number": 3, 159 | "text": [ 160 | "[1, 2, 3, 4, 5, 6]" 161 | ] 162 | } 163 | ], 164 | "prompt_number": 3 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "These two modes are actually separate functions, `convert` and `append`. Into uses both depending on the situation. A single `into` call may engage both functions.\n", 171 | "\n", 172 | "You should use `into` by default; it performs some additional checks. For the sake of explicitness, however, here are examples using `convert` and `append` directly." 
173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "collapsed": false, 178 | "input": [ 179 | "from into import convert, append\n", 180 | "\n", 181 | "convert(list, (1, 2, 3))" 182 | ], 183 | "language": "python", 184 | "metadata": {}, 185 | "outputs": [ 186 | { 187 | "metadata": {}, 188 | "output_type": "pyout", 189 | "prompt_number": 1, 190 | "text": [ 191 | "[1, 2, 3]" 192 | ] 193 | } 194 | ], 195 | "prompt_number": 1 196 | }, 197 | { 198 | "cell_type": "code", 199 | "collapsed": false, 200 | "input": [ 201 | "L = [1, 2, 3]\n", 202 | "append(L, (4, 5, 6))" 203 | ], 204 | "language": "python", 205 | "metadata": {}, 206 | "outputs": [ 207 | { 208 | "metadata": {}, 209 | "output_type": "pyout", 210 | "prompt_number": 3, 211 | "text": [ 212 | "[1, 2, 3, 4, 5, 6]" 213 | ] 214 | } 215 | ], 216 | "prompt_number": 3 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "### How to extend `into`\n", 223 | "\n", 224 | "When we encounter new data formats we may wish to connect them to the `into` graph. 
We do this by implementing new versions of `discover`, `convert`, and `append` (if we support appending).\n", 225 | "\n", 226 | "We register new implementations of an operation like convert by creating a new Python function and decorating it with types and a cost.\n", 227 | "\n", 228 | "#### Example\n", 229 | "\n", 230 | "Here we define how to convert from a DataFrame to a NumPy array" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "collapsed": false, 236 | "input": [ 237 | "import numpy as np\n", 238 | "import pandas as pd\n", 239 | "\n", 240 | "# target type, source data, cost\n", 241 | "@convert.register(np.ndarray, pd.DataFrame, cost=1.0)\n", 242 | "def dataframe_to_numpy(df, **kwargs):\n", 243 | " return df.to_records(index=False)" 244 | ], 245 | "language": "python", 246 | "metadata": {}, 247 | "outputs": [], 248 | "prompt_number": 4 249 | } 250 | ], 251 | "metadata": {} 252 | } 253 | ] 254 | } -------------------------------------------------------------------------------- /05-Blaze-with-SQL.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:d66bda4ad596e7d1038472946b691f727760aa36c6a5bcb7a9e63f8067624882" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "\"Blaze\n", 18 | "\n", 19 | "# Getting Started with Blaze\n", 20 | "\n", 21 | "* Full tutorial available at http://github.com/ContinuumIO/blaze-tutorial\n", 22 | "* Install software with `conda install -c blaze blaze`" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## 2. Using Blaze with Foreign Systems\n", 30 | "\n", 31 | "Blaze gives us an interactive experience over data living in foreign systems, like a SQL database." 
32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "collapsed": false, 37 | "input": [ 38 | "from blaze import Data, join, by, transform\n", 39 | "\n", 40 | "db = Data('sqlite:///data/lahman2013.sqlite')\n", 41 | "db.Teams" 42 | ], 43 | "language": "python", 44 | "metadata": {}, 45 | "outputs": [ 46 | { 47 | "metadata": {}, 48 | "output_type": "pyout", 49 | "prompt_number": 2, 50 | "text": [ 51 | " yearID lgID teamID franchID divID Rank G Ghome W L ... \\\n", 52 | "0 1871 NA BS1 BNA None 3 31 NaN 20 10 ... \n", 53 | "1 1871 NA CH1 CNA None 2 28 NaN 19 9 ... \n", 54 | "2 1871 NA CL1 CFC None 8 29 NaN 10 19 ... \n", 55 | "3 1871 NA FW1 KEK None 7 19 NaN 7 12 ... \n", 56 | "4 1871 NA NY2 NNA None 5 33 NaN 16 17 ... \n", 57 | "5 1871 NA PH1 PNA None 1 28 NaN 21 7 ... \n", 58 | "6 1871 NA RC1 ROK None 9 25 NaN 4 21 ... \n", 59 | "7 1871 NA TRO TRO None 6 29 NaN 13 15 ... \n", 60 | "8 1871 NA WS3 OLY None 4 32 NaN 15 15 ... \n", 61 | "9 1872 NA BL1 BLC None 2 58 NaN 35 19 ... \n", 62 | "10 1872 NA BR1 ECK None 9 29 NaN 3 26 ... 
\n", 63 | "\n", 64 | " DP FP name park \\\n", 65 | "0 NaN 0.83 Boston Red Stockings South End Grounds I \n", 66 | "1 NaN 0.82 Chicago White Stockings Union Base-Ball Grounds \n", 67 | "2 NaN 0.81 Cleveland Forest Citys National Association Grounds \n", 68 | "3 NaN 0.80 Fort Wayne Kekiongas Hamilton Field \n", 69 | "4 NaN 0.83 New York Mutuals Union Grounds (Brooklyn) \n", 70 | "5 NaN 0.84 Philadelphia Athletics Jefferson Street Grounds \n", 71 | "6 NaN 0.82 Rockford Forest Citys Agricultural Society Fair Grounds \n", 72 | "7 NaN 0.84 Troy Haymakers Haymakers' Grounds \n", 73 | "8 NaN 0.85 Washington Olympics Olympics Grounds \n", 74 | "9 NaN 0.82 Baltimore Canaries Newington Park \n", 75 | "10 NaN 0.79 Brooklyn Eckfords Union Grounds \n", 76 | "\n", 77 | " attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro \n", 78 | "0 NaN 103 98 BOS BS1 BS1 \n", 79 | "1 NaN 104 102 CHI CH1 CH1 \n", 80 | "2 NaN 96 100 CLE CL1 CL1 \n", 81 | "3 NaN 101 107 KEK FW1 FW1 \n", 82 | "4 NaN 90 88 NYU NY2 NY2 \n", 83 | "5 NaN 102 98 ATH PH1 PH1 \n", 84 | "6 NaN 97 99 ROK RC1 RC1 \n", 85 | "7 NaN 101 100 TRO TRO TRO \n", 86 | "8 NaN 94 98 OLY WS3 WS3 \n", 87 | "9 NaN 106 102 BAL BL1 BL1 \n", 88 | "10 NaN 87 96 ECK BR1 BR1 \n", 89 | "\n", 90 | "..." 
91 | ] 92 | } 93 | ], 94 | "prompt_number": 2 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "We can use all of our standard queries" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "collapsed": false, 106 | "input": [ 107 | "db.Teams[db.Teams.name == 'Chicago White Stockings'][['name', 'yearID', 'park']]" 108 | ], 109 | "language": "python", 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "metadata": {}, 114 | "output_type": "pyout", 115 | "prompt_number": 5, 116 | "text": [ 117 | " name yearID park\n", 118 | "0 Chicago White Stockings 1871 Union Base-Ball Grounds\n", 119 | "1 Chicago White Stockings 1874 23rd Street Grounds\n", 120 | "2 Chicago White Stockings 1875 23rd Street Grounds\n", 121 | "3 Chicago White Stockings 1876 23rd Street Grounds\n", 122 | "4 Chicago White Stockings 1877 23rd Street Grounds\n", 123 | "5 Chicago White Stockings 1878 Lake Front Park I\n", 124 | "6 Chicago White Stockings 1879 Lake Front Park I\n", 125 | "7 Chicago White Stockings 1880 Lake Front Park I\n", 126 | "8 Chicago White Stockings 1881 Lake Front Park I\n", 127 | "9 Chicago White Stockings 1882 Lake Front Park I/Lake Front Park II\n", 128 | "..." 
129 | ] 130 | } 131 | ], 132 | "prompt_number": 5 133 | }, 134 | { 135 | "cell_type": "code", 136 | "collapsed": false, 137 | "input": [ 138 | "by(db.Teams.name, start_year=db.Teams.yearID.min(), \n", 139 | " end_year=db.Teams.yearID.max()).sort('end_year', ascending=False)" 140 | ], 141 | "language": "python", 142 | "metadata": {}, 143 | "outputs": [ 144 | { 145 | "metadata": {}, 146 | "output_type": "pyout", 147 | "prompt_number": 6, 148 | "text": [ 149 | " name end_year start_year\n", 150 | "0 Arizona Diamondbacks 2013 1998\n", 151 | "1 Atlanta Braves 2013 1966\n", 152 | "2 Baltimore Orioles 2013 1882\n", 153 | "3 Boston Red Sox 2013 1908\n", 154 | "4 Chicago Cubs 2013 1903\n", 155 | "5 Chicago White Sox 2013 1901\n", 156 | "6 Cincinnati Reds 2013 1876\n", 157 | "7 Cleveland Indians 2013 1915\n", 158 | "8 Colorado Rockies 2013 1993\n", 159 | "9 Detroit Tigers 2013 1901\n", 160 | "..." 161 | ] 162 | } 163 | ], 164 | "prompt_number": 6 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "### These queries happen in the database\n", 171 | "\n", 172 | "Note that the Blaze software doesn't perform computations; the database does. Blaze is just a friendly intermediary between you and your database.\n", 173 | "\n", 174 | "Here we see what the generated SQL looks like. You generally don't need to do this at home." 
175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "collapsed": false, 180 | "input": [ 181 | "from blaze import compute # compute is internal API, not intended for users\n", 182 | "\n", 183 | "expr = db.Teams[db.Teams.name == 'Chicago White Stockings'][['name', 'yearID', 'park']]\n", 184 | "print compute(expr, post_compute=False)" 185 | ], 186 | "language": "python", 187 | "metadata": {}, 188 | "outputs": [ 189 | { 190 | "output_type": "stream", 191 | "stream": "stdout", 192 | "text": [ 193 | "SELECT \"Teams\".name, \"Teams\".\"yearID\", \"Teams\".park \n", 194 | "FROM \"Teams\" \n", 195 | "WHERE \"Teams\".name = ?\n" 196 | ] 197 | } 198 | ], 199 | "prompt_number": 7 200 | }, 201 | { 202 | "cell_type": "code", 203 | "collapsed": false, 204 | "input": [ 205 | "expr = by(db.Teams.name, start_year=db.Teams.yearID.min(), \n", 206 | " end_year=db.Teams.yearID.max()).sort('end_year', ascending=False)\n", 207 | "print compute(expr, post_compute=False)" 208 | ], 209 | "language": "python", 210 | "metadata": {}, 211 | "outputs": [ 212 | { 213 | "output_type": "stream", 214 | "stream": "stdout", 215 | "text": [ 216 | "SELECT \"Teams\".name, max(\"Teams\".\"yearID\") AS end_year, min(\"Teams\".\"yearID\") AS start_year \n", 217 | "FROM \"Teams\" GROUP BY \"Teams\".name ORDER BY end_year DESC\n" 218 | ] 219 | } 220 | ], 221 | "prompt_number": 8 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "### Exercises\n", 228 | "\n", 229 | "First, look at the `db.dshape` to get an idea of what we have in the database. This will tell you what tables are in the database as well as what columns we have in each table." 
230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "collapsed": false, 235 | "input": [ 236 | "db.dshape" 237 | ], 238 | "language": "python", 239 | "metadata": {}, 240 | "outputs": [] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "Play for a while with these tables, inspecting data and writing a few queries that might interest you.\n", 247 | "\n", 248 | "If you're having trouble finding interesting queries, you might try the following:\n", 249 | "\n", 250 | "* How many unique pitchers were there in each year?\n", 251 | "* In what years did Hank Aaron play? When was his last year? (Hank has the playerID `aaronha01`)\n", 252 | "* What was the average salary per year? Has it increased or decreased over time?" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "collapsed": false, 258 | "input": [], 259 | "language": "python", 260 | "metadata": {}, 261 | "outputs": [] 262 | } 263 | ], 264 | "metadata": {} 265 | } 266 | ] 267 | } -------------------------------------------------------------------------------- /04-Blaze-Introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:d825a277965c84a256612bb6ff63f606b5bba263c90ac5407506f08370ba339b" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "\"Blaze\n", 18 | "\n", 19 | "# Getting Started with Blaze\n", 20 | "\n", 21 | "* Full tutorial available at http://github.com/ContinuumIO/blaze-tutorial\n", 22 | "* Install software with `conda install -c blaze blaze`" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## 1. Basic Queries\n", 30 | "\n", 31 | "For basic tabular queries, Blaze shares the same syntax as Pandas." 
32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "collapsed": false, 37 | "input": [ 38 | "from blaze import Data, by, join, transform" 39 | ], 40 | "language": "python", 41 | "metadata": {}, 42 | "outputs": [], 43 | "prompt_number": 1 44 | }, 45 | { 46 | "cell_type": "code", 47 | "collapsed": false, 48 | "input": [ 49 | "bank = Data([[1, 'Alice', 100],\n", 50 | " [2, 'Bob', -200],\n", 51 | " [3, 'Charlie', 300],\n", 52 | " [4, 'Dennis', 400],\n", 53 | " [5, 'Edith', -500]], columns=['id', 'name', 'amount'])" 54 | ], 55 | "language": "python", 56 | "metadata": {}, 57 | "outputs": [], 58 | "prompt_number": 2 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "### Arithmetic and Reductions" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "collapsed": false, 70 | "input": [ 71 | "bank.amount" 72 | ], 73 | "language": "python", 74 | "metadata": {}, 75 | "outputs": [ 76 | { 77 | "metadata": {}, 78 | "output_type": "pyout", 79 | "prompt_number": 3, 80 | "text": [ 81 | " amount\n", 82 | "0 100\n", 83 | "1 -200\n", 84 | "2 300\n", 85 | "3 400\n", 86 | "4 -500" 87 | ] 88 | } 89 | ], 90 | "prompt_number": 3 91 | }, 92 | { 93 | "cell_type": "code", 94 | "collapsed": false, 95 | "input": [ 96 | "bank.amount / 100" 97 | ], 98 | "language": "python", 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "metadata": {}, 103 | "output_type": "pyout", 104 | "prompt_number": 4, 105 | "text": [ 106 | " amount\n", 107 | "0 1\n", 108 | "1 -2\n", 109 | "2 3\n", 110 | "3 4\n", 111 | "4 -5" 112 | ] 113 | } 114 | ], 115 | "prompt_number": 4 116 | }, 117 | { 118 | "cell_type": "code", 119 | "collapsed": false, 120 | "input": [ 121 | "(bank.amount / 100).mean()" 122 | ], 123 | "language": "python", 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "metadata": {}, 128 | "output_type": "pyout", 129 | "prompt_number": 5, 130 | "text": [ 131 | "0.2" 132 | ] 133 | } 134 | ], 135 | "prompt_number": 5 136 | }, 137 | { 138 | "cell_type": "markdown", 
139 | "metadata": {}, 140 | "source": [ 141 | "### Multiple columns and sorting" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "collapsed": false, 147 | "input": [ 148 | "bank[['name', 'amount']].sort('amount')" 149 | ], 150 | "language": "python", 151 | "metadata": {}, 152 | "outputs": [] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "### Selections\n", 159 | "\n", 160 | "We select subsets of data by indexing one expression with another" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "collapsed": false, 166 | "input": [ 167 | "bank[bank.amount < 0]" 168 | ], 169 | "language": "python", 170 | "metadata": {}, 171 | "outputs": [ 172 | { 173 | "metadata": {}, 174 | "output_type": "pyout", 175 | "prompt_number": 6, 176 | "text": [ 177 | " id name amount\n", 178 | "0 2 Bob -200\n", 179 | "1 5 Edith -500" 180 | ] 181 | } 182 | ], 183 | "prompt_number": 6 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "### Combining Operations\n", 190 | "\n", 191 | "We can combine these sorts of operations with each other" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "collapsed": false, 197 | "input": [ 198 | "bank[bank.amount < 0].amount / 100" 199 | ], 200 | "language": "python", 201 | "metadata": {}, 202 | "outputs": [ 203 | { 204 | "metadata": {}, 205 | "output_type": "pyout", 206 | "prompt_number": 7, 207 | "text": [ 208 | " amount\n", 209 | "0 -2\n", 210 | "1 -5" 211 | ] 212 | } 213 | ], 214 | "prompt_number": 7 215 | }, 216 | { 217 | "cell_type": "code", 218 | "collapsed": false, 219 | "input": [ 220 | "bank[bank.amount < 0].name" 221 | ], 222 | "language": "python", 223 | "metadata": {}, 224 | "outputs": [ 225 | { 226 | "metadata": {}, 227 | "output_type": "pyout", 228 | "prompt_number": 8, 229 | "text": [ 230 | " name\n", 231 | "0 Bob\n", 232 | "1 Edith" 233 | ] 234 | } 235 | ], 236 | "prompt_number": 8 237 | }, 238 | { 239 | "cell_type": "markdown", 
240 | "metadata": {}, 241 | "source": [ 242 | "#### Exercises\n", 243 | "\n", 244 | "Write expressions to answer the following questions" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "collapsed": false, 250 | "input": [ 251 | "# What are the IDs of everyone with a positive amount?\n" 252 | ], 253 | "language": "python", 254 | "metadata": {}, 255 | "outputs": [], 256 | "prompt_number": 9 257 | }, 258 | { 259 | "cell_type": "code", 260 | "collapsed": false, 261 | "input": [ 262 | "# What is the name of the person with amount 400?\n" 263 | ], 264 | "language": "python", 265 | "metadata": {}, 266 | "outputs": [], 267 | "prompt_number": 10 268 | }, 269 | { 270 | "cell_type": "code", 271 | "collapsed": false, 272 | "input": [ 273 | "# What is the difference between the minimum and maximum amounts?\n" 274 | ], 275 | "language": "python", 276 | "metadata": {}, 277 | "outputs": [], 278 | "prompt_number": 11 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": {}, 283 | "source": [ 284 | "## 2. More complex queries\n", 285 | "\n" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "First, we need a more interesting dataset. We open the standard *iris* dataset, a table of 150 measurements of flowers in the iris genus. We find this dataset in a CSV file in the `data/` directory. 
" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "collapsed": false, 298 | "input": [ 299 | "iris = Data('data/iris.csv')\n", 300 | "iris" 301 | ], 302 | "language": "python", 303 | "metadata": {}, 304 | "outputs": [ 305 | { 306 | "html": [ 307 | "\n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | "
" 409 | ], 410 | "metadata": {}, 411 | "output_type": "pyout", 412 | "prompt_number": 12, 413 | "text": [ 414 | " sepal_length sepal_width petal_length petal_width species\n", 415 | "0 5.1 3.5 1.4 0.2 Iris-setosa\n", 416 | "1 4.9 3.0 1.4 0.2 Iris-setosa\n", 417 | "2 4.7 3.2 1.3 0.2 Iris-setosa\n", 418 | "3 4.6 3.1 1.5 0.2 Iris-setosa\n", 419 | "4 5.0 3.6 1.4 0.2 Iris-setosa\n", 420 | "5 5.4 3.9 1.7 0.4 Iris-setosa\n", 421 | "6 4.6 3.4 1.4 0.3 Iris-setosa\n", 422 | "7 5.0 3.4 1.5 0.2 Iris-setosa\n", 423 | "8 4.4 2.9 1.4 0.2 Iris-setosa\n", 424 | "9 4.9 3.1 1.5 0.1 Iris-setosa\n", 425 | "..." 426 | ] 427 | } 428 | ], 429 | "prompt_number": 12 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": {}, 434 | "source": [ 435 | "The `blaze.Data` function operates on all of the file types that we saw in the previous sections on `into`. Blaze expressions use functions like `discover` to get datashapes that help them interact with you." 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "collapsed": false, 441 | "input": [ 442 | "iris.dshape" 443 | ], 444 | "language": "python", 445 | "metadata": {}, 446 | "outputs": [ 447 | { 448 | "metadata": {}, 449 | "output_type": "pyout", 450 | "prompt_number": 13, 451 | "text": [ 452 | "dshape(\"\"\"var * {\n", 453 | " sepal_length: float64,\n", 454 | " sepal_width: float64,\n", 455 | " petal_length: float64,\n", 456 | " petal_width: float64,\n", 457 | " species: string\n", 458 | " }\"\"\")" 459 | ] 460 | } 461 | ], 462 | "prompt_number": 13 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "metadata": {}, 467 | "source": [ 468 | "### Distinct\n", 469 | "\n", 470 | "Now some more queries. 
Distinct finds unique entries" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "collapsed": false, 476 | "input": [ 477 | "iris.species.distinct()" 478 | ], 479 | "language": "python", 480 | "metadata": {}, 481 | "outputs": [ 482 | { 483 | "metadata": {}, 484 | "output_type": "pyout", 485 | "prompt_number": 14, 486 | "text": [ 487 | " species\n", 488 | "0 Iris-setosa\n", 489 | "1 Iris-versicolor\n", 490 | "2 Iris-virginica" 491 | ] 492 | } 493 | ], 494 | "prompt_number": 14 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "metadata": {}, 499 | "source": [ 500 | "Or count the number of distinct entries" 501 | ] 502 | }, 503 | { 504 | "cell_type": "code", 505 | "collapsed": false, 506 | "input": [ 507 | "iris.species.nunique()" 508 | ], 509 | "language": "python", 510 | "metadata": {}, 511 | "outputs": [ 512 | { 513 | "metadata": {}, 514 | "output_type": "pyout", 515 | "prompt_number": 15, 516 | "text": [ 517 | "3" 518 | ] 519 | } 520 | ], 521 | "prompt_number": 15 522 | }, 523 | { 524 | "cell_type": "code", 525 | "collapsed": false, 526 | "input": [ 527 | "iris.sepal_length.nunique()" 528 | ], 529 | "language": "python", 530 | "metadata": {}, 531 | "outputs": [ 532 | { 533 | "metadata": {}, 534 | "output_type": "pyout", 535 | "prompt_number": 16, 536 | "text": [ 537 | "35" 538 | ] 539 | } 540 | ], 541 | "prompt_number": 16 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": {}, 546 | "source": [ 547 | "### Transform\n", 548 | "\n", 549 | "Transform adds new columns based on old ones" 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "collapsed": false, 555 | "input": [ 556 | "transform(iris, sepal_ratio=iris.sepal_length / iris.sepal_width,\n", 557 | " petal_ratio=iris.petal_length / iris.petal_width)" 558 | ], 559 | "language": "python", 560 | "metadata": {}, 561 | "outputs": [ 562 | { 563 | "metadata": {}, 564 | "output_type": "pyout", 565 | "prompt_number": 17, 566 | "text": [ 567 | " sepal_length sepal_width petal_length 
petal_width species \\\n", 568 | "0 5.1 3.5 1.4 0.2 Iris-setosa \n", 569 | "1 4.9 3.0 1.4 0.2 Iris-setosa \n", 570 | "2 4.7 3.2 1.3 0.2 Iris-setosa \n", 571 | "3 4.6 3.1 1.5 0.2 Iris-setosa \n", 572 | "4 5.0 3.6 1.4 0.2 Iris-setosa \n", 573 | "5 5.4 3.9 1.7 0.4 Iris-setosa \n", 574 | "6 4.6 3.4 1.4 0.3 Iris-setosa \n", 575 | "7 5.0 3.4 1.5 0.2 Iris-setosa \n", 576 | "8 4.4 2.9 1.4 0.2 Iris-setosa \n", 577 | "9 4.9 3.1 1.5 0.1 Iris-setosa \n", 578 | "10 5.4 3.7 1.5 0.2 Iris-setosa \n", 579 | "\n", 580 | " sepal_ratio petal_ratio \n", 581 | "0 1.457143 7.000000 \n", 582 | "1 1.633333 7.000000 \n", 583 | "2 1.468750 6.500000 \n", 584 | "3 1.483871 7.500000 \n", 585 | "4 1.388889 7.000000 \n", 586 | "5 1.384615 4.250000 \n", 587 | "6 1.352941 4.666667 \n", 588 | "7 1.470588 7.500000 \n", 589 | "8 1.517241 7.000000 \n", 590 | "9 1.580645 15.000000 \n", 591 | "..." 592 | ] 593 | } 594 | ], 595 | "prompt_number": 17 596 | }, 597 | { 598 | "cell_type": "markdown", 599 | "metadata": {}, 600 | "source": [ 601 | "### Split-apply-combine -- `by`\n", 602 | "\n", 603 | "Split-apply-combine queries, also known as Group-By, split the table into many groups and then do a reduction on each group. We express these queries in blaze with the `by` operator\n", 604 | "\n", 605 | " by(column-on-which-to-split, result_name=reduction_on_group())" 606 | ] 607 | }, 608 | { 609 | "cell_type": "markdown", 610 | "metadata": {}, 611 | "source": [ 612 | "How many measurements do we have per species?" 
613 | ] 614 | }, 615 | { 616 | "cell_type": "code", 617 | "collapsed": false, 618 | "input": [ 619 | "by(iris.species, count=iris.species.count())" 620 | ], 621 | "language": "python", 622 | "metadata": {}, 623 | "outputs": [ 624 | { 625 | "metadata": {}, 626 | "output_type": "pyout", 627 | "prompt_number": 18, 628 | "text": [ 629 | " species count\n", 630 | "0 Iris-setosa 50\n", 631 | "1 Iris-versicolor 50\n", 632 | "2 Iris-virginica 50" 633 | ] 634 | } 635 | ], 636 | "prompt_number": 18 637 | }, 638 | { 639 | "cell_type": "markdown", 640 | "metadata": {}, 641 | "source": [ 642 | "How many measurements do we have per species and what is the longest petal length per species?" 643 | ] 644 | }, 645 | { 646 | "cell_type": "code", 647 | "collapsed": false, 648 | "input": [ 649 | "by(iris.species, count=iris.species.count(), \n", 650 | " longest_petal=iris.petal_length.max())" 651 | ], 652 | "language": "python", 653 | "metadata": {}, 654 | "outputs": [ 655 | { 656 | "metadata": {}, 657 | "output_type": "pyout", 658 | "prompt_number": 19, 659 | "text": [ 660 | " species count longest_petal\n", 661 | "0 Iris-setosa 50 1.9\n", 662 | "1 Iris-versicolor 50 5.1\n", 663 | "2 Iris-virginica 50 6.9" 664 | ] 665 | } 666 | ], 667 | "prompt_number": 19 668 | }, 669 | { 670 | "cell_type": "markdown", 671 | "metadata": {}, 672 | "source": [ 673 | "#### Exercise\n", 674 | "\n", 675 | "Write queries to answer the following questions" 676 | ] 677 | }, 678 | { 679 | "cell_type": "code", 680 | "collapsed": false, 681 | "input": [ 682 | "# What are the longest and shortest sepal_lengths per species?\n" 683 | ], 684 | "language": "python", 685 | "metadata": {}, 686 | "outputs": [], 687 | "prompt_number": 20 688 | }, 689 | { 690 | "cell_type": "code", 691 | "collapsed": false, 692 | "input": [ 693 | "# What is the difference of longest to shortest sepal length per species\n" 694 | ], 695 | "language": "python", 696 | "metadata": {}, 697 | "outputs": [], 698 | "prompt_number": 21 699 | }, 
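The two exercises above can be cross-checked with plain pandas, which (as noted below) Blaze itself uses for small CSV files. This is only a hedged sketch: it uses a tiny inline stand-in table with illustrative values rather than the real `data/iris.csv`, so it runs on its own.

```python
import pandas as pd

# Tiny stand-in for the iris table (values are illustrative, not the real dataset)
df = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'sepal_length': [4.3, 5.8, 4.9, 7.9],
})

# Longest and shortest sepal_length per species
agg = df.groupby('species').sepal_length.agg(['max', 'min'])

# Difference between the longest and shortest per species
agg['range'] = agg['max'] - agg['min']
print(agg)
```

The analogous Blaze query would follow the `by` pattern shown above, e.g. `by(iris.species, longest=iris.sepal_length.max(), shortest=iris.sepal_length.min())`.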
700 | { 701 | "cell_type": "markdown", 702 | "metadata": {}, 703 | "source": [ 704 | "### This is similar to how we solve these problems in Pandas\n", 705 | "\n", 706 | "So far, everything we've seen is similar to solving problems in Pandas" 707 | ] 708 | }, 709 | { 710 | "cell_type": "code", 711 | "collapsed": false, 712 | "input": [ 713 | "import pandas as pd\n", 714 | "df = pd.read_csv('data/iris.csv')\n", 715 | "df.groupby(df.species).sepal_length.min()" 716 | ], 717 | "language": "python", 718 | "metadata": {}, 719 | "outputs": [ 720 | { 721 | "metadata": {}, 722 | "output_type": "pyout", 723 | "prompt_number": 22, 724 | "text": [ 725 | "species\n", 726 | "Iris-setosa 4.3\n", 727 | "Iris-versicolor 4.9\n", 728 | "Iris-virginica 4.9\n", 729 | "Name: sepal_length, dtype: float64" 730 | ] 731 | } 732 | ], 733 | "prompt_number": 22 734 | }, 735 | { 736 | "cell_type": "markdown", 737 | "metadata": {}, 738 | "source": [ 739 | "In fact, for small CSV files like this, Blaze *uses Pandas*, so one might consider just using Pandas directly.\n", 740 | "\n", 741 | "Blaze becomes more useful when we interact with data stored in different systems like SQL databases in the next section." 
742 | ] 743 | } 744 | ], 745 | "metadata": {} 746 | } 747 | ] 748 | } -------------------------------------------------------------------------------- /01-into-Introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:05ae0cb667432cc3cca0cfc359be4b95108d55951367bf9229e536a93c4eba5c" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "\"Blaze\n", 18 | "\n", 19 | "# Getting started with `into`\n", 20 | "\n", 21 | "* Full tutorial available at http://github.com/ContinuumIO/blaze-tutorial\n", 22 | "* Install software with `conda install -c blaze blaze`" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "# 0. Introduction\n", 30 | "\n", 31 | "## Describing Data\n", 32 | "\n", 33 | "We describe \"data\" with a few attributes\n", 34 | "\n", 35 | "1. Where that data lives (in a file on your laptop, in the cloud, etc..)\n", 36 | "2. What data-types we use (we represent names as `strings` and balances as `float32`s)\n", 37 | "3. What storage format we use (we store in CSV, in a PostgreSQL database, in JSON)\n", 38 | "4. The semantic values that those bits represent (Barack Obama, or the number 3)\n", 39 | "\n", 40 | "As analysts we care only about point 4, the values that our data represent. Points 1-3 are incidental to how we use computers; these points only get in the way of analysis.\n", 41 | "\n", 42 | "As computationalists though we care very deeply about points 1-3. The choice of format, location, and datatype *strongly* impact the efficiency and correctness of our computations. 
Good choices here can mean the difference between waiting overnight and using our data *interactively*.\n", 43 | "\n", 44 | "Unfortunately points 1-3 encompass a lot of complexity and change more quickly than most analysts care to manage.\n", 45 | "\n", 46 | "The `into` project alleviates the pain of dealing with the first three points by providing intuitive description and transfer between data formats and storage systems. This allows analysts to quickly reason about and migrate their data to efficient, correct, and resilient formats." 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "### Motivating Example\n", 54 | "\n", 55 | "Before we start small with the tutorial we give a more comprehensive example.\n", 56 | "\n", 57 | "We have a small CSV file holding the iris data" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "collapsed": false, 63 | "input": [ 64 | "!head -5 data/iris.csv" 65 | ], 66 | "language": "python", 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "output_type": "stream", 71 | "stream": "stdout", 72 | "text": [ 73 | "sepal_length,sepal_width,petal_length,petal_width,species\r\n", 74 | "5.1,3.5,1.4,0.2,Iris-setosa\r\n", 75 | "4.9,3.0,1.4,0.2,Iris-setosa\r\n", 76 | "4.7,3.2,1.3,0.2,Iris-setosa\r\n", 77 | "4.6,3.1,1.5,0.2,Iris-setosa\r\n" 78 | ] 79 | } 80 | ], 81 | "prompt_number": 1 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "We put this data into a list, a NumPy array, and a SQLite database. We move data to three very different technologies with the same abstraction." 
88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "collapsed": false, 93 | "input": [ 94 | "from into import into\n", 95 | "into(list, 'data/iris.csv')[:5]" 96 | ], 97 | "language": "python", 98 | "metadata": {}, 99 | "outputs": [ 100 | { 101 | "metadata": {}, 102 | "output_type": "pyout", 103 | "prompt_number": 2, 104 | "text": [ 105 | "[(5.1, 3.5, 1.4, 0.2, u'Iris-setosa'),\n", 106 | " (4.9, 3.0, 1.4, 0.2, u'Iris-setosa'),\n", 107 | " (4.7, 3.2, 1.3, 0.2, u'Iris-setosa'),\n", 108 | " (4.6, 3.1, 1.5, 0.2, u'Iris-setosa'),\n", 109 | " (5.0, 3.6, 1.4, 0.2, u'Iris-setosa')]" 110 | ] 111 | } 112 | ], 113 | "prompt_number": 2 114 | }, 115 | { 116 | "cell_type": "code", 117 | "collapsed": false, 118 | "input": [ 119 | "import numpy as np\n", 120 | "into(np.ndarray, 'data/iris.csv')[:5]" 121 | ], 122 | "language": "python", 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "metadata": {}, 127 | "output_type": "pyout", 128 | "prompt_number": 3, 129 | "text": [ 130 | "rec.array([(5.1, 3.5, 1.4, 0.2, u'Iris-setosa'),\n", 131 | " (4.9, 3.0, 1.4, 0.2, u'Iris-setosa'),\n", 132 | " (4.7, 3.2, 1.3, 0.2, u'Iris-setosa'),\n", 133 | " (4.6, 3.1, 1.5, 0.2, u'Iris-setosa'),\n", 134 | " (5.0, 3.6, 1.4, 0.2, u'Iris-setosa')], \n", 135 | " dtype=[('sepal_length', ', nullable=False), Column('sepal_width', FLOAT(), table=, nullable=False), Column('petal_length', FLOAT(), table=, nullable=False), Column('petal_width', FLOAT(), table=, nullable=False), Column('species', TEXT(), table=, nullable=False), schema=None)" 156 | ] 157 | } 158 | ], 159 | "prompt_number": 4 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "metadata": {}, 164 | "source": [ 165 | "### Outline\n", 166 | "\n", 167 | "Into moves data between formats intuitively.\n", 168 | "\n", 169 | "We structure this tutorial as follows:\n", 170 | "\n", 171 | "1. **Basic How-to**: on how to use this library effectively\n", 172 | "2. **Datatypes**: to enhance performance\n", 173 | "3. 
**Internals**" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": {}, 179 | "source": [ 180 | "`into` is available on conda\n", 181 | "\n", 182 | " conda install into\n", 183 | " or \n", 184 | " conda install into -c blaze # Up-to-date version\n", 185 | " \n", 186 | "or on PyPI\n", 187 | "\n", 188 | " pip install into\n", 189 | " or \n", 190 | " pip install git+http://github.com/ContinuumIO/into.git # Up-to-date version\n", 191 | " " 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "# 1. Basic Usage" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "## 1.1 `into`" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "collapsed": false, 211 | "input": [ 212 | "from into import into" 213 | ], 214 | "language": "python", 215 | "metadata": {}, 216 | "outputs": [], 217 | "prompt_number": 5 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "Into takes two arguments, a target and a source\n", 224 | "\n", 225 | " into(target, source)\n", 226 | " \n", 227 | "And it turns the source into something like the target" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "### The `target` can be a type\n", 235 | "\n", 236 | "In which case it makes a new object of that type" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "collapsed": false, 242 | "input": [ 243 | "import numpy as np\n", 244 | "into(np.ndarray, [1, 2, 3])" 245 | ], 246 | "language": "python", 247 | "metadata": {}, 248 | "outputs": [ 249 | { 250 | "metadata": {}, 251 | "output_type": "pyout", 252 | "prompt_number": 6, 253 | "text": [ 254 | "array([1, 2, 3])" 255 | ] 256 | } 257 | ], 258 | "prompt_number": 6 259 | }, 260 | { 261 | "cell_type": "code", 262 | "collapsed": false, 263 | "input": [ 264 | "into(set, [1, 2, 3])" 265 | ], 266 | "language": "python", 267 | "metadata": {}, 268 
| "outputs": [ 269 | { 270 | "metadata": {}, 271 | "output_type": "pyout", 272 | "prompt_number": 7, 273 | "text": [ 274 | "{1, 2, 3}" 275 | ] 276 | } 277 | ], 278 | "prompt_number": 7 279 | }, 280 | { 281 | "cell_type": "code", 282 | "collapsed": false, 283 | "input": [ 284 | "import pandas as pd\n", 285 | "into(pd.Series, (10, 20, 30))" 286 | ], 287 | "language": "python", 288 | "metadata": {}, 289 | "outputs": [ 290 | { 291 | "metadata": {}, 292 | "output_type": "pyout", 293 | "prompt_number": 8, 294 | "text": [ 295 | "0 10\n", 296 | "1 20\n", 297 | "2 30\n", 298 | "dtype: int64" 299 | ] 300 | } 301 | ], 302 | "prompt_number": 8 303 | }, 304 | { 305 | "cell_type": "markdown", 306 | "metadata": {}, 307 | "source": [ 308 | "#### Exercise\n", 309 | "\n", 310 | "Use into to turn the following DataFrame into an `np.ndarray` and a `list`" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "collapsed": false, 316 | "input": [ 317 | "df = pd.DataFrame([['Alice', 100],\n", 318 | " ['Bob', 200],\n", 319 | " ['Charlie', 300]], columns=['name', 'balance'])" 320 | ], 321 | "language": "python", 322 | "metadata": {}, 323 | "outputs": [], 324 | "prompt_number": 9 325 | }, 326 | { 327 | "cell_type": "code", 328 | "collapsed": false, 329 | "input": [ 330 | "# Turn df into an np.ndarray\n", 331 | "# into(..., ...)" 332 | ], 333 | "language": "python", 334 | "metadata": {}, 335 | "outputs": [], 336 | "prompt_number": 10 337 | }, 338 | { 339 | "cell_type": "code", 340 | "collapsed": false, 341 | "input": [ 342 | "# Turn df into a list\n", 343 | "# into(..., ...)" 344 | ], 345 | "language": "python", 346 | "metadata": {}, 347 | "outputs": [], 348 | "prompt_number": 11 349 | }, 350 | { 351 | "cell_type": "code", 352 | "collapsed": false, 353 | "input": [], 354 | "language": "python", 355 | "metadata": {}, 356 | "outputs": [], 357 | "prompt_number": 11 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "### The `target` can be an 
object\n", 364 | "\n", 365 | "In which case we append the source to that object." 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "collapsed": false, 371 | "input": [ 372 | "target = []\n", 373 | "into(target, (1, 2, 3))\n", 374 | "into(target, (1, 2, 3))\n", 375 | "into(target, (1, 2, 3))\n", 376 | "target" 377 | ], 378 | "language": "python", 379 | "metadata": {}, 380 | "outputs": [ 381 | { 382 | "metadata": {}, 383 | "output_type": "pyout", 384 | "prompt_number": 12, 385 | "text": [ 386 | "[1, 2, 3, 1, 2, 3, 1, 2, 3]" 387 | ] 388 | } 389 | ], 390 | "prompt_number": 12 391 | }, 392 | { 393 | "cell_type": "markdown", 394 | "metadata": {}, 395 | "source": [ 396 | "#### Exercise\n", 397 | "\n", 398 | "Use `into` to make a set holding all the data in the following list of DataFrames." 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "collapsed": false, 404 | "input": [ 405 | "L = [pd.DataFrame({'name': ['Alice', 'Bob'], 'balance': [100, 200]}),\n", 406 | " pd.DataFrame({'name': ['Charlie', 'Dan'], 'balance': [300, 400]}),\n", 407 | " pd.DataFrame({'name': ['Edith', 'Frank'], 'balance': [500, 600]})]" 408 | ], 409 | "language": "python", 410 | "metadata": {}, 411 | "outputs": [], 412 | "prompt_number": 13 413 | }, 414 | { 415 | "cell_type": "code", 416 | "collapsed": false, 417 | "input": [ 418 | "s = set()\n", 419 | "# Use into and some kind of for loop to put all of the data in L into the set s\n" 420 | ], 421 | "language": "python", 422 | "metadata": {}, 423 | "outputs": [], 424 | "prompt_number": 14 425 | }, 426 | { 427 | "cell_type": "code", 428 | "collapsed": false, 429 | "input": [], 430 | "language": "python", 431 | "metadata": {}, 432 | "outputs": [], 433 | "prompt_number": 14 434 | }, 435 | { 436 | "cell_type": "markdown", 437 | "metadata": {}, 438 | "source": [ 439 | "#### Exercise\n", 440 | "\n", 441 | "Repeat the last exercise but append all of the data onto a `tuple`. What do you expect to happen?" 
442 | ] 443 | }, 444 | { 445 | "cell_type": "code", 446 | "collapsed": false, 447 | "input": [ 448 | "t = tuple()\n" 449 | ], 450 | "language": "python", 451 | "metadata": {}, 452 | "outputs": [], 453 | "prompt_number": 15 454 | }, 455 | { 456 | "cell_type": "code", 457 | "collapsed": false, 458 | "input": [], 459 | "language": "python", 460 | "metadata": {}, 461 | "outputs": [], 462 | "prompt_number": 15 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "metadata": {}, 467 | "source": [ 468 | "It's important to know that `into` has limitations. " 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "metadata": {}, 474 | "source": [ 475 | "### The target can be a string\n", 476 | "\n", 477 | "Many data sources external to Python (like a CSV file) don't have a Python object that we can put as the source or target. In these cases we use string URIs. Examples of strings include\n", 478 | "\n", 479 | " myfile.csv\n", 480 | " myfile.json\n", 481 | " myfile.hdf5\n", 482 | " myfile.hdf5::/data\n", 483 | " sqlite:///myfile.db::table-name\n", 484 | " postgresql://user:password@host:port/database::table-name\n", 485 | " ...\n", 486 | " \n", 487 | "These can go either in the source or target inputs.\n", 488 | "\n", 489 | "Here we write our dataframe to a CSV file" 490 | ] 491 | }, 492 | { 493 | "cell_type": "code", 494 | "collapsed": false, 495 | "input": [ 496 | "# Write DataFrame to CSV file\n", 497 | "into('accounts.csv', df)" 498 | ], 499 | "language": "python", 500 | "metadata": {}, 501 | "outputs": [ 502 | { 503 | "metadata": {}, 504 | "output_type": "pyout", 505 | "prompt_number": 16, 506 | "text": [ 507 | "" 508 | ] 509 | } 510 | ], 511 | "prompt_number": 16 512 | }, 513 | { 514 | "cell_type": "code", 515 | "collapsed": false, 516 | "input": [ 517 | "# print out text in accounts.csv\n", 518 | "!head accounts.csv" 519 | ], 520 | "language": "python", 521 | "metadata": {}, 522 | "outputs": [ 523 | { 524 | "output_type": "stream", 525 | "stream": "stdout", 526 | 
"text": [ 527 | "name,balance\r\n", 528 | "Alice,100\r\n", 529 | "Bob,200\r\n", 530 | "Charlie,300\r\n" 531 | ] 532 | } 533 | ], 534 | "prompt_number": 17 535 | }, 536 | { 537 | "cell_type": "code", 538 | "collapsed": false, 539 | "input": [ 540 | "# Read CSV file into memory as list\n", 541 | "into(list, 'accounts.csv')" 542 | ], 543 | "language": "python", 544 | "metadata": {}, 545 | "outputs": [ 546 | { 547 | "metadata": {}, 548 | "output_type": "pyout", 549 | "prompt_number": 18, 550 | "text": [ 551 | "[(u'Alice', 100), (u'Bob', 200), (u'Charlie', 300)]" 552 | ] 553 | } 554 | ], 555 | "prompt_number": 18 556 | }, 557 | { 558 | "cell_type": "markdown", 559 | "metadata": {}, 560 | "source": [ 561 | "#### Exercise\n", 562 | "\n", 563 | "Read the contents of the file `'data/iris.csv'` into a `pd.DataFrame`. Then write that dataframe to `'data/iris.json'`. Inspect the JSON data to ensure that it came out correctly." 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "collapsed": false, 569 | "input": [], 570 | "language": "python", 571 | "metadata": {}, 572 | "outputs": [], 573 | "prompt_number": 18 574 | }, 575 | { 576 | "cell_type": "code", 577 | "collapsed": false, 578 | "input": [], 579 | "language": "python", 580 | "metadata": {}, 581 | "outputs": [], 582 | "prompt_number": 18 583 | }, 584 | { 585 | "cell_type": "markdown", 586 | "metadata": {}, 587 | "source": [ 588 | "#### Exercise\n", 589 | "\n", 590 | "Write the contents of your JSON file to a SQLite database using the following URI as the target\n", 591 | "\n", 592 | " sqlite:///data/my.db::iris\n", 593 | " \n", 594 | "Then read data from that SQLite database into Python to make sure that it arrived safely." 
595 | ] 596 | }, 597 | { 598 | "cell_type": "code", 599 | "collapsed": false, 600 | "input": [], 601 | "language": "python", 602 | "metadata": {}, 603 | "outputs": [], 604 | "prompt_number": 18 605 | }, 606 | { 607 | "cell_type": "code", 608 | "collapsed": false, 609 | "input": [], 610 | "language": "python", 611 | "metadata": {}, 612 | "outputs": [], 613 | "prompt_number": 18 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 | "## 1.2 `resource`\n", 620 | "\n", 621 | "Much interesting data lives *outside* of Python. As we just saw, we often use URIs to specify this kind of data.\n", 622 | "\n", 623 | "Here we load a bit of a SQL database on baseball statistics ([download here](https://github.com/jknecht/baseball-archive-sqlite/raw/master/lahman2013.sqlite)) into memory as a list" 624 | ] 625 | }, 626 | { 627 | "cell_type": "code", 628 | "collapsed": false, 629 | "input": [ 630 | "into(list, 'sqlite:///data/lahman2013.sqlite::BattingPost')[:5]" 631 | ], 632 | "language": "python", 633 | "metadata": {}, 634 | "outputs": [ 635 | { 636 | "metadata": {}, 637 | "output_type": "pyout", 638 | "prompt_number": 19, 639 | "text": [ 640 | "[(1884, u'WS', u'becanbu01', u'NY4', u'AA', 1, 2, 0, 1, 0, 0, 0, 0, 0, None, 0, 0, 0, None, None, None, None),\n", 641 | " (1884, u'WS', u'bradyst01', u'NY4', u'AA', 3, 10, 1, 0, 0, 0, 0, 0, 0, None, 0, 1, 0, None, None, None, None),\n", 642 | " (1884, u'WS', u'carrocl01', u'PRO', u'NL', 3, 10, 2, 1, 0, 0, 0, 1, 0, None, 1, 1, 0, None, None, None, None),\n", 643 | " (1884, u'WS', u'dennyje01', u'PRO', u'NL', 3, 9, 3, 4, 0, 1, 1, 2, 0, None, 0, 3, 0, None, None, None, None),\n", 644 | " (1884, u'WS', u'esterdu01', u'NY4', u'AA', 3, 10, 0, 3, 1, 0, 0, 0, 1, None, 0, 3, 0, None, None, None, None)]" 645 | ] 646 | } 647 | ], 648 | "prompt_number": 19 649 | }, 650 | { 651 | "cell_type": "markdown", 652 | "metadata": {}, 653 | "source": [ 654 | "Now we learn how these strings work so that we can specify 
many types of external data.\n", 655 | "\n", 656 | "Internally `into` uses the function `resource` to turn a string into a Python proxy object. Usually these objects don't hold the data themselves. They just serve as useful pointers to where the data lives. In most cases we use other Python projects for proxy objects.\n", 657 | "\n", 658 | "In the case of SQL tables, resource returns a `sqlalchemy.Table` object." 659 | ] 660 | }, 661 | { 662 | "cell_type": "code", 663 | "collapsed": false, 664 | "input": [ 665 | "from into import resource\n", 666 | "t = resource('sqlite:///data/lahman2013.sqlite::BattingPost')" 667 | ], 668 | "language": "python", 669 | "metadata": {}, 670 | "outputs": [], 671 | "prompt_number": 20 672 | }, 673 | { 674 | "cell_type": "code", 675 | "collapsed": false, 676 | "input": [ 677 | "type(t)" 678 | ], 679 | "language": "python", 680 | "metadata": {}, 681 | "outputs": [ 682 | { 683 | "metadata": {}, 684 | "output_type": "pyout", 685 | "prompt_number": 21, 686 | "text": [ 687 | "sqlalchemy.sql.schema.Table" 688 | ] 689 | } 690 | ], 691 | "prompt_number": 21 692 | }, 693 | { 694 | "cell_type": "code", 695 | "collapsed": false, 696 | "input": [ 697 | "t" 698 | ], 699 | "language": "python", 700 | "metadata": {}, 701 | "outputs": [ 702 | { 703 | "metadata": {}, 704 | "output_type": "pyout", 705 | "prompt_number": 22, 706 | "text": [ 707 | "Table('BattingPost', MetaData(bind=Engine(sqlite:///data/lahman2013.sqlite)), Column('yearID', INTEGER(), table=), Column('round', TEXT(), table=), Column('playerID', TEXT(), table=), Column('teamID', TEXT(), table=), Column('lgID', TEXT(), table=), Column('G', INTEGER(), table=), Column('AB', INTEGER(), table=), Column('R', INTEGER(), table=), Column('H', INTEGER(), table=), Column('2B', INTEGER(), table=), Column('3B', INTEGER(), table=), Column('HR', INTEGER(), table=), Column('RBI', INTEGER(), table=), Column('SB', INTEGER(), table=), Column('CS', INTEGER(), table=), Column('BB', INTEGER(), table=), 
Column('SO', INTEGER(), table=), Column('IBB', INTEGER(), table=), Column('HBP', INTEGER(), table=), Column('SH', INTEGER(), table=), Column('SF', INTEGER(), table=), Column('GIDP', INTEGER(), table=), schema=None)" 708 | ] 709 | } 710 | ], 711 | "prompt_number": 22 712 | }, 713 | { 714 | "cell_type": "markdown", 715 | "metadata": {}, 716 | "source": [ 717 | "We use *this* object as the `into` source." 718 | ] 719 | }, 720 | { 721 | "cell_type": "code", 722 | "collapsed": false, 723 | "input": [ 724 | "into(list, t)[:5]" 725 | ], 726 | "language": "python", 727 | "metadata": {}, 728 | "outputs": [ 729 | { 730 | "metadata": {}, 731 | "output_type": "pyout", 732 | "prompt_number": 23, 733 | "text": [ 734 | "[(1884, u'WS', u'becanbu01', u'NY4', u'AA', 1, 2, 0, 1, 0, 0, 0, 0, 0, None, 0, 0, 0, None, None, None, None),\n", 735 | " (1884, u'WS', u'bradyst01', u'NY4', u'AA', 3, 10, 1, 0, 0, 0, 0, 0, 0, None, 0, 1, 0, None, None, None, None),\n", 736 | " (1884, u'WS', u'carrocl01', u'PRO', u'NL', 3, 10, 2, 1, 0, 0, 0, 1, 0, None, 1, 1, 0, None, None, None, None),\n", 737 | " (1884, u'WS', u'dennyje01', u'PRO', u'NL', 3, 9, 3, 4, 0, 1, 1, 2, 0, None, 0, 3, 0, None, None, None, None),\n", 738 | " (1884, u'WS', u'esterdu01', u'NY4', u'AA', 3, 10, 0, 3, 1, 0, 0, 0, 1, None, 0, 3, 0, None, None, None, None)]" 739 | ] 740 | } 741 | ], 742 | "prompt_number": 23 743 | }, 744 | { 745 | "cell_type": "markdown", 746 | "metadata": {}, 747 | "source": [ 748 | "So when you write\n", 749 | "\n", 750 | "```python\n", 751 | "into(list, 'some-string')\n", 752 | "```\n", 753 | "\n", 754 | "It is actually just shorthand for\n", 755 | "\n", 756 | "```python\n", 757 | "into(list, resource('some-string'))\n", 758 | "```" 759 | ] 760 | }, 761 | { 762 | "cell_type": "markdown", 763 | "metadata": {}, 764 | "source": [ 765 | "#### Exercise\n", 766 | "\n", 767 | "We have some data in the `data/` directory. 
Use `resource` on each of the following strings to see what it returns.\n", 768 | "\n", 769 | "    sqlite:///data/lahman2013.sqlite::BattingPost\n", 770 | "    sqlite:///data/lahman2013.sqlite::Salaries\n", 771 | "    sqlite:///data/lahman2013.sqlite\n", 772 | "    data/sample.hdf5::/points\n", 773 | "    data/sample.hdf5\n", 774 | "    data/iris.csv" 775 | ] 776 | }, 777 | { 778 | "cell_type": "code", 779 | "collapsed": false, 780 | "input": [], 781 | "language": "python", 782 | "metadata": {}, 783 | "outputs": [] 784 | }, 785 | { 786 | "cell_type": "code", 787 | "collapsed": false, 788 | "input": [], 789 | "language": "python", 790 | "metadata": {}, 791 | "outputs": [] 792 | }, 793 | { 794 | "cell_type": "code", 795 | "collapsed": false, 796 | "input": [], 797 | "language": "python", 798 | "metadata": {}, 799 | "outputs": [] 800 | }, 801 | { 802 | "cell_type": "code", 803 | "collapsed": false, 804 | "input": [], 805 | "language": "python", 806 | "metadata": {}, 807 | "outputs": [] 808 | }, 809 | { 810 | "cell_type": "code", 811 | "collapsed": false, 812 | "input": [], 813 | "language": "python", 814 | "metadata": {}, 815 | "outputs": [] 816 | }, 817 | { 818 | "cell_type": "code", 819 | "collapsed": false, 820 | "input": [], 821 | "language": "python", 822 | "metadata": {}, 823 | "outputs": [] 824 | }, 825 | { 826 | "cell_type": "markdown", 827 | "metadata": {}, 828 | "source": [ 829 | "### The `::` separator\n", 830 | "\n", 831 | "In the last exercise we saw the following URI for a table in a SQLite database\n", 832 | "\n", 833 | "    sqlite:///data/my.db::iris\n", 834 | "    \n", 835 | "We deconstruct this URI to make it clearer. First we split the URI on `::` to separate the database from the table name\n", 836 | "\n", 837 | "    Database:   sqlite:///data/my.db\n", 838 | "    Table name: iris\n", 839 | "    \n", 840 | "We use the `::` separator whenever datasets live within some nested structure like a database or HDF5 file."
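The split on `::` described above can be sketched in plain Python. This is a hypothetical helper for illustration only, not the actual parsing logic inside `into`:

```python
def split_uri(uri):
    """Split an into-style URI on '::' into (resource, subpath).

    Toy illustration of the convention: everything before the last '::'
    names the container (database, HDF5 file), everything after names the
    dataset inside it. The real `into` parser handles more cases.
    """
    if '::' in uri:
        database, table = uri.rsplit('::', 1)
        return database, table
    return uri, None

print(split_uri('sqlite:///data/my.db::iris'))    # ('sqlite:///data/my.db', 'iris')
print(split_uri('data/sample.hdf5::/points'))     # ('data/sample.hdf5', '/points')
print(split_uri('data/iris.csv'))                 # ('data/iris.csv', None)
```

Splitting on the *last* `::` keeps container paths that themselves contain colons intact.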
841 | ] 842 | }, 843 | { 844 | "cell_type": "markdown", 845 | "metadata": {}, 846 | "source": [ 847 | "### Specifying protocols with `://`\n", 848 | "\n", 849 | "The database string `sqlite:///data/my.db` is specific to SQLAlchemy, but follows a common format, notably\n", 850 | "\n", 851 | "    Protocol:  sqlite://\n", 852 | "    Filename:  data/my.db\n", 853 | "    \n", 854 | "Into also uses protocols in many cases to give extra hints on how to handle your data. For example Python has a few different libraries to handle HDF5 files (`h5py`, `pytables`, `pandas.HDFStore`). By default when we see a URI like `myfile.hdf5` we currently use `h5py`. To override this behavior you can add a protocol string like `hdfstore://myfile.hdf5` to indicate that you want the special `pandas.HDFStore` format.\n", 855 | "\n", 856 | "*Note:* SQLAlchemy strings are a little odd in that they use three slashes by default (e.g. `sqlite:///my.db`) and *four* slashes when using absolute paths (e.g. `sqlite:////Users/Alice/data/my.db`)." 857 | ] 858 | }, 859 | { 860 | "cell_type": "markdown", 861 | "metadata": {}, 862 | "source": [ 863 | "### Exercise\n", 864 | "\n", 865 | "People use the `.json` extension in two ways.\n", 866 | "\n", 867 | "1. The entire file is one JSON blob, often a list of dictionaries. (Traditional JSON)\n", 868 | "2. Each line of the file is one JSON blob. (Line-delimited JSON, or JSON-lines)\n", 869 | "\n", 870 | "Parsers have a hard time figuring out which case is which. When reading an existing file `into` can usually figure out if the file is line-delimited or not. When creating a file, however, we don't know what your intention is. You can specify your intention by adding either a `json://` or `jsonlines://` protocol." 871 | ] 872 | }, 873 | { 874 | "cell_type": "markdown", 875 | "metadata": {}, 876 | "source": [ 877 | "Here we write our DataFrame to a JSON file."
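The two `.json` conventions can be illustrated with the standard library alone. This sketch writes both layouts by hand rather than through `into`; the file names and temp directory are just for the example:

```python
import json
import os
import tempfile

records = [{"name": "Alice", "balance": 100},
           {"name": "Bob", "balance": 200}]

workdir = tempfile.mkdtemp()

# Traditional JSON: the whole file is a single blob.
with open(os.path.join(workdir, "accounts.json"), "w") as f:
    json.dump(records, f)

# Line-delimited JSON: one complete JSON blob per line.
jsonl_path = os.path.join(workdir, "accounts.jsonl")
with open(jsonl_path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Reading line-delimited JSON back means parsing each line separately.
with open(jsonl_path) as f:
    roundtrip = [json.loads(line) for line in f]

assert roundtrip == records
```

Line-delimited files can be appended to and streamed one record at a time, which is why large datasets often prefer that layout.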
878 | ] 879 | }, 880 | { 881 | "cell_type": "code", 882 | "collapsed": true, 883 | "input": [ 884 | "into('accounts.json', df)\n", 885 | "!head accounts.json" 886 | ], 887 | "language": "python", 888 | "metadata": {}, 889 | "outputs": [ 890 | { 891 | "output_type": "stream", 892 | "stream": "stdout", 893 | "text": [ 894 | "[{\"balance\": 100, \"name\": \"Alice\"}, {\"balance\": 200, \"name\": \"Bob\"}, {\"balance\": 300, \"name\": \"Charlie\"}]" 895 | ] 896 | } 897 | ], 898 | "prompt_number": 24 899 | }, 900 | { 901 | "cell_type": "code", 902 | "collapsed": false, 903 | "input": [ 904 | "!rm accounts.json # Remove old file" 905 | ], 906 | "language": "python", 907 | "metadata": {}, 908 | "outputs": [], 909 | "prompt_number": 25 910 | }, 911 | { 912 | "cell_type": "markdown", 913 | "metadata": {}, 914 | "source": [ 915 | "This is the traditional single-JSON-blob-per-file format. \n", 916 | "\n", 917 | "Instead write our DataFrame in the line-delimited format by adding a `jsonlines://` protocol to the target string. Inspect the result to make sure that each line is a separate valid JSON blob." 918 | ] 919 | }, 920 | { 921 | "cell_type": "code", 922 | "collapsed": false, 923 | "input": [], 924 | "language": "python", 925 | "metadata": {}, 926 | "outputs": [] 927 | }, 928 | { 929 | "cell_type": "code", 930 | "collapsed": false, 931 | "input": [], 932 | "language": "python", 933 | "metadata": {}, 934 | "outputs": [] 935 | } 936 | ], 937 | "metadata": {} 938 | } 939 | ] 940 | } -------------------------------------------------------------------------------- /00-Motivation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:f41f038d7ff89a4de13d4540c9f9b364d8886a157a3b10a3162b144b3437a86c" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "# 0. 
Motivation\n", 16 | "\n", 17 | "Where we store data and how we interact with it are unfortunately related." 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "## Access SQL Database\n", 25 | "\n", 26 | "For example, we often access SQL databases using SQL query strings" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "collapsed": false, 32 | "input": [ 33 | "import sqlalchemy\n", 34 | "engine = sqlalchemy.create_engine('sqlite:////home/mrocklin/workspace/blaze/blaze/examples/data/iris.db')\n", 35 | "engine" 36 | ], 37 | "language": "python", 38 | "metadata": {}, 39 | "outputs": [ 40 | { 41 | "metadata": {}, 42 | "output_type": "pyout", 43 | "prompt_number": 1, 44 | "text": [ 45 | "Engine(sqlite:////home/mrocklin/workspace/blaze/blaze/examples/data/iris.db)" 46 | ] 47 | } 48 | ], 49 | "prompt_number": 1 50 | }, 51 | { 52 | "cell_type": "code", 53 | "collapsed": false, 54 | "input": [ 55 | "conn = engine.connect()\n", 56 | "list(conn.execute('''SELECT petal_length, petal_width, sepal_length, sepal_width, species\n", 57 | " FROM iris\n", 58 | " LIMIT 10'''))" 59 | ], 60 | "language": "python", 61 | "metadata": {}, 62 | "outputs": [ 63 | { 64 | "metadata": {}, 65 | "output_type": "pyout", 66 | "prompt_number": 2, 67 | "text": [ 68 | "[(1.4, 0.2, 5.1, 3.5, u'Iris-setosa'),\n", 69 | " (1.4, 0.2, 4.9, 3.0, u'Iris-setosa'),\n", 70 | " (1.3, 0.2, 4.7, 3.2, u'Iris-setosa'),\n", 71 | " (1.5, 0.2, 4.6, 3.1, u'Iris-setosa'),\n", 72 | " (1.4, 0.2, 5.0, 3.6, u'Iris-setosa'),\n", 73 | " (1.7, 0.4, 5.4, 3.9, u'Iris-setosa'),\n", 74 | " (1.4, 0.3, 4.6, 3.4, u'Iris-setosa'),\n", 75 | " (1.5, 0.2, 5.0, 3.4, u'Iris-setosa'),\n", 76 | " (1.4, 0.2, 4.4, 2.9, u'Iris-setosa'),\n", 77 | " (1.5, 0.1, 4.9, 3.1, u'Iris-setosa')]" 78 | ] 79 | } 80 | ], 81 | "prompt_number": 2 82 | }, 83 | { 84 | "cell_type": "code", 85 | "collapsed": false, 86 | "input": [ 87 | "list(conn.execute('''SELECT avg(petal_length), max(petal_width), species\n", 88 | " 
FROM iris\n", 89 | " GROUP BY species'''))" 90 | ], 91 | "language": "python", 92 | "metadata": {}, 93 | "outputs": [ 94 | { 95 | "metadata": {}, 96 | "output_type": "pyout", 97 | "prompt_number": 3, 98 | "text": [ 99 | "[(1.4620000000000002, 0.6, u'Iris-setosa'),\n", 100 | " (4.26, 1.8, u'Iris-versicolor'),\n", 101 | " (5.552, 2.5, u'Iris-virginica')]" 102 | ] 103 | } 104 | ], 105 | "prompt_number": 3 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "## Load database into memory, interact with Pandas\n", 112 | "\n", 113 | "Many Python users prefer the interactive Pandas DataFrame. We can use Pandas on this data by copying the entire table (actually quite small in this case) into memory and manipulate it directly with Pandas syntax and algorithms." 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "collapsed": false, 119 | "input": [ 120 | "import pandas as pd\n", 121 | "df = pd.read_sql('SELECT * FROM iris', engine)" 122 | ], 123 | "language": "python", 124 | "metadata": {}, 125 | "outputs": [], 126 | "prompt_number": 4 127 | }, 128 | { 129 | "cell_type": "code", 130 | "collapsed": false, 131 | "input": [ 132 | "df" 133 | ], 134 | "language": "python", 135 | "metadata": {}, 136 | "outputs": [ 137 | { 138 | "html": [ 139 | "
\n", 140 | "\n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | 
" \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 
| " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 
601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa
11 4.8 3.4 1.6 0.2 Iris-setosa
12 4.8 3.0 1.4 0.1 Iris-setosa
13 4.3 3.0 1.1 0.1 Iris-setosa
14 5.8 4.0 1.2 0.2 Iris-setosa
15 5.7 4.4 1.5 0.4 Iris-setosa
16 5.4 3.9 1.3 0.4 Iris-setosa
17 5.1 3.5 1.4 0.3 Iris-setosa
18 5.7 3.8 1.7 0.3 Iris-setosa
19 5.1 3.8 1.5 0.3 Iris-setosa
20 5.4 3.4 1.7 0.2 Iris-setosa
21 5.1 3.7 1.5 0.4 Iris-setosa
22 4.6 3.6 1.0 0.2 Iris-setosa
23 5.1 3.3 1.7 0.5 Iris-setosa
24 4.8 3.4 1.9 0.2 Iris-setosa
25 5.0 3.0 1.6 0.2 Iris-setosa
26 5.0 3.4 1.6 0.4 Iris-setosa
27 5.2 3.5 1.5 0.2 Iris-setosa
28 5.2 3.4 1.4 0.2 Iris-setosa
29 4.7 3.2 1.6 0.2 Iris-setosa
..................
120 6.9 3.2 5.7 2.3 Iris-virginica
121 5.6 2.8 4.9 2.0 Iris-virginica
122 7.7 2.8 6.7 2.0 Iris-virginica
123 6.3 2.7 4.9 1.8 Iris-virginica
124 6.7 3.3 5.7 2.1 Iris-virginica
125 7.2 3.2 6.0 1.8 Iris-virginica
126 6.2 2.8 4.8 1.8 Iris-virginica
127 6.1 3.0 4.9 1.8 Iris-virginica
128 6.4 2.8 5.6 2.1 Iris-virginica
129 7.2 3.0 5.8 1.6 Iris-virginica
130 7.4 2.8 6.1 1.9 Iris-virginica
131 7.9 3.8 6.4 2.0 Iris-virginica
132 6.4 2.8 5.6 2.2 Iris-virginica
133 6.3 2.8 5.1 1.5 Iris-virginica
134 6.1 2.6 5.6 1.4 Iris-virginica
135 7.7 3.0 6.1 2.3 Iris-virginica
136 6.3 3.4 5.6 2.4 Iris-virginica
137 6.4 3.1 5.5 1.8 Iris-virginica
138 6.0 3.0 4.8 1.8 Iris-virginica
139 6.9 3.1 5.4 2.1 Iris-virginica
140 6.7 3.1 5.6 2.4 Iris-virginica
141 6.9 3.1 5.1 2.3 Iris-virginica
142 5.8 2.7 5.1 1.9 Iris-virginica
143 6.8 3.2 5.9 2.3 Iris-virginica
144 6.7 3.3 5.7 2.5 Iris-virginica
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
\n", 642 | "

150 rows \u00d7 5 columns

\n", 643 | "
" 644 | ], 645 | "metadata": {}, 646 | "output_type": "pyout", 647 | "prompt_number": 5, 648 | "text": [ 649 | " sepal_length sepal_width petal_length petal_width species\n", 650 | "0 5.1 3.5 1.4 0.2 Iris-setosa\n", 651 | "1 4.9 3.0 1.4 0.2 Iris-setosa\n", 652 | "2 4.7 3.2 1.3 0.2 Iris-setosa\n", 653 | "3 4.6 3.1 1.5 0.2 Iris-setosa\n", 654 | "4 5.0 3.6 1.4 0.2 Iris-setosa\n", 655 | "5 5.4 3.9 1.7 0.4 Iris-setosa\n", 656 | "6 4.6 3.4 1.4 0.3 Iris-setosa\n", 657 | "7 5.0 3.4 1.5 0.2 Iris-setosa\n", 658 | "8 4.4 2.9 1.4 0.2 Iris-setosa\n", 659 | "9 4.9 3.1 1.5 0.1 Iris-setosa\n", 660 | "10 5.4 3.7 1.5 0.2 Iris-setosa\n", 661 | "11 4.8 3.4 1.6 0.2 Iris-setosa\n", 662 | "12 4.8 3.0 1.4 0.1 Iris-setosa\n", 663 | "13 4.3 3.0 1.1 0.1 Iris-setosa\n", 664 | "14 5.8 4.0 1.2 0.2 Iris-setosa\n", 665 | "15 5.7 4.4 1.5 0.4 Iris-setosa\n", 666 | "16 5.4 3.9 1.3 0.4 Iris-setosa\n", 667 | "17 5.1 3.5 1.4 0.3 Iris-setosa\n", 668 | "18 5.7 3.8 1.7 0.3 Iris-setosa\n", 669 | "19 5.1 3.8 1.5 0.3 Iris-setosa\n", 670 | "20 5.4 3.4 1.7 0.2 Iris-setosa\n", 671 | "21 5.1 3.7 1.5 0.4 Iris-setosa\n", 672 | "22 4.6 3.6 1.0 0.2 Iris-setosa\n", 673 | "23 5.1 3.3 1.7 0.5 Iris-setosa\n", 674 | "24 4.8 3.4 1.9 0.2 Iris-setosa\n", 675 | "25 5.0 3.0 1.6 0.2 Iris-setosa\n", 676 | "26 5.0 3.4 1.6 0.4 Iris-setosa\n", 677 | "27 5.2 3.5 1.5 0.2 Iris-setosa\n", 678 | "28 5.2 3.4 1.4 0.2 Iris-setosa\n", 679 | "29 4.7 3.2 1.6 0.2 Iris-setosa\n", 680 | ".. ... ... ... ... 
...\n", 681 | "120 6.9 3.2 5.7 2.3 Iris-virginica\n", 682 | "121 5.6 2.8 4.9 2.0 Iris-virginica\n", 683 | "122 7.7 2.8 6.7 2.0 Iris-virginica\n", 684 | "123 6.3 2.7 4.9 1.8 Iris-virginica\n", 685 | "124 6.7 3.3 5.7 2.1 Iris-virginica\n", 686 | "125 7.2 3.2 6.0 1.8 Iris-virginica\n", 687 | "126 6.2 2.8 4.8 1.8 Iris-virginica\n", 688 | "127 6.1 3.0 4.9 1.8 Iris-virginica\n", 689 | "128 6.4 2.8 5.6 2.1 Iris-virginica\n", 690 | "129 7.2 3.0 5.8 1.6 Iris-virginica\n", 691 | "130 7.4 2.8 6.1 1.9 Iris-virginica\n", 692 | "131 7.9 3.8 6.4 2.0 Iris-virginica\n", 693 | "132 6.4 2.8 5.6 2.2 Iris-virginica\n", 694 | "133 6.3 2.8 5.1 1.5 Iris-virginica\n", 695 | "134 6.1 2.6 5.6 1.4 Iris-virginica\n", 696 | "135 7.7 3.0 6.1 2.3 Iris-virginica\n", 697 | "136 6.3 3.4 5.6 2.4 Iris-virginica\n", 698 | "137 6.4 3.1 5.5 1.8 Iris-virginica\n", 699 | "138 6.0 3.0 4.8 1.8 Iris-virginica\n", 700 | "139 6.9 3.1 5.4 2.1 Iris-virginica\n", 701 | "140 6.7 3.1 5.6 2.4 Iris-virginica\n", 702 | "141 6.9 3.1 5.1 2.3 Iris-virginica\n", 703 | "142 5.8 2.7 5.1 1.9 Iris-virginica\n", 704 | "143 6.8 3.2 5.9 2.3 Iris-virginica\n", 705 | "144 6.7 3.3 5.7 2.5 Iris-virginica\n", 706 | "145 6.7 3.0 5.2 2.3 Iris-virginica\n", 707 | "146 6.3 2.5 5.0 1.9 Iris-virginica\n", 708 | "147 6.5 3.0 5.2 2.0 Iris-virginica\n", 709 | "148 6.2 3.4 5.4 2.3 Iris-virginica\n", 710 | "149 5.9 3.0 5.1 1.8 Iris-virginica\n", 711 | "\n", 712 | "[150 rows x 5 columns]" 713 | ] 714 | } 715 | ], 716 | "prompt_number": 5 717 | }, 718 | { 719 | "cell_type": "code", 720 | "collapsed": false, 721 | "input": [], 722 | "language": "python", 723 | "metadata": {}, 724 | "outputs": [], 725 | "prompt_number": 5 726 | }, 727 | { 728 | "cell_type": "code", 729 | "collapsed": false, 730 | "input": [], 731 | "language": "python", 732 | "metadata": {}, 733 | "outputs": [], 734 | "prompt_number": 5 735 | }, 736 | { 737 | "cell_type": "markdown", 738 | "metadata": {}, 739 | "source": [ 740 | "## Larger databases\n", 741 | "\n", 742 | "Pandas 
provides both a great user experience *and* fast in-memory algorithms. When those algorithms become obsolete (e.g. when datasets grow large) then we're forced to throw away the great user experience and switch back to using SQL.\n", 743 | "\n", 744 | "Here we connect to Hive, a database backed by Hadoop MapReduce." 745 | ] 746 | }, 747 | { 748 | "cell_type": "code", 749 | "collapsed": false, 750 | "input": [ 751 | "engine = sqlalchemy.create_engine('hive://hdfs@54.91.57.226:10000/default')\n", 752 | "conn = engine.connect()\n", 753 | "list(conn.execute('''SELECT * \n", 754 | " FROM iris \n", 755 | " LIMIT 10''')) # Imagine that this was big" 756 | ], 757 | "language": "python", 758 | "metadata": {}, 759 | "outputs": [ 760 | { 761 | "metadata": {}, 762 | "output_type": "pyout", 763 | "prompt_number": 6, 764 | "text": [ 765 | "[(5.099999904632568, 3.5, 1.399999976158142, 0.20000000298023224, u'Iris-setosa'),\n", 766 | " (4.900000095367432, 3.0, 1.399999976158142, 0.20000000298023224, u'Iris-setosa'),\n", 767 | " (4.699999809265137, 3.200000047683716, 1.2999999523162842, 0.20000000298023224, u'Iris-setosa'),\n", 768 | " (4.599999904632568, 3.0999999046325684, 1.5, 0.20000000298023224, u'Iris-setosa'),\n", 769 | " (5.0, 3.5999999046325684, 1.399999976158142, 0.20000000298023224, u'Iris-setosa'),\n", 770 | " (5.400000095367432, 3.9000000953674316, 1.7000000476837158, 0.4000000059604645, u'Iris-setosa'),\n", 771 | " (4.599999904632568, 3.4000000953674316, 1.399999976158142, 0.30000001192092896, u'Iris-setosa'),\n", 772 | " (5.0, 3.4000000953674316, 1.5, 0.20000000298023224, u'Iris-setosa'),\n", 773 | " (4.400000095367432, 2.9000000953674316, 1.399999976158142, 0.20000000298023224, u'Iris-setosa'),\n", 774 | " (4.900000095367432, 3.0999999046325684, 1.5, 0.10000000149011612, u'Iris-setosa')]" 775 | ] 776 | } 777 | ], 778 | "prompt_number": 6 779 | }, 780 | { 781 | "cell_type": "markdown", 782 | "metadata": {}, 783 | "source": [ 784 | "### Blaze provides a Pandas-like 
experience over foreign data\n", 785 | "\n", 786 | "We get the best of both worlds\n", 787 | "\n", 788 | "1. The scalable computation of Hive\n", 789 | "2. A Pandas-like interactive feel" 790 | ] 791 | }, 792 | { 793 | "cell_type": "code", 794 | "collapsed": false, 795 | "input": [ 796 | "from blaze import Data, by\n", 797 | "d = Data('hive://hdfs@54.91.57.226:10000/default')\n", 798 | "d.iris" 799 | ], 800 | "language": "python", 801 | "metadata": {}, 802 | "outputs": [ 803 | { 804 | "html": [ 805 | "\n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | " \n", 858 | " \n", 859 | " \n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa
" 907 | ], 908 | "metadata": {}, 909 | "output_type": "pyout", 910 | "prompt_number": 7, 911 | "text": [ 912 | " sepal_length sepal_width petal_length petal_width species\n", 913 | "0 5.1 3.5 1.4 0.2 Iris-setosa\n", 914 | "1 4.9 3.0 1.4 0.2 Iris-setosa\n", 915 | "2 4.7 3.2 1.3 0.2 Iris-setosa\n", 916 | "3 4.6 3.1 1.5 0.2 Iris-setosa\n", 917 | "4 5.0 3.6 1.4 0.2 Iris-setosa\n", 918 | "5 5.4 3.9 1.7 0.4 Iris-setosa\n", 919 | "6 4.6 3.4 1.4 0.3 Iris-setosa\n", 920 | "7 5.0 3.4 1.5 0.2 Iris-setosa\n", 921 | "8 4.4 2.9 1.4 0.2 Iris-setosa\n", 922 | "9 4.9 3.1 1.5 0.1 Iris-setosa\n", 923 | "..." 924 | ] 925 | } 926 | ], 927 | "prompt_number": 7 928 | }, 929 | { 930 | "cell_type": "code", 931 | "collapsed": false, 932 | "input": [ 933 | "by(d.iris.species, largest=d.iris.sepal_length.max(),\n", 934 | " smallest=d.iris.sepal_length.min())" 935 | ], 936 | "language": "python", 937 | "metadata": {}, 938 | "outputs": [ 939 | { 940 | "html": [ 941 | "\n", 942 | " \n", 943 | " \n", 944 | " \n", 945 | " \n", 946 | " \n", 947 | " \n", 948 | " \n", 949 | " \n", 950 | " \n", 951 | " \n", 952 | " \n", 953 | " \n", 954 | " \n", 955 | " \n", 956 | " \n", 957 | " \n", 958 | " \n", 959 | " \n", 960 | " \n", 961 | " \n", 962 | " \n", 963 | " \n", 964 | " \n", 965 | " \n", 966 | " \n", 967 | " \n", 968 | " \n", 969 | " \n", 970 | "
specieslargestsmallest
0 Iris-setosa 5.8 4.3
1 Iris-versicolor 7.0 4.9
2 Iris-virginica 7.9 4.9
" 971 | ], 972 | "metadata": {}, 973 | "output_type": "pyout", 974 | "prompt_number": 8, 975 | "text": [ 976 | " species largest smallest\n", 977 | "0 Iris-setosa 5.8 4.3\n", 978 | "1 Iris-versicolor 7.0 4.9\n", 979 | "2 Iris-virginica 7.9 4.9" 980 | ] 981 | } 982 | ], 983 | "prompt_number": 8 984 | }, 985 | { 986 | "cell_type": "markdown", 987 | "metadata": {}, 988 | "source": [ 989 | "### Blaze doesn't move the data back and forth, it moves your query back and forth\n", 990 | "\n", 991 | "Here we use the internal API to show the translated Blaze query that Blaze sends to the Hive database." 992 | ] 993 | }, 994 | { 995 | "cell_type": "code", 996 | "collapsed": false, 997 | "input": [ 998 | "from blaze import compute\n", 999 | "\n", 1000 | "query = by(d.iris.species, largest=d.iris.sepal_length.max(),\n", 1001 | " smallest=d.iris.sepal_length.min())\n", 1002 | "\n", 1003 | "print compute(query)" 1004 | ], 1005 | "language": "python", 1006 | "metadata": {}, 1007 | "outputs": [ 1008 | { 1009 | "output_type": "stream", 1010 | "stream": "stdout", 1011 | "text": [ 1012 | "SELECT iris.species, max(iris.sepal_length) AS largest, min(iris.sepal_length) AS smallest \n", 1013 | "FROM iris GROUP BY iris.species\n" 1014 | ] 1015 | } 1016 | ], 1017 | "prompt_number": 9 1018 | } 1019 | ], 1020 | "metadata": {} 1021 | } 1022 | ] 1023 | } -------------------------------------------------------------------------------- /02-into-Datatypes.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:8f94d037b8b415d618faa18cc783a5bdf6c8d4830a454dd9ee2686367f9d24b9" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "\"Blaze\n", 18 | "\n", 19 | "# Getting started with `into`\n", 20 | "\n", 21 | "* Full tutorial available at http://github.com/ContinuumIO/blaze-tutorial\n", 22 
| "* Install software with `conda install -c blaze blaze`" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## 2. Datatypes for Performance tuning\n", 30 | "\n", 31 | "* *Do we store these as integers or floats? `int32` or `int64`? `int8`?*\n", 32 | "* *Do we store times as datetimes or as strings?*\n", 33 | "* *Do we store these strings as variable length or fixed length*?\n", 34 | "* *Do we know how large this array will be?*\n", 35 | "\n", 36 | "As we encode values as bits we make choices; those choices can affect performance. We encode how to convert values to bits and back as a *datatype*. You've seen data types before in many forms including C types like `long`, `double` and `double[100]`, numpy dtypes like `i4` and `f8` or Python types like `int`, and `float`. Other systems like SQL, HDF5, etc. have similar datatype systems with different names.\n", 37 | "\n", 38 | "To manage datatypes across different systems we use `datashape` a datatype system that maps cleanly on to all systems with which `into` interacts. This one system can translate into any of the others.\n", 39 | "\n", 40 | "In this section we'll talk about the following\n", 41 | "\n", 42 | "1. How to discover the datatype of your data, no matter how it is stored\n", 43 | "2. Minor tweaks you can do to that datatype to improve performance in certain storage systems\n", 44 | "3. How to create new datasets easily" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## 2.1 DataShape and `discover`\n", 52 | "\n", 53 | "We introduce datashape, an all-encompassing datatype language, and `discover`, a function that does all of the work for you.\n", 54 | "\n", 55 | "The discover function returns the datashape of an object. Lets look at a few examples." 
56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "collapsed": false, 61 | "input": [ 62 | "from into import discover, into, resource\n", 63 | "discover(1)" 64 | ], 65 | "language": "python", 66 | "metadata": {}, 67 | "outputs": [ 68 | { 69 | "metadata": {}, 70 | "output_type": "pyout", 71 | "prompt_number": 1, 72 | "text": [ 73 | "ctype(\"int64\")" 74 | ] 75 | } 76 | ], 77 | "prompt_number": 1 78 | }, 79 | { 80 | "cell_type": "code", 81 | "collapsed": false, 82 | "input": [ 83 | "discover([1, 2, 3])" 84 | ], 85 | "language": "python", 86 | "metadata": {}, 87 | "outputs": [ 88 | { 89 | "metadata": {}, 90 | "output_type": "pyout", 91 | "prompt_number": 2, 92 | "text": [ 93 | "dshape(\"3 * int64\")" 94 | ] 95 | } 96 | ], 97 | "prompt_number": 2 98 | }, 99 | { 100 | "cell_type": "code", 101 | "collapsed": false, 102 | "input": [ 103 | "discover([[1, 2, 3],\n", 104 | " [4, 5, 6]])" 105 | ], 106 | "language": "python", 107 | "metadata": {}, 108 | "outputs": [ 109 | { 110 | "metadata": {}, 111 | "output_type": "pyout", 112 | "prompt_number": 3, 113 | "text": [ 114 | "dshape(\"2 * 3 * int64\")" 115 | ] 116 | } 117 | ], 118 | "prompt_number": 3 119 | }, 120 | { 121 | "cell_type": "code", 122 | "collapsed": false, 123 | "input": [ 124 | "discover([{'x': 1, 'y': 1.0},\n", 125 | " {'x': 2, 'y': 2.0},\n", 126 | " {'x': 3, 'y': 3.0}])" 127 | ], 128 | "language": "python", 129 | "metadata": {}, 130 | "outputs": [ 131 | { 132 | "metadata": {}, 133 | "output_type": "pyout", 134 | "prompt_number": 4, 135 | "text": [ 136 | "dshape(\"3 * {x: int64, y: float64}\")" 137 | ] 138 | } 139 | ], 140 | "prompt_number": 4 141 | }, 142 | { 143 | "cell_type": "code", 144 | "collapsed": false, 145 | "input": [ 146 | "import pandas as pd\n", 147 | "df = pd.DataFrame([['Alice', 100],\n", 148 | " ['Bob', 200],\n", 149 | " ['Charlie', 300]], columns=['name', 'balance'])\n", 150 | "discover(df)" 151 | ], 152 | "language": "python", 153 | "metadata": {}, 154 | "outputs": [ 155 | { 156 | "metadata": 
{}, 157 | "output_type": "pyout", 158 | "prompt_number": 5, 159 | "text": [ 160 | "dshape(\"3 * {name: string, balance: int64}\")" 161 | ] 162 | } 163 | ], 164 | "prompt_number": 5 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "By looking closely at these examples we see the structure of datashape. Elements have types like `int64` or `string`. Records/structs/groups of elements have record dtypes like `{x: int64, y: float64}`. Lengths of collections are encoded by numbers like `3 * ` for \"three of\" or `2 * 3 * ` for \"a two-by-three grid of\"." 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "#### Exercise\n", 178 | "\n", 179 | "Construct data with the following datashapes. Use any container type (e.g. `list`, `pd.DataFrame`, `np.ndarray`).\n", 180 | "\n", 181 | " 2 * int64\n", 182 | " 2 * string\n", 183 | " {name: string, id: int}\n", 184 | " datetime\n", 185 | " {name: string, id: int, payments: 2 * datetime}\n", 186 | " 2 * {name: string, id: int, payments: 2 * datetime}\n", 187 | " 5 * 5 * 5 * float32\n", 188 | " \n", 189 | "Use `discover` to verify your answers." 
190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "collapsed": false, 195 | "input": [ 196 | "# Should be 2 * int64\n", 197 | "discover([1, 2]) " 198 | ], 199 | "language": "python", 200 | "metadata": {}, 201 | "outputs": [ 202 | { 203 | "metadata": {}, 204 | "output_type": "pyout", 205 | "prompt_number": 6, 206 | "text": [ 207 | "dshape(\"2 * int64\")" 208 | ] 209 | } 210 | ], 211 | "prompt_number": 6 212 | }, 213 | { 214 | "cell_type": "code", 215 | "collapsed": false, 216 | "input": [], 217 | "language": "python", 218 | "metadata": {}, 219 | "outputs": [], 220 | "prompt_number": 6 221 | }, 222 | { 223 | "cell_type": "code", 224 | "collapsed": false, 225 | "input": [], 226 | "language": "python", 227 | "metadata": {}, 228 | "outputs": [], 229 | "prompt_number": 6 230 | }, 231 | { 232 | "cell_type": "code", 233 | "collapsed": false, 234 | "input": [], 235 | "language": "python", 236 | "metadata": {}, 237 | "outputs": [], 238 | "prompt_number": 6 239 | }, 240 | { 241 | "cell_type": "code", 242 | "collapsed": false, 243 | "input": [], 244 | "language": "python", 245 | "metadata": {}, 246 | "outputs": [], 247 | "prompt_number": 6 248 | }, 249 | { 250 | "cell_type": "code", 251 | "collapsed": false, 252 | "input": [], 253 | "language": "python", 254 | "metadata": {}, 255 | "outputs": [], 256 | "prompt_number": 6 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "### Discover doesn't care about storage format\n", 263 | "\n", 264 | "The `discover` function doesn't care if your data lives in a Python list, Pandas DataFrame, NumPy Array, CSV file, PySpark RDD, or SQL database. \n", 265 | "\n", 266 | "In other words, using `into` preserves datashape." 
267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "collapsed": false, 272 | "input": [ 273 | "discover(df)" 274 | ], 275 | "language": "python", 276 | "metadata": {}, 277 | "outputs": [ 278 | { 279 | "metadata": {}, 280 | "output_type": "pyout", 281 | "prompt_number": 7, 282 | "text": [ 283 | "dshape(\"3 * {name: string, balance: int64}\")" 284 | ] 285 | } 286 | ], 287 | "prompt_number": 7 288 | }, 289 | { 290 | "cell_type": "code", 291 | "collapsed": false, 292 | "input": [ 293 | "import numpy as np\n", 294 | "x = into(np.ndarray, df)\n", 295 | "discover(x) # different container, same datashape" 296 | ], 297 | "language": "python", 298 | "metadata": {}, 299 | "outputs": [ 300 | { 301 | "metadata": {}, 302 | "output_type": "pyout", 303 | "prompt_number": 8, 304 | "text": [ 305 | "dshape(\"3 * {name: string, balance: int64}\")" 306 | ] 307 | } 308 | ], 309 | "prompt_number": 8 310 | }, 311 | { 312 | "cell_type": "code", 313 | "collapsed": false, 314 | "input": [ 315 | "t = into('sqlite:///:memory:::mydf', df)\n", 316 | "discover(t) # different container, mostly the same datashape" 317 | ], 318 | "language": "python", 319 | "metadata": {}, 320 | "outputs": [ 321 | { 322 | "metadata": {}, 323 | "output_type": "pyout", 324 | "prompt_number": 9, 325 | "text": [ 326 | "dshape(\"var * {name: string, balance: int64}\")" 327 | ] 328 | } 329 | ], 330 | "prompt_number": 9 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "### Discover works on nested structures\n", 337 | "\n", 338 | "Call discover on a single table of our baseball statistics database." 
339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "collapsed": false, 344 | "input": [ 345 | "salaries = resource('sqlite:///data/lahman2013.sqlite::Salaries')\n", 346 | "discover(salaries)" 347 | ], 348 | "language": "python", 349 | "metadata": {}, 350 | "outputs": [ 351 | { 352 | "metadata": {}, 353 | "output_type": "pyout", 354 | "prompt_number": 10, 355 | "text": [ 356 | "dshape(\"\"\"var * {\n", 357 | " yearID: ?int32,\n", 358 | " teamID: ?string,\n", 359 | " lgID: ?string,\n", 360 | " playerID: ?string,\n", 361 | " salary: ?float64\n", 362 | " }\"\"\")" 363 | ] 364 | } 365 | ], 366 | "prompt_number": 10 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": {}, 371 | "source": [ 372 | "And then call it on the entire database" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "collapsed": false, 378 | "input": [ 379 | "db = resource('sqlite:///data/lahman2013.sqlite')\n", 380 | "discover(db)" 381 | ], 382 | "language": "python", 383 | "metadata": {}, 384 | "outputs": [ 385 | { 386 | "metadata": {}, 387 | "output_type": "pyout", 388 | "prompt_number": 11, 389 | "text": [ 390 | "dshape(\"\"\"{\n", 391 | " AllstarFull: var * {\n", 392 | " playerID: ?string,\n", 393 | " yearID: ?int32,\n", 394 | " gameNum: ?int32,\n", 395 | " gameID: ?string,\n", 396 | " teamID: ?string,\n", 397 | " lgID: ?string,\n", 398 | " GP: ?int32,\n", 399 | " startingPos: ?int32\n", 400 | " },\n", 401 | " Appearances: var * {\n", 402 | " yearID: ?int32,\n", 403 | " teamID: ?string,\n", 404 | " lgID: ?string,\n", 405 | " playerID: ?string,\n", 406 | " G_all: ?int32,\n", 407 | " GS: ?int32,\n", 408 | " G_batting: ?int32,\n", 409 | " G_defense: ?int32,\n", 410 | " G_p: ?int32,\n", 411 | " G_c: ?int32,\n", 412 | " G_1b: ?int32,\n", 413 | " G_2b: ?int32,\n", 414 | " G_3b: ?int32,\n", 415 | " G_ss: ?int32,\n", 416 | " G_lf: ?int32,\n", 417 | " G_cf: ?int32,\n", 418 | " G_rf: ?int32,\n", 419 | " G_of: ?int32,\n", 420 | " G_dh: ?int32,\n", 421 | " G_ph: ?int32,\n", 422 | 
" G_pr: ?int32\n", 423 | " },\n", 424 | " AwardsManagers: var * {\n", 425 | " playerID: ?string,\n", 426 | " awardID: ?string,\n", 427 | " yearID: ?int32,\n", 428 | " lgID: ?string,\n", 429 | " tie: ?string,\n", 430 | " notes: ?string\n", 431 | " },\n", 432 | " AwardsPlayers: var * {\n", 433 | " playerID: ?string,\n", 434 | " awardID: ?string,\n", 435 | " yearID: ?int32,\n", 436 | " lgID: ?string,\n", 437 | " tie: ?string,\n", 438 | " notes: ?string\n", 439 | " },\n", 440 | " AwardsShareManagers: var * {\n", 441 | " awardID: ?string,\n", 442 | " yearID: ?int32,\n", 443 | " lgID: ?string,\n", 444 | " playerID: ?string,\n", 445 | " pointsWon: ?int32,\n", 446 | " pointsMax: ?int32,\n", 447 | " votesFirst: ?int32\n", 448 | " },\n", 449 | " AwardsSharePlayers: var * {\n", 450 | " awardID: ?string,\n", 451 | " yearID: ?int32,\n", 452 | " lgID: ?string,\n", 453 | " playerID: ?string,\n", 454 | " pointsWon: ?float64,\n", 455 | " pointsMax: ?int32,\n", 456 | " votesFirst: ?float64\n", 457 | " },\n", 458 | " Batting: var * {\n", 459 | " playerID: ?string,\n", 460 | " yearID: ?int32,\n", 461 | " stint: ?int32,\n", 462 | " teamID: ?string,\n", 463 | " lgID: ?string,\n", 464 | " G: ?int32,\n", 465 | " G_batting: ?int32,\n", 466 | " AB: ?int32,\n", 467 | " R: ?int32,\n", 468 | " H: ?int32,\n", 469 | " 2B: ?int32,\n", 470 | " 3B: ?int32,\n", 471 | " HR: ?int32,\n", 472 | " RBI: ?int32,\n", 473 | " SB: ?int32,\n", 474 | " CS: ?int32,\n", 475 | " BB: ?int32,\n", 476 | " SO: ?int32,\n", 477 | " IBB: ?int32,\n", 478 | " HBP: ?int32,\n", 479 | " SH: ?int32,\n", 480 | " SF: ?int32,\n", 481 | " GIDP: ?int32,\n", 482 | " G_old: ?int32\n", 483 | " },\n", 484 | " BattingPost: var * {\n", 485 | " yearID: ?int32,\n", 486 | " round: ?string,\n", 487 | " playerID: ?string,\n", 488 | " teamID: ?string,\n", 489 | " lgID: ?string,\n", 490 | " G: ?int32,\n", 491 | " AB: ?int32,\n", 492 | " R: ?int32,\n", 493 | " H: ?int32,\n", 494 | " 2B: ?int32,\n", 495 | " 3B: ?int32,\n", 496 | " HR: ?int32,\n", 
497 | " RBI: ?int32,\n", 498 | " SB: ?int32,\n", 499 | " CS: ?int32,\n", 500 | " BB: ?int32,\n", 501 | " SO: ?int32,\n", 502 | " IBB: ?int32,\n", 503 | " HBP: ?int32,\n", 504 | " SH: ?int32,\n", 505 | " SF: ?int32,\n", 506 | " GIDP: ?int32\n", 507 | " },\n", 508 | " Fielding: var * {\n", 509 | " playerID: ?string,\n", 510 | " yearID: ?int32,\n", 511 | " stint: ?int32,\n", 512 | " teamID: ?string,\n", 513 | " lgID: ?string,\n", 514 | " POS: ?string,\n", 515 | " G: ?int32,\n", 516 | " GS: ?int32,\n", 517 | " InnOuts: ?int32,\n", 518 | " PO: ?int32,\n", 519 | " A: ?int32,\n", 520 | " E: ?int32,\n", 521 | " DP: ?int32,\n", 522 | " PB: ?int32,\n", 523 | " WP: ?int32,\n", 524 | " SB: ?int32,\n", 525 | " CS: ?int32,\n", 526 | " ZR: ?float64\n", 527 | " },\n", 528 | " FieldingOF: var * {\n", 529 | " playerID: ?string,\n", 530 | " yearID: ?int32,\n", 531 | " stint: ?int32,\n", 532 | " Glf: ?int32,\n", 533 | " Gcf: ?int32,\n", 534 | " Grf: ?int32\n", 535 | " },\n", 536 | " FieldingPost: var * {\n", 537 | " playerID: ?string,\n", 538 | " yearID: ?int32,\n", 539 | " teamID: ?string,\n", 540 | " lgID: ?string,\n", 541 | " round: ?string,\n", 542 | " POS: ?string,\n", 543 | " G: ?int32,\n", 544 | " GS: ?int32,\n", 545 | " InnOuts: ?int32,\n", 546 | " PO: ?int32,\n", 547 | " A: ?int32,\n", 548 | " E: ?int32,\n", 549 | " DP: ?int32,\n", 550 | " TP: ?int32,\n", 551 | " PB: ?int32,\n", 552 | " SB: ?int32,\n", 553 | " CS: ?int32\n", 554 | " },\n", 555 | " HallOfFame: var * {\n", 556 | " playerID: ?string,\n", 557 | " yearid: ?int32,\n", 558 | " votedBy: ?string,\n", 559 | " ballots: ?int32,\n", 560 | " needed: ?int32,\n", 561 | " votes: ?int32,\n", 562 | " inducted: ?string,\n", 563 | " category: ?string,\n", 564 | " needed_note: ?string\n", 565 | " },\n", 566 | " Managers: var * {\n", 567 | " playerID: ?string,\n", 568 | " yearID: ?int32,\n", 569 | " teamID: ?string,\n", 570 | " lgID: ?string,\n", 571 | " inseason: ?int32,\n", 572 | " G: ?int32,\n", 573 | " W: ?int32,\n", 574 | " L: 
?int32,\n", 575 | " rank: ?int32,\n", 576 | " plyrMgr: ?string\n", 577 | " },\n", 578 | " ManagersHalf: var * {\n", 579 | " playerID: ?string,\n", 580 | " yearID: ?int32,\n", 581 | " teamID: ?string,\n", 582 | " lgID: ?string,\n", 583 | " inseason: ?int32,\n", 584 | " half: ?int32,\n", 585 | " G: ?int32,\n", 586 | " W: ?int32,\n", 587 | " L: ?int32,\n", 588 | " rank: ?int32\n", 589 | " },\n", 590 | " Master: var * {\n", 591 | " playerID: ?string,\n", 592 | " birthYear: ?int32,\n", 593 | " birthMonth: ?int32,\n", 594 | " birthDay: ?int32,\n", 595 | " birthCountry: ?string,\n", 596 | " birthState: ?string,\n", 597 | " birthCity: ?string,\n", 598 | " deathYear: ?int32,\n", 599 | " deathMonth: ?int32,\n", 600 | " deathDay: ?int32,\n", 601 | " deathCountry: ?string,\n", 602 | " deathState: ?string,\n", 603 | " deathCity: ?string,\n", 604 | " nameFirst: ?string,\n", 605 | " nameLast: ?string,\n", 606 | " nameGiven: ?string,\n", 607 | " weight: ?int32,\n", 608 | " height: ?float64,\n", 609 | " bats: ?string,\n", 610 | " throws: ?string,\n", 611 | " debut: ?float64,\n", 612 | " finalGame: ?float64,\n", 613 | " retroID: ?string,\n", 614 | " bbrefID: ?string\n", 615 | " },\n", 616 | " Pitching: var * {\n", 617 | " playerID: ?string,\n", 618 | " yearID: ?int32,\n", 619 | " stint: ?int32,\n", 620 | " teamID: ?string,\n", 621 | " lgID: ?string,\n", 622 | " W: ?int32,\n", 623 | " L: ?int32,\n", 624 | " G: ?int32,\n", 625 | " GS: ?int32,\n", 626 | " CG: ?int32,\n", 627 | " SHO: ?int32,\n", 628 | " SV: ?int32,\n", 629 | " IPouts: ?int32,\n", 630 | " H: ?int32,\n", 631 | " ER: ?int32,\n", 632 | " HR: ?int32,\n", 633 | " BB: ?int32,\n", 634 | " SO: ?int32,\n", 635 | " BAOpp: ?float64,\n", 636 | " ERA: ?float64,\n", 637 | " IBB: ?int32,\n", 638 | " WP: ?int32,\n", 639 | " HBP: ?int32,\n", 640 | " BK: ?int32,\n", 641 | " BFP: ?int32,\n", 642 | " GF: ?int32,\n", 643 | " R: ?int32,\n", 644 | " SH: ?int32,\n", 645 | " SF: ?int32,\n", 646 | " GIDP: ?int32\n", 647 | " },\n", 648 | " 
PitchingPost: var * {\n", 649 | " playerID: ?string,\n", 650 | " yearID: ?int32,\n", 651 | " round: ?string,\n", 652 | " teamID: ?string,\n", 653 | " lgID: ?string,\n", 654 | " W: ?int32,\n", 655 | " L: ?int32,\n", 656 | " G: ?int32,\n", 657 | " GS: ?int32,\n", 658 | " CG: ?int32,\n", 659 | " SHO: ?int32,\n", 660 | " SV: ?int32,\n", 661 | " IPouts: ?int32,\n", 662 | " H: ?int32,\n", 663 | " ER: ?int32,\n", 664 | " HR: ?int32,\n", 665 | " BB: ?int32,\n", 666 | " SO: ?int32,\n", 667 | " BAOpp: ?float64,\n", 668 | " ERA: ?float64,\n", 669 | " IBB: ?int32,\n", 670 | " WP: ?int32,\n", 671 | " HBP: ?int32,\n", 672 | " BK: ?int32,\n", 673 | " BFP: ?int32,\n", 674 | " GF: ?int32,\n", 675 | " R: ?int32,\n", 676 | " SH: ?int32,\n", 677 | " SF: ?int32,\n", 678 | " GIDP: ?int32\n", 679 | " },\n", 680 | " Salaries: var * {\n", 681 | " yearID: ?int32,\n", 682 | " teamID: ?string,\n", 683 | " lgID: ?string,\n", 684 | " playerID: ?string,\n", 685 | " salary: ?float64\n", 686 | " },\n", 687 | " Schools: var * {\n", 688 | " schoolID: ?string,\n", 689 | " schoolName: ?string,\n", 690 | " schoolCity: ?string,\n", 691 | " schoolState: ?string,\n", 692 | " schoolNick: ?string\n", 693 | " },\n", 694 | " SchoolsPlayers: var * {\n", 695 | " playerID: ?string,\n", 696 | " schoolID: ?string,\n", 697 | " yearMin: ?int32,\n", 698 | " yearMax: ?int32\n", 699 | " },\n", 700 | " SeriesPost: var * {\n", 701 | " yearID: ?int32,\n", 702 | " round: ?string,\n", 703 | " teamIDwinner: ?string,\n", 704 | " lgIDwinner: ?string,\n", 705 | " teamIDloser: ?string,\n", 706 | " lgIDloser: ?string,\n", 707 | " wins: ?int32,\n", 708 | " losses: ?int32,\n", 709 | " ties: ?int32\n", 710 | " },\n", 711 | " Teams: var * {\n", 712 | " yearID: ?int32,\n", 713 | " lgID: ?string,\n", 714 | " teamID: ?string,\n", 715 | " franchID: ?string,\n", 716 | " divID: ?string,\n", 717 | " Rank: ?int32,\n", 718 | " G: ?int32,\n", 719 | " Ghome: ?int32,\n", 720 | " W: ?int32,\n", 721 | " L: ?int32,\n", 722 | " DivWin: ?string,\n", 
723 | " WCWin: ?string,\n", 724 | " LgWin: ?string,\n", 725 | " WSWin: ?string,\n", 726 | " R: ?int32,\n", 727 | " AB: ?int32,\n", 728 | " H: ?int32,\n", 729 | " 2B: ?int32,\n", 730 | " 3B: ?int32,\n", 731 | " HR: ?int32,\n", 732 | " BB: ?int32,\n", 733 | " SO: ?int32,\n", 734 | " SB: ?int32,\n", 735 | " CS: ?int32,\n", 736 | " HBP: ?int32,\n", 737 | " SF: ?int32,\n", 738 | " RA: ?int32,\n", 739 | " ER: ?int32,\n", 740 | " ERA: ?float64,\n", 741 | " CG: ?int32,\n", 742 | " SHO: ?int32,\n", 743 | " SV: ?int32,\n", 744 | " IPouts: ?int32,\n", 745 | " HA: ?int32,\n", 746 | " HRA: ?int32,\n", 747 | " BBA: ?int32,\n", 748 | " SOA: ?int32,\n", 749 | " E: ?int32,\n", 750 | " DP: ?int32,\n", 751 | " FP: ?float64,\n", 752 | " name: ?string,\n", 753 | " park: ?string,\n", 754 | " attendance: ?int32,\n", 755 | " BPF: ?int32,\n", 756 | " PPF: ?int32,\n", 757 | " teamIDBR: ?string,\n", 758 | " teamIDlahman45: ?string,\n", 759 | " teamIDretro: ?string\n", 760 | " },\n", 761 | " TeamsFranchises: var * {\n", 762 | " franchID: ?string,\n", 763 | " franchName: ?string,\n", 764 | " active: ?string,\n", 765 | " NAassoc: ?string\n", 766 | " },\n", 767 | " TeamsHalf: var * {\n", 768 | " yearID: ?int32,\n", 769 | " lgID: ?string,\n", 770 | " teamID: ?string,\n", 771 | " Half: ?string,\n", 772 | " divID: ?string,\n", 773 | " DivWin: ?string,\n", 774 | " Rank: ?int32,\n", 775 | " G: ?int32,\n", 776 | " W: ?int32,\n", 777 | " L: ?int32\n", 778 | " },\n", 779 | " temp: var * {ID: ?int32, namefull: ?string, born: ?float64}\n", 780 | " }\"\"\")" 781 | ] 782 | } 783 | ], 784 | "prompt_number": 11 785 | }, 786 | { 787 | "cell_type": "markdown", 788 | "metadata": {}, 789 | "source": [ 790 | "## 2.2 Tweaking Datashapes for Performance\n", 791 | "\n", 792 | "Some storage systems don't cleanly support some datashapes. For example\n", 793 | "\n", 794 | "1. SQL doesn't support nested data\n", 795 | "2. HDF5 doesn't cleanly support datetimes\n", 796 | "3. 
Variable length strings are often costly in binary stores\n", 797 | "4. NumPy has poor missing value support\n", 798 | "\n", 799 | "Because of this we sometimes want to slightly change a datashape during migration." 800 | ] 801 | }, 802 | { 803 | "cell_type": "markdown", 804 | "metadata": {}, 805 | "source": [ 806 | "### Fixed Length Strings\n", 807 | "\n", 808 | "A common example is the use of strings with a known maximum length, called \"fixed length strings,\" and strings with particular character encodings, like ASCII or UTF-8.\n", 809 | "\n", 810 | "Consider moving the following data, including strings, into a numpy array." 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "collapsed": false, 816 | "input": [ 817 | "data = [{'name': 'Alice', 'balance': 100},\n", 818 | " {'name': 'Bob', 'balance': 200}]\n", 819 | "into(np.ndarray, data)" 820 | ], 821 | "language": "python", 822 | "metadata": {}, 823 | "outputs": [ 824 | { 825 | "metadata": {}, 826 | "output_type": "pyout", 827 | "prompt_number": 12, 828 | "text": [ 829 | "array([(100, 'Alice'), (200, 'Bob')], \n", 830 | " dtype=[('balance', '\n", 1005 | "\n", 1006 | " \n", 1007 | " \n", 1008 | " \n", 1009 | " \n", 1010 | " \n", 1011 | " \n", 1012 | " \n", 1013 | " \n", 1014 | " \n", 1015 | " \n", 1016 | " \n", 1017 | " \n", 1018 | " \n", 1019 | " \n", 1020 | " \n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " \n", 1025 | " \n", 1026 | " \n", 1027 | " \n", 1028 | " \n", 1029 | " \n", 1030 | "
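The cost difference is easy to see in NumPy directly. A brief sketch, independent of `into`: variable-length Python strings force an `object` array holding pointers, while a fixed-length dtype such as `'S5'` packs the bytes inline, which is more compact and faster to scan:

```python
import numpy as np

# Variable-length strings: NumPy stores pointers to Python string objects.
varlen = np.array(['Alice', 'Bob'], dtype=object)

# Fixed-length bytestrings: exactly five bytes stored inline per element,
# so the data sits contiguously in memory.
fixed = np.array([b'Alice', b'Bob'], dtype='S5')
```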
01
0 33.1 -89.2
1 37.0-141.5
2 41.0-120.5
\n", 1031 | "" 1032 | ], 1033 | "metadata": {}, 1034 | "output_type": "pyout", 1035 | "prompt_number": 17, 1036 | "text": [ 1037 | " 0 1\n", 1038 | "0 33.1 -89.2\n", 1039 | "1 37.0 -141.5\n", 1040 | "2 41.0 -120.5" 1041 | ] 1042 | } 1043 | ], 1044 | "prompt_number": 17 1045 | }, 1046 | { 1047 | "cell_type": "markdown", 1048 | "metadata": {}, 1049 | "source": [ 1050 | "Note that, because our original data was stored in a format that didn't include the column names, the output lacks them as well. We complement our data with a datashape." 1051 | ] 1052 | }, 1053 | { 1054 | "cell_type": "code", 1055 | "collapsed": false, 1056 | "input": [ 1057 | "ds = dshape('var * {lat: float64, long: float64}')\n", 1058 | "into(pd.DataFrame, data, dshape=ds)" 1059 | ], 1060 | "language": "python", 1061 | "metadata": {}, 1062 | "outputs": [ 1063 | { 1064 | "html": [ 1065 | "
\n", 1066 | "\n", 1067 | " \n", 1068 | " \n", 1069 | " \n", 1070 | " \n", 1071 | " \n", 1072 | " \n", 1073 | " \n", 1074 | " \n", 1075 | " \n", 1076 | " \n", 1077 | " \n", 1078 | " \n", 1079 | " \n", 1080 | " \n", 1081 | " \n", 1082 | " \n", 1083 | " \n", 1084 | " \n", 1085 | " \n", 1086 | " \n", 1087 | " \n", 1088 | " \n", 1089 | " \n", 1090 | " \n", 1091 | "
latlong
0 33.1 -89.2
1 37.0-141.5
2 41.0-120.5
\n", 1092 | "
" 1093 | ], 1094 | "metadata": {}, 1095 | "output_type": "pyout", 1096 | "prompt_number": 18, 1097 | "text": [ 1098 | " lat long\n", 1099 | "0 33.1 -89.2\n", 1100 | "1 37.0 -141.5\n", 1101 | "2 41.0 -120.5" 1102 | ] 1103 | } 1104 | ], 1105 | "prompt_number": 18 1106 | }, 1107 | { 1108 | "cell_type": "markdown", 1109 | "metadata": {}, 1110 | "source": [ 1111 | "## Create new datasets with `resource` and datashape\n", 1112 | "\n", 1113 | "In the last section we used `resource` to acquire existing datasets from string URIs. \n", 1114 | "\n", 1115 | "We also use the `resource` function to create new datasets given a URI and a datashape." 1116 | ] 1117 | }, 1118 | { 1119 | "cell_type": "markdown", 1120 | "metadata": {}, 1121 | "source": [ 1122 | "#### Example\n", 1123 | "\n", 1124 | "We create a new HDF5 dataset to store 100 by 100 integers." 1125 | ] 1126 | }, 1127 | { 1128 | "cell_type": "code", 1129 | "collapsed": false, 1130 | "input": [ 1131 | "ds = dshape('100 * 100 * int64')\n", 1132 | "dset = resource('myfile.hdf5::/x', dshape=ds)\n", 1133 | "dset" 1134 | ], 1135 | "language": "python", 1136 | "metadata": {}, 1137 | "outputs": [ 1138 | { 1139 | "metadata": {}, 1140 | "output_type": "pyout", 1141 | "prompt_number": 19, 1142 | "text": [ 1143 | "" 1144 | ] 1145 | } 1146 | ], 1147 | "prompt_number": 19 1148 | }, 1149 | { 1150 | "cell_type": "markdown", 1151 | "metadata": {}, 1152 | "source": [ 1153 | "#### Exercise\n", 1154 | "\n", 1155 | "Create a new SQLite table named `transactions` in `'data/my.db'` with the following datashape\n", 1156 | "\n", 1157 | " var * {name: string, balance: int, timestamp: datetime}\n", 1158 | " \n", 1159 | "Here `var` stands for \"variable length\" or generally \"a dimension to which we don't know a fixed size.\"" 1160 | ] 1161 | }, 1162 | { 1163 | "cell_type": "code", 1164 | "collapsed": false, 1165 | "input": [], 1166 | "language": "python", 1167 | "metadata": {}, 1168 | "outputs": [], 1169 | "prompt_number": 19 1170 | }, 1171 | { 
1172 | "cell_type": "code", 1173 | "collapsed": false, 1174 | "input": [], 1175 | "language": "python", 1176 | "metadata": {}, 1177 | "outputs": [], 1178 | "prompt_number": 19 1179 | }, 1180 | { 1181 | "cell_type": "markdown", 1182 | "metadata": {}, 1183 | "source": [ 1184 | "Note that you could also have built this table using raw SQLAlchemy code. Knowing datashape lets you skip learning many libraries like SQLAlchemy and H5Py for simple tasks. `into` serves as a single interface over many useful libraries." 1185 | ] 1186 | }, 1187 | { 1188 | "cell_type": "code", 1189 | "collapsed": false, 1190 | "input": [ 1191 | "import sqlalchemy as sa\n", 1192 | "\n", 1193 | "engine = sa.create_engine('sqlite:///data/my.db')\n", 1194 | "metadata = sa.MetaData(engine)\n", 1195 | "transactions = sa.Table('transactions2', metadata,\n", 1196 | " sa.Column('name', sa.String),\n", 1197 | " sa.Column('balance', sa.Integer),\n", 1198 | " sa.Column('timestamp', sa.DateTime))\n", 1199 | "transactions.create()" 1200 | ], 1201 | "language": "python", 1202 | "metadata": {}, 1203 | "outputs": [], 1204 | "prompt_number": 20 1205 | }, 1206 | { 1207 | "cell_type": "code", 1208 | "collapsed": false, 1209 | "input": [], 1210 | "language": "python", 1211 | "metadata": {}, 1212 | "outputs": [], 1213 | "prompt_number": 20 1214 | } 1215 | ], 1216 | "metadata": {} 1217 | } 1218 | ] 1219 | } --------------------------------------------------------------------------------