4 |
5 |
--------------------------------------------------------------------------------
/.idea/vcs.xml:
--------------------------------------------------------------------------------
1 | <?xml version="1.0" encoding="UTF-8"?>
2 | <project version="4">
3 |   <component name="VcsDirectoryMappings">
4 |     <mapping directory="" vcs="Git" />
5 |   </component>
6 | </project>
7 | 
8 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2014, Martin Grund
2 | All rights reserved.
3 |
4 | Redistribution and use in source and binary forms, with or without
5 | modification, are permitted provided that the following conditions are met:
6 |
7 | * Redistributions of source code must retain the above copyright notice, this
8 | list of conditions and the following disclaimer.
9 |
10 | * Redistributions in binary form must reproduce the above copyright notice,
11 | this list of conditions and the following disclaimer in the documentation
12 | and/or other materials provided with the distribution.
13 |
14 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
15 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
16 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
17 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
18 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
19 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
20 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
21 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
22 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
23 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
24 |
--------------------------------------------------------------------------------
/MANIFEST:
--------------------------------------------------------------------------------
1 | # file GENERATED by distutils, do NOT edit
2 | LICENSE
3 | README.md
4 | setup.py
5 | pyxplorer/__init__.py
6 | pyxplorer/helper.py
7 | pyxplorer/loader.py
8 | pyxplorer/manager.py
9 | pyxplorer/types.py
10 |
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include *.md
2 | include LICENSE
3 |
4 |
--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
1 | pyxplorer -- Easy Interactive Data Profiling for Big Data (and Small Data)
2 | --------------------------------------------------------------------------
3 |
4 | The goal of pyxplorer is to provide a simple tool that allows interactive
5 | profiling of datasets that are accessible via a SQL-like interface. The only
6 | requirement for running data profiling is that you are able to provide a Python
7 | DBAPI-like interface to your data source and that the data source is able to
8 | understand simplistic SQL queries.
9 |
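The DBAPI surface pyxplorer relies on is small: a connection exposing ``cursor()``, and cursors exposing ``execute()`` and ``fetchall()``. The sketch below uses the standard library's ``sqlite3`` purely to illustrate that interface shape; SQLite is not a tested backend, and pyxplorer itself issues engine-specific statements such as ``show tables``:

```python
import sqlite3

# Minimal DBAPI surface pyxplorer expects from a connection object:
# connection.cursor(), cursor.execute(), cursor.fetchall().
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
cur.execute("SELECT count(*) FROM t")
rows = cur.fetchall()
print(rows)  # [(3,)]
```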
10 | I built this piece of software while trying to get a better understanding of
11 | the data distribution in a massive dataset of several hundred million records.
12 | Depending on the size of the dataset and the query engine the response time
13 | can range from seconds (Impala) to minutes (Hive) or even hours (MySQL).
14 |
15 | The typical use case is to use ``pyxplorer`` interactively from an iPython
16 | Notebook or iPython shell to incrementally extract information about your data.
17 |
18 | Usage
19 | ------
20 |
21 | Imagine that you are provided with access to a huge Hive/Impala database on
22 | your very own Hadoop cluster and you're asked to profile the data to get a
23 | better understanding before performing more specific data science later on.::
24 |
25 | import pyxplorer as pxp
26 | from impala.dbapi import connect
27 | conn = connect(host='impala_server', port=21050)
28 |
29 | db = pxp.Database("default", conn)
30 | db.tables()
31 |
32 | This simple code gives you access to all the tables in this database. So let's
33 | assume the result shows a ``sales_orders`` table, what can we do now?::
34 |
35 | orders = db["sales_orders"]
36 | orders.size() # 100M
37 | orders.columns() # [ol_w_id, ol_d_id, ol_o_id, ol_number, ol_i_id, ...]
38 |
39 | Ok, if we have so many columns, what can we find out about a single column?::
40 |
41 | orders.ol_d_id.min() # 1
42 | orders.ol_d_id.max() # 9999
43 | orders.ol_d_id.dcount() # 1000
44 |
45 | And like this there are some more key-figures about the data like uniqueness,
46 | constancy, most and least frequent values and distribution.
47 |
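These key figures are simple ratios. As a self-contained illustration (computed locally on a toy column rather than through pyxplorer's SQL aggregates), uniqueness and constancy are:

```python
from collections import Counter

values = [1, 1, 2, 3, 3, 3, 4]  # a toy column
counts = Counter(values)
rows = float(len(values))

uniqueness = len(counts) / rows                 # distinct values / rows
constancy = counts.most_common(1)[0][1] / rows  # most frequent count / rows

print(uniqueness, constancy)
```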
48 | In some cases, where it makes sense, the output of a method call will not be a
49 | simple array or list but directly a Pandas dataframe to facilitate plotting
50 | and further analysis.
51 |
52 | You will find an easier-to-digest tutorial here:
53 |
54 | * http://nbviewer.ipython.org/github/grundprinzip/pyxplorer/blob/master/pyxplorer_stuff.ipynb
55 |
56 |
57 | Supported Features
58 | -------------------
59 |
60 | * Column Count (Database / Table)
61 | * Table Count
62 | * Tuple Count (Database / Table)
63 | * Min / Max
64 | * Most Frequent / Least Frequent
65 | * Top-K Most Frequent / Top-K Least Frequent
66 | * Top-K Value Distribution (Database / Table )
67 | * Uniqueness
68 | * Constancy
69 | * Distinct Value Count
70 |
71 |
72 | Supported Platforms
73 | --------------------
74 |
75 | The following platforms are typically tested while using ``pyxplorer``:
76 |
77 | * Hive
78 | * Impala
79 | * MySQL
80 |
81 |
82 | Dependencies
83 | -------------
84 |
85 | * pandas
86 | * pyhs2 for Hive-based loading of data sets
87 | * pympala for connecting to Impala
88 | * snakebite for loading data from HDFS to Hive
89 |
--------------------------------------------------------------------------------
/dependencies.txt:
--------------------------------------------------------------------------------
1 | pympala
2 | snakebite
3 | pandas
4 | pyhs2
5 |
--------------------------------------------------------------------------------
/pyxplorer/__init__.py:
--------------------------------------------------------------------------------
1 |
2 | # The database helper
3 | from manager import Database
4 |
5 | # Relevant for getting Table information
6 | from types import Column, Table
7 |
8 | # The HDFS Loader
9 | from loader import Loader
10 |
--------------------------------------------------------------------------------
/pyxplorer/helper.py:
--------------------------------------------------------------------------------
1 | import functools
2 | from StringIO import StringIO
3 |
4 |
5 | def car(data):
6 | return [x[0] for x in data]
7 |
8 |
9 | def render_table(head, rows, limit=10):
10 |     buf = StringIO()
11 |     buf.write("<table><tr>")
12 |     for h in head:
13 |         buf.write("<th>{0}</th>".format(h))
14 |     buf.write("</tr>")
15 | 
16 |     # Build the slices we need
17 |     if limit is None or len(rows) <= limit:
18 |         data = rows
19 |         footer = None
20 |     else:
21 |         data = rows[:limit - 1]
22 |         footer = rows[-1:]
23 | 
24 |     for r in data:
25 |         buf.write("<tr>")
26 |         for c in r:
27 |             buf.write("<td>{0}</td>".format(c))
28 |         buf.write("</tr>")
29 | 
30 |     if footer:
31 |         for r in footer:
32 |             buf.write("<tr>")
33 |             for c in r:
34 |                 buf.write("<td>{0}</td>".format(c))
35 |             buf.write("</tr>")
36 |     buf.write("</table>")
37 |     buf.write("<p>Rows: %d / Columns: %d</p>" % (len(rows), len(head)))
38 |     return buf.getvalue()
39 |
40 |
41 | def memoize(obj):
42 | cache = obj.cache = {}
43 |
44 | @functools.wraps(obj)
45 | def memoizer(*args, **kwargs):
46 | key = str(args) + str(kwargs)
47 | if key not in cache:
48 | cache[key] = obj(*args, **kwargs)
49 | return cache[key]
50 |
51 |
52 | return memoizer
53 |
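A quick illustration of how the memoize decorator above behaves, restated in a self-contained form so it can run on its own: repeated calls with the same arguments are served from the cache instead of re-invoking the wrapped function.

```python
import functools

def memoize(obj):
    # Same caching scheme as above: results keyed by stringified arguments.
    cache = obj.cache = {}

    @functools.wraps(obj)
    def memoizer(*args, **kwargs):
        key = str(args) + str(kwargs)
        if key not in cache:
            cache[key] = obj(*args, **kwargs)
        return cache[key]

    return memoizer

calls = []

@memoize
def square(x):
    calls.append(x)  # record every real invocation
    return x * x

print(square(4), square(4), len(calls))  # 16 16 1 -- second call is cached
```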
--------------------------------------------------------------------------------
/pyxplorer/loader.py:
--------------------------------------------------------------------------------
1 | __author__ = 'grund'
2 | import re
3 |
4 | from snakebite.client import Client
5 | import pyhs2
6 |
7 |
8 | class Loader:
9 | """
10 | The idea of the loader is to provide a convenient interface to create a new table
11 | based on some input files
12 | """
13 |
14 | def __init__(self, path, name_node, hive_server,
15 | user="root", hive_db="default", password=None, nn_port=8020, hive_port=10000):
16 |
17 | # HDFS Connection
18 | self._client = Client(name_node, nn_port)
19 |
20 | self._db = hive_db
21 |
22 | # Hive Connection
23 | self._hive = pyhs2.connect(host=hive_server,
24 | port=hive_port,
25 | authMechanism="PLAIN",
26 | database=hive_db,
27 | user=user,
28 | password=password)
29 | self._path = path
30 |
31 |
32 | def load(self):
33 | # Check data to see which kind it is
34 | files = self._client.ls([self._path])
35 |
36 | files = [f for f in files if f['file_type'] == 'f']
37 | if len(files) == 0:
38 | raise Exception("Cannot load empty directory")
39 |
40 | # Pick the first file and assume that it has the same content as the others
41 | data = self.head(files[0]['path'])
42 | res = self.check_separator(data)
43 |         if res is None:
44 |             # We cannot load the data and better abort here
45 |             print("cannot load data: no separator found")
46 |             return
47 |
48 | sep = res[0]
49 | num_cols = res[1]
50 |
51 | # Build table statement
52 | table_statement, table_name = self._create_table(self._path, sep, num_cols)
53 | cursor = self._hive.cursor()
54 | cursor.execute(table_statement)
55 |
56 | return self._db, table_name
57 |
58 |
59 | def _create_table(self, path, sep, count):
60 | buf = """CREATE EXTERNAL TABLE pyxplorer_data (
61 | %s
62 | )ROW FORMAT DELIMITED FIELDS TERMINATED BY '%s'
63 | STORED AS TEXTFILE LOCATION '%s'
64 | """ % (",".join(["col_%d string" % x for x in range(count)]), sep, path)
65 | return buf, "pyxplorer_data"
66 |
67 | def check_separator(self, data):
68 | """
69 |         This method evaluates a list of separators on the input data to check which one
70 | is correct. This is done by first splitting the input by newline and then
71 | checking if the split by separator is equal for each input row except the last
72 | that might be incomplete due to the limited input data
73 |
74 | :param data: input data to check
75 | :return:
76 | """
77 |
78 |         sep_list = [r'\t', r';', r',', r'\|', r'\s+']
79 |
80 | data_copy = data
81 | for sep in sep_list:
82 | # Check if the count matches each line
83 | splitted = data_copy.split("\n")
84 | parts = [len(re.split(sep, line)) for line in splitted]
85 |
86 | # If we did not split anything continue
87 | if sum(parts) == len(splitted):
88 | continue
89 |
90 | diff = 0
91 |
92 | for i in range(len(parts[1:-1])):
93 | diff += abs(parts[i] - parts[i + 1])
94 |
95 | if diff == 0:
96 | return sep, parts[0]
97 |
98 | # If we reach this point we did not find a separator
99 | return None
100 |
101 |
102 | def head(self, file_path):
103 | """
104 |         Only read the first packets that arrive, trying to max out at about 1 MB
105 | 
106 |         :return: up to 1 MB of data from the first block of the file
107 | """
108 | processor = lambda path, node, tail_only=True, append=False: self._handle_head(
109 | path, node)
110 |
111 | # Find items and go
112 | for item in self._client._find_items([file_path], processor,
113 | include_toplevel=True,
114 | include_children=False, recurse=False):
115 | if item:
116 | return item
117 |
118 | def _handle_head(self, path, node, upper=1024 * 1024):
119 | data = ''
120 | for load in self._client._read_file(path, node, tail_only=False,
121 | check_crc=False):
122 | data += load
123 |             if len(data) > upper:
124 | return data
125 |
126 | return data
127 |
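The separator detection in `check_separator` can be illustrated with a small self-contained sketch (a restatement of the idea, not the class method itself): each candidate separator is applied to every line, and the one producing a constant field count across all complete lines wins; the last line is ignored because it may be truncated mid-record.

```python
import re

def detect_separator(sample, candidates=(r'\t', r';', r',', r'\|', r'\s+')):
    lines = sample.split("\n")
    for sep in candidates:
        counts = [len(re.split(sep, line)) for line in lines]
        if sum(counts) == len(lines):
            continue  # this separator never split anything
        # Ignore the last line, which may be truncated mid-record.
        if len(set(counts[:-1])) == 1:
            return sep, counts[0]
    return None  # no consistent separator found

print(detect_separator("a;b;c\n1;2;3\n4;5"))  # (';', 3)
```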
--------------------------------------------------------------------------------
/pyxplorer/manager.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 | import sys
3 |
4 | import pandas as pd
5 |
6 | import types as t
7 | import helper as h
8 |
9 |
10 | class Database:
11 | def __init__(self, db, conn):
12 | self.db = db
13 | self.connection = conn
14 |
15 | def __getitem__(self, item):
16 | for x in self.tables():
17 | if x.name() == item:
18 | return x
19 | raise KeyError(item)
20 |
21 | @h.memoize
22 | def tables(self):
23 | """
24 | :return: all tables stored in this database
25 | """
26 | cursor = self.connection.cursor()
27 | cursor.execute("show tables in %s" % self.db)
28 | self._tables = [t.Table(r[0], con=self.connection, db=self.db) for r in cursor.fetchall()]
29 | return self._tables
30 |
31 | def __len__(self):
32 | return len(self.tables())
33 |
34 | @h.memoize
35 | def tcounts(self):
36 | """
37 | :return: a data frame containing the names and sizes for all tables
38 | """
39 | df = pd.DataFrame([[t.name(), t.size()] for t in self.tables()], columns=["name", "size"])
40 | df.index = df.name
41 | return df
42 |
43 | @h.memoize
44 | def dcounts(self):
45 | """
46 | :return: a data frame with names and distinct counts and fractions for all columns in the database
47 | """
48 | print("WARNING: Distinct value count for all tables can take a long time...", file=sys.stderr)
49 | sys.stderr.flush()
50 |
51 | data = []
52 | for t in self.tables():
53 | for c in t.columns():
54 | data.append([t.name(), c.name(), c.dcount(), t.size(), c.dcount() / float(t.size())])
55 | df = pd.DataFrame(data, columns=["table", "column", "distinct", "size", "fraction"])
56 | return df
57 |
58 |
59 | def _repr_html_(self):
60 | return h.render_table(["Name", "Size"], [[x.name(), x.size()] for x in self.tables()])
61 |
62 |
63 | def num_tables(self):
64 | return len(self)
65 |
66 | def num_columns(self):
67 | return sum([len(x.columns()) for x in self.tables()])
68 |
69 | def num_tuples(self):
70 | return sum([x.size() for x in self.tables()])
71 |
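For reference, the frame `tcounts()` returns has the shape sketched below, built here from hypothetical table names and sizes rather than a live connection:

```python
import pandas as pd

# Hypothetical names/sizes standing in for [[t.name(), t.size()] ...].
df = pd.DataFrame([["orders", 100], ["items", 20]], columns=["name", "size"])
df.index = df.name  # index by table name, as tcounts() does

print(df.loc["orders", "size"])  # 100
```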
--------------------------------------------------------------------------------
/pyxplorer/types.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import pandas as pd
4 | import helper as h
5 | import sys
6 |
7 |
8 | class Column:
9 | """
10 | Representation of a column and the profiling information
11 | """
12 |
13 |     def _qexec(self, fld, group=None, order=None):
14 |         c = self._con.cursor()
15 |         if group is not None:
16 |             group = " group by %s" % group
17 |         else:
18 |             group = ""
19 | 
20 |         if order is not None:
21 |             order = " order by %s" % order
22 |         else:
23 |             order = ""
24 |
25 | query = "select %s from `%s`.`%s` %s %s" % (fld, self._table.db(), self._table.name(), group, order)
26 | c.execute(query)
27 | return c.fetchall()
28 |
29 | def __init__(self, name, type_name, con, table):
30 | self._name = name
31 | self._type_name = type_name
32 | self._con = con
33 | self._table = table
34 | self._distribution = None
35 | self._min = None
36 | self._max = None
37 | self._dcount = None
38 | self._most_frequent = None
39 | self._most_frequent_count = None
40 | self._least_frequent = None
41 | self._least_frequent_count = None
42 |
43 | def __repr__(self):
44 | return self.name()
45 |
46 | def __str__(self):
47 | buf = "%s\n" % self.name()
48 | funs = [self.min, self.max, self.dcount, self.most_frequent, self.least_frequent]
49 | for x in funs:
50 | buf += "%s:\t%s\n" % (x.__name__, x())
51 | return buf
52 |
53 | def name(self):
54 | return self._name
55 |
56 | @classmethod
57 | def build(cls, data, con, table):
58 | return Column(data[0], data[1], con, table)
59 |
60 | def __eq__(self, other):
61 | return self._name == other._name and self._type_name == other._type_name
62 |
63 | @h.memoize
64 | def min(self):
65 | """
66 | :returns the minimum of the column
67 | """
68 | res = self._qexec("min(%s)" % self._name)
69 | if len(res) > 0:
70 | self._min = res[0][0]
71 | return self._min
72 |
73 | @h.memoize
74 | def max(self):
75 | """
76 | :returns the maximum of the column
77 | """
78 | res = self._qexec("max(%s)" % self._name)
79 | if len(res) > 0:
80 | self._max = res[0][0]
81 | return self._max
82 |
83 | @h.memoize
84 | def dcount(self):
85 | res = self._qexec("count(distinct %s)" % self._name)
86 | if len(res) > 0:
87 | self._dcount = res[0][0]
88 | return self._dcount
89 |
90 | @h.memoize
91 | def distribution(self, limit=1024):
92 | """
93 | Build the distribution of distinct values
94 | """
95 | res = self._qexec("%s, count(*) as __cnt" % self.name(), group="%s" % self.name(),
96 | order="__cnt DESC LIMIT %d" % limit)
97 | dist = []
98 | cnt = self._table.size()
99 | for i, r in enumerate(res):
100 | dist.append(list(r) + [i, r[1] / float(cnt)])
101 |
102 | self._distribution = pd.DataFrame(dist, columns=["value", "cnt", "r", "fraction"])
103 | self._distribution.index = self._distribution.r
104 |
105 | return self._distribution
106 |
107 | @h.memoize
108 | def most_frequent(self):
109 | res = self.n_most_frequent(1)
110 | self._most_frequent = res[0][0]
111 | self._most_frequent_count = res[0][1]
112 | return self._most_frequent, self._most_frequent_count
113 |
114 | @h.memoize
115 | def least_frequent(self):
116 | res = self.n_least_frequent(1)
117 | self._least_frequent = res[0][0]
118 | self._least_frequent_count = res[0][1]
119 | return self._least_frequent, self._least_frequent_count
120 |
121 | @h.memoize
122 | def n_most_frequent(self, limit=10):
123 | res = self._qexec("%s, count(*) as __cnt" % self.name(), group="%s" % self.name(),
124 | order="__cnt DESC LIMIT %d" % limit)
125 | return res
126 |
127 | @h.memoize
128 | def n_least_frequent(self, limit=10):
129 | res = self._qexec("%s, count(*) as cnt" % self.name(), group="%s" % self.name(),
130 | order="cnt ASC LIMIT %d" % limit)
131 | return res
132 |
133 | def size(self):
134 | return self._table.size()
135 |
136 | def uniqueness(self):
137 | return self.dcount() / float(self.size())
138 |
139 | def constancy(self):
140 | tup = self.most_frequent()
141 | return tup[1] / float(self.size())
142 |
143 | def _repr_html_(self):
144 |
145 | funs = [("Min", self.min), ("Max", self.max), ("#Distinct Values", self.dcount),
146 | ("Most Frequent", lambda: "{0} ({1})".format(*self.most_frequent())),
147 | ("Least Frequent", lambda: "{0} ({1})".format(*self.least_frequent())),
148 | ("Top 10 MF", lambda: ",".join(map(str, h.car(self.n_most_frequent())))),
149 | ("Top 10 LF", lambda: ", ".join(map(str, h.car(self.n_least_frequent())))),
150 | ("Uniqueness", self.uniqueness),
151 | ("Constancy", self.constancy),
152 | ]
153 | return h.render_table(["Name", "Value"], [[x[0], x[1]()] for x in funs])
154 |
155 |
156 | class Table:
157 | """
158 | Generic Table Object
159 |
160 | This class provides simple access to the columns of the table. Most of the methods that perform actual data access
161 | are cached to avoid costly lookups.
162 |
163 |
164 | """
165 |
166 | def __init__(self, name, con, db="default"):
167 | self._cols = []
168 | self._db = db
169 | self._name = name
170 | self._connection = con
171 |
172 | def name(self):
173 | """
174 | :return: name of the table
175 | """
176 | return self._name
177 |
178 | def db(self):
179 | """
180 | :return: name of the database used
181 | """
182 | return self._db
183 |
184 |     def column(self, col):
185 |         """
186 |         Given either a column index or name return the column structure
187 |         :param col: either index or name
188 |         :return: column data structure
189 |         """
190 |         if type(col) is str:
191 |             for c in self.columns():
192 |                 if c.name() == col:
193 |                     return c
194 |         else:
195 |             return self.columns()[col]
196 |
197 | @h.memoize
198 | def __len__(self):
199 | """
200 | :return: number of rows in the table
201 | """
202 | c = self._connection.cursor()
203 | c.execute("select count(*) from `%s`.`%s`" % (self._db, self._name))
204 | self._count = c.fetchall()[0][0]
205 | return self._count
206 |
207 | def size(self):
208 | """
209 | alias to __len__()
210 | :return:
211 | """
212 | return len(self)
213 |
214 | @h.memoize
215 | def columns(self):
216 | """
217 | :return: the list of column in this table
218 | """
219 | c = self._connection.cursor()
220 | c.execute("describe `%s`.`%s`" % (self._db, self._name))
221 | self._cols = []
222 | for col in c.fetchall():
223 | self._cols.append(Column.build(col, table=self, con=self._connection))
224 | return self._cols
225 |
226 | def __getitem__(self, item):
227 | """
228 | Subscript access to the tables by name
229 | :param item:
230 | :return:
231 | """
232 | for x in self.columns():
233 | if x.name() == item:
234 | return x
235 | raise KeyError(item)
236 |
237 | def __dir__(self):
238 | """
239 | :return: an array of custom attributes, for code-completion in ipython
240 | """
241 | return [x.name() for x in self.columns()]
242 |
243 | def __repr__(self):
244 |         return "<%s.%s>" % (self._db, self._name)
245 |
246 | def __getattr__(self, item):
247 | """
248 | :param item: name of the column
249 | :return: column object for attribute-like access to the column
250 | """
251 | for x in self.columns():
252 | if x.name() == item:
253 | return x
254 | raise AttributeError("'%s' object has no attribute '%s'" % (type(self).__name__, item))
255 |
256 | def num_columns(self):
257 | """
258 | :return: number of columns of the table
259 | """
260 | return len(self.columns())
261 |
262 | def distinct_value_fractions(self):
263 | """
264 | :return: returns a data frame of name distinct value fractions
265 | """
266 | return pd.DataFrame([c.dcount() / float(self.size()) for c in self.columns()],
267 | index=[c.name() for c in self.columns()], columns=["fraction"])
268 |
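The frame `distribution()` builds can be reproduced locally from toy `(value, count)` pairs, mirroring the construction in the method above (the pairs and the total are hypothetical stand-ins for the SQL result and `table.size()`):

```python
import pandas as pd

res = [("a", 50), ("b", 30), ("c", 20)]  # toy (value, count) result rows
cnt = float(sum(c for _, c in res))      # stands in for self._table.size()

# Same construction as Column.distribution(): value, count, rank, fraction.
dist = pd.DataFrame(
    [list(r) + [i, r[1] / cnt] for i, r in enumerate(res)],
    columns=["value", "cnt", "r", "fraction"],
)
dist.index = dist.r

print(list(dist["fraction"]))  # [0.5, 0.3, 0.2]
```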
--------------------------------------------------------------------------------
/pyxplorer_stuff.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "name": "",
4 | "signature": "sha256:4a51afdcc72d6c3700e33a0f3fa5a5570e51f803ccfcaeea43ab06c775c03da7"
5 | },
6 | "nbformat": 3,
7 | "nbformat_minor": 0,
8 | "worksheets": [
9 | {
10 | "cells": [
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "#Pyxplorer - interactive data set exploration\n",
16 | "\n",
17 | "The goal of pyxplorer is to provide a simple tool that allows interactive\n",
18 | "profiling of datasets that are accessible via a SQL-like interface. The only\n",
19 | "requirement for running data profiling is that you are able to provide a Python\n",
20 | "DBAPI-like interface to your data source and that the data source is able to\n",
21 | "understand simplistic SQL queries.\n",
22 | "\n",
23 | "I built this piece of software while trying to get a better understanding of\n",
24 | "the data distribution in a massive dataset of several hundred million records.\n",
25 | "Depending on the size of the dataset and the query engine the response time\n",
26 | "can range from seconds (Impala) to minutes (Hive) or even hours (MySQL).\n",
27 | "\n",
28 | "The typical use case is to use `pyxplorer` interactively from an iPython\n",
29 | "Notebook or iPython shell to incrementally extract information about your data."
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | " $> pip install pyxplorer pympala"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "Questions, Ideas, Comments:\n",
44 | "\n",
45 | "https://github.com/grundprinzip/pyxplorer"
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "## Example using Impala\n",
53 | "\n",
54 | "Basically `pyxplorer` works with all DBAPI-like interfaces, but to show the advantages of running high-performance data analysis on large amounts of data we will use Impala to store our data."
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "collapsed": false,
60 | "input": [
61 | "from impala.dbapi import connect\n",
62 | "conn = connect(host='diufpc57', port=21050)"
63 | ],
64 | "language": "python",
65 | "metadata": {},
66 | "outputs": [],
67 | "prompt_number": 2
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "## Database Operations"
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {},
79 | "source": [
80 | "Imagine that you are provided with access to a huge Hive/Impala database on\n",
81 | "your very own Hadoop cluster and you're asked to profile the data to get a\n",
82 | "better understanding before performing more specific data science later on. \n",
83 | "Based on this connection, we can now instantiate a new explorer object."
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "collapsed": false,
89 | "input": [
90 | "import pyxplorer as pxp\n",
91 | "data = pxp.Database(\"tpcc3\", conn)\n",
92 | "data"
93 | ],
94 | "language": "python",
95 | "metadata": {},
96 | "outputs": [
97 | {
98 | "html": [
99 | "<table><tr><th>Name</th><th>Size</th></tr><tr><td>customerp</td><td>30000000</td></tr><tr><td>districtp</td><td>10000</td></tr><tr><td>historyp</td><td>30000000</td></tr><tr><td>itemp</td><td>100000</td></tr><tr><td>new_orderp</td><td>9000000</td></tr><tr><td>oorderp</td><td>30000000</td></tr><tr><td>order_linep</td><td>299991280</td></tr><tr><td>stockp</td><td>100000000</td></tr><tr><td>warehousep</td><td>1000</td></tr></table><p>Rows: 9 / Columns: 2</p>"
100 | ],
101 | "metadata": {},
102 | "output_type": "pyout",
103 | "prompt_number": 4,
104 | "text": [
105 | ""
106 | ]
107 | }
108 | ],
109 | "prompt_number": 4
110 | },
111 | {
112 | "cell_type": "markdown",
113 | "metadata": {},
114 | "source": [
115 | "This simple code gives you access to all the tables in this database. Let's further investigate how many tables and columns exist in the database."
116 | ]
117 | },
118 | {
119 | "cell_type": "code",
120 | "collapsed": false,
121 | "input": [
122 | "len(data)"
123 | ],
124 | "language": "python",
125 | "metadata": {},
126 | "outputs": [
127 | {
128 | "metadata": {},
129 | "output_type": "pyout",
130 | "prompt_number": 12,
131 | "text": [
132 | "9"
133 | ]
134 | }
135 | ],
136 | "prompt_number": 12
137 | },
138 | {
139 | "cell_type": "markdown",
140 | "metadata": {},
141 | "source": [
142 | "The above is the idiomatic Python way, but sometimes it might not be as easy to grasp what is meant, so you can use the explicit method as well"
143 | ]
144 | },
145 | {
146 | "cell_type": "code",
147 | "collapsed": false,
148 | "input": [
149 | "data.num_tables()"
150 | ],
151 | "language": "python",
152 | "metadata": {},
153 | "outputs": [
154 | {
155 | "metadata": {},
156 | "output_type": "pyout",
157 | "prompt_number": 13,
158 | "text": [
159 | "9"
160 | ]
161 | }
162 | ],
163 | "prompt_number": 13
164 | },
165 | {
166 | "cell_type": "markdown",
167 | "metadata": {},
168 | "source": [
169 | "Get the total number of columns:"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "collapsed": false,
175 | "input": [
176 | "sum([len(x.columns()) for x in data.tables()])"
177 | ],
178 | "language": "python",
179 | "metadata": {},
180 | "outputs": [
181 | {
182 | "metadata": {},
183 | "output_type": "pyout",
184 | "prompt_number": 7,
185 | "text": [
186 | "92"
187 | ]
188 | }
189 | ],
190 | "prompt_number": 7
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "metadata": {},
195 | "source": [
196 | "Or we can directly use the number of columns method on the database object"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "collapsed": false,
202 | "input": [
203 | "data.num_columns()"
204 | ],
205 | "language": "python",
206 | "metadata": {},
207 | "outputs": [
208 | {
209 | "metadata": {},
210 | "output_type": "pyout",
211 | "prompt_number": 9,
212 | "text": [
213 | "92"
214 | ]
215 | }
216 | ],
217 | "prompt_number": 9
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": [
223 | "It seems like we have a better understanding of the dataset, but how many tuples are we talking about?"
224 | ]
225 | },
226 | {
227 | "cell_type": "code",
228 | "collapsed": false,
229 | "input": [
230 | "data.num_tuples()"
231 | ],
232 | "language": "python",
233 | "metadata": {},
234 | "outputs": [
235 | {
236 | "metadata": {},
237 | "output_type": "pyout",
238 | "prompt_number": 15,
239 | "text": [
240 | "499102280"
241 | ]
242 | }
243 | ],
244 | "prompt_number": 15
245 | },
246 | {
247 | "cell_type": "markdown",
248 | "metadata": {},
249 | "source": [
250 | "## Single Table Operations"
251 | ]
252 | },
253 | {
254 | "cell_type": "markdown",
255 | "metadata": {},
256 | "source": [
257 | "Using the above operations, we can perform simple operations on all tables, but let's have a further look at single table operations to extract more information from instances.\n",
258 | "\n",
259 | "In this example, we want to investigate the `order_line` table."
260 | ]
261 | },
262 | {
263 | "cell_type": "code",
264 | "collapsed": false,
265 | "input": [
266 | "tab = data['order_linep']\n",
267 | "tab"
268 | ],
269 | "language": "python",
270 | "metadata": {},
271 | "outputs": [
272 | {
273 | "metadata": {},
274 | "output_type": "pyout",
275 | "prompt_number": 14,
276 | "text": [
277 | ""
278 | ]
279 | }
280 | ],
281 | "prompt_number": 14
282 | },
283 | {
284 | "cell_type": "markdown",
285 | "metadata": {},
286 | "source": [
287 | "Let's start by doing some basic inspection of the table, like extracting the number of rows and the number of columns"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "collapsed": false,
293 | "input": [
294 | "tab.size()"
295 | ],
296 | "language": "python",
297 | "metadata": {},
298 | "outputs": [
299 | {
300 | "metadata": {},
301 | "output_type": "pyout",
302 | "prompt_number": 17,
303 | "text": [
304 | "299991280"
305 | ]
306 | }
307 | ],
308 | "prompt_number": 17
309 | },
310 | {
311 | "cell_type": "code",
312 | "collapsed": false,
313 | "input": [
314 | "len(tab.columns())"
315 | ],
316 | "language": "python",
317 | "metadata": {},
318 | "outputs": [
319 | {
320 | "metadata": {},
321 | "output_type": "pyout",
322 | "prompt_number": 18,
323 | "text": [
324 | "10"
325 | ]
326 | }
327 | ],
328 | "prompt_number": 18
329 | },
330 | {
331 | "cell_type": "code",
332 | "collapsed": false,
333 | "input": [
334 | "tab.columns()"
335 | ],
336 | "language": "python",
337 | "metadata": {},
338 | "outputs": [
339 | {
340 | "metadata": {},
341 | "output_type": "pyout",
342 | "prompt_number": 19,
343 | "text": [
344 | "[ol_w_id,\n",
345 | " ol_d_id,\n",
346 | " ol_o_id,\n",
347 | " ol_number,\n",
348 | " ol_i_id,\n",
349 | " ol_delivery_d,\n",
350 | " ol_amount,\n",
351 | " ol_supply_w_id,\n",
352 | " ol_quantity,\n",
353 | " ol_dist_info]"
354 | ]
355 | }
356 | ],
357 | "prompt_number": 19
358 | },
359 | {
360 | "cell_type": "markdown",
361 | "metadata": {},
362 | "source": [
363 | "Columns are special objects that can be easily and interactively inspected in iPython Notebooks. The default information per column is the `min` and `max` value, the most and least frequent value, and the total number of distinct values. Based on these measures we provide information about the column.\n",
364 | "\n",
365 | "$uniqueness = \\frac{distinct}{rows}$\n",
366 | "\n",
367 | "$constancy = \\frac{count_{mf}}{rows}$"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "collapsed": false,
373 | "input": [
374 | "tab['ol_w_id']"
375 | ],
376 | "language": "python",
377 | "metadata": {},
378 | "outputs": [
379 | {
380 | "html": [
381 | "