├── .gitignore
├── README.md
├── environment.yml
├── generate_sample_data
│   └── write_sample_data_to_db.py
└── pca_pipeline
    ├── __init__.py
    ├── luigi.cfg
    ├── pipeline.py
    ├── sql
    │   └── functions.sql
    └── utils.py

/.gitignore:
--------------------------------------------------------------------------------
*.pyc
pca_pipeline/target/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# luigi_gdb_pipeline_demo
An example illustrating how to use Luigi to manage a data science workflow in Greenplum Database (GPDB).
For more info see: http://engineering.pivotal.io/post/luigi-data-science/

## Use Case Description

The use case for this demo is a simple unsupervised approach for identifying anomalous or highly variant user network behavior based on activity patterns. The main idea is to create 24 matrices, one for each hour of the day. The rows of each matrix are indexed by network users and the columns by days, so entry (i,j) of the matrix for hour k counts the number of events from user i during hour k of day j. Creating a separate matrix for each hour accounts for the differing volumes of traffic throughout the day. We then run principal component analysis (PCA) on each of these matrices and examine the large entries of the principal component directions, which correspond to ‘erratic’ users who drive most of the variance in the data. For example, the sample data generator gives user 'a' a spike of extra activity during hour 12 of one day, so that user should stand out in the principal direction for the hour-12 matrix. In later iterations we can extend the pipeline to use PCA for dimension reduction, clustering and other more sophisticated methods.

We break our workflow into five types of Luigi tasks:

1. Initialize the schemas and user-defined functions needed by the pipeline.
2. Create a table counting the number of flows for each user during each hour of the data set.
3. For each of the 24 hours, use the output of step 2 to create a table holding a sparse-matrix representation of that hour's flow-count matrix.
4. Run the MADlib PCA function to compute the principal components of each of the 24 flow-count matrices. The MADlib PCA algorithm runs in parallel across the GPDB segments.
5. For each of the 24 principal directions, identify users with large entries and write a list of these users, along with an outlier score (the magnitude of the entry), to a file for investigation.

## Steps to run

1. Install the Pivotal Greenplum Sandbox Virtual Machine, available for free download from the [Pivotal Network](https://network.pivotal.io/products/pivotal-gpdb#/releases/567/file_groups/337).
   Once the Sandbox is started with VirtualBox or VMware Fusion, starting the GPDB instance is a two-step process:
   1. Log in as gpadmin with the password provided at startup.
   2. Run ./start_all.sh to start the GPDB instance.
2. Create an ssh tunnel to make the GPDB instance available to a psycopg2 connection on localhost at port 9912:
   * $ ssh gpadmin@<VM IP address> -L 9912:127.0.0.1:5432
3. Set environment variables so the scripts can find GPDB:
   * export GPDB_DATABASE=gpadmin
   * export GPDB_USER=gpadmin
   * export GPDB_PASSWORD=<gpadmin password>
   * export GPDB_HOST=localhost
   * export GPDB_PORT=9912

4. Run [write_sample_data_to_db.py](https://github.com/ericwayman/luigi_gdb_pipeline_demo/blob/master/generate_sample_data/write_sample_data_to_db.py) to populate the database with fake data.

5. Run the Luigi pipeline from inside pca_pipeline/:
   1. $ luigid --background
   2. $ python -m luigi --module pipeline PipelineTask

We can view the status of the pipeline and its dependency graph by opening a browser to http://localhost:8082.
After running the pipeline, the output files will be available in pca_pipeline/target/ by default.

--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
name: luigi
channels: !!python/tuple
- !!python/unicode
  'defaults'
dependencies:
- appnope=0.1.0=py35_0
- cached-property=1.3.0=py35_0
- decorator=4.0.10=py35_0
- entrypoints=0.2.2=py35_0
- freetype=2.5.5=1
- h5py=2.6.0=np111py35_2
- hdf5=1.8.17=1
- ipykernel=4.3.1=py35_0
- ipython=5.0.0=py35_0
- ipython_genutils=0.1.0=py35_0
- ipywidgets=4.1.1=py35_0
- jinja2=2.8=py35_1
- jsonschema=2.5.1=py35_0
- jupyter=1.0.0=py35_3
- jupyter_client=4.3.0=py35_0
- jupyter_console=5.0.0=py35_0
- jupyter_core=4.1.0=py35_0
- libpng=1.6.22=0
- lockfile=0.12.2=py35_0
- markupsafe=0.23=py35_2
- mistune=0.7.2=py35_1
- mkl=11.3.3=0
- nbconvert=4.2.0=py35_0
- nbformat=4.0.1=py35_0
- nose=1.3.7=py35_1
- notebook=4.2.2=py35_0
- numpy=1.11.1=py35_0
- openssl=1.0.2h=1
- pandas=0.18.1=np111py35_0
- path.py=8.2.1=py35_0
- pexpect=4.0.1=py35_0
- pickleshare=0.7.3=py35_0
- pip=8.1.2=py35_0
- prompt_toolkit=1.0.3=py35_0
- ptyprocess=0.5.1=py35_0
- pygments=2.1.3=py35_0
- pyparsing=2.1.4=py35_0
- pyqt=4.11.4=py35_4
- python=3.5.2=0
- python-daemon=2.1.1=py35_0
- python-dateutil=2.5.3=py35_0
- python.app=1.2=py35_4
- pytz=2016.6.1=py35_0
- pyzmq=15.4.0=py35_0
- qt=4.8.7=4
- qtconsole=4.2.1=py35_0
- readline=6.2=2
- scikit-learn=0.17.1=np111py35_2
- scipy=0.18.0=np111py35_0
- setuptools=25.1.6=py35_0
- simplegeneric=0.8.1=py35_1
- sip=4.18=py35_0
- six=1.10.0=py35_0
- sqlalchemy=1.1.4=py35_0
- sqlite=3.13.0=0
- terminado=0.6=py35_0
- tk=8.5.18=0
- tornado=4.4.1=py35_0
- traitlets=4.2.2=py35_0
- wcwidth=0.1.7=py35_0
- wheel=0.29.0=py35_0
- xz=5.2.2=0
- zlib=1.2.8=3
- pip:
  - appdirs==1.4.0
  - docutils==0.13.1
  - ipython-genutils==0.1.0
  - jupyter-client==4.3.0
  - jupyter-console==5.0.0
  - jupyter-core==4.1.0
  - keras==1.0.7
  - luigi==2.5.0
  - packaging==16.8
  - prompt-toolkit==1.0.3
  - psycopg2==2.6.2
  - pyyaml==3.11
  - theano==0.9.0.dev2
prefix: //anaconda/envs/luigi

--------------------------------------------------------------------------------
/generate_sample_data/write_sample_data_to_db.py:
--------------------------------------------------------------------------------
#generate table with users and time stamps
import time
import datetime
from random import sample,randint,seed
import psycopg2 as pg
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT
import os

#gpdb connection
connection = pg.connect(
    host=os.getenv("GPDB_HOST"),
    database=os.getenv("GPDB_DATABASE"),
    user=os.getenv("GPDB_USER"),
    password=os.getenv("GPDB_PASSWORD"),
    port = os.getenv("GPDB_PORT"))
connection.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
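# The connection parameters come from the GPDB_* environment variables described
# in step 3 of the README. Autocommit is enabled so that the DROP/CREATE TABLE
# and INSERT statements issued below take effect immediately without an explicit
# connection.commit().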
#parameters 18 | USERS = list('abcdefghijklm') 19 | MAX_DATE = "15/01/2017" 20 | MIN_DATE = "01/01/2017" 21 | SEED = 1 22 | 23 | #helper functions 24 | def unix_time_from_string(date_str,date_format="%d/%m/%Y"): 25 | "converts string of form dd/mm/yyyy to unix time stamp" 26 | ut = int(time.mktime(datetime.datetime.strptime(date_str, date_format).timetuple())) 27 | return ut 28 | 29 | def generate_unix_time_stamps(num_dates,min_date,max_date,date_format="%d/%m/%Y"): 30 | """generate num_dates by hour in min_date,max_date range""" 31 | min_seconds = unix_time_from_string(min_date,date_format) 32 | max_seconds = unix_time_from_string(max_date,date_format) 33 | unix_times = sample(range(min_seconds,max_seconds),num_dates) 34 | return unix_times 35 | 36 | 37 | 38 | def main(): 39 | seed(SEED) 40 | with connection.cursor() as curs: 41 | curs.execute( 42 | """DROP TABLE IF EXISTS sample_data; 43 | CREATE TABLE sample_data 44 | ( 45 | user_name varchar, 46 | start_time bigint 47 | );""" 48 | ) 49 | total_rows=0 50 | print(USERS) 51 | for user in USERS: 52 | NUM_FLOWS = randint(800,2400) 53 | date_list = generate_unix_time_stamps(NUM_FLOWS,min_date=MIN_DATE,max_date=MAX_DATE) 54 | #add spike for user 'a' in hour 12. 55 | if user in ['a']: 56 | spike_volume = 300 57 | spikes = generate_unix_time_stamps(spike_volume,min_date='12/01/2017 12',max_date='12/01/2017 13',date_format="%d/%m/%Y %H") 58 | date_list.extend(spikes) 59 | NUM_FLOWS = NUM_FLOWS+spike_volume 60 | rows = [(user,str(date)) for date in date_list] 61 | total_rows = total_rows+NUM_FLOWS 62 | print("user: {u}, num_flows: {n}".format(u=user,n=NUM_FLOWS)) 63 | records_list_template = ','.join(['%s'] * len(rows)) 64 | insert_query = 'INSERT INTO sample_data (user_name, start_time) VALUES {}'.format(records_list_template) 65 | with connection.cursor() as curs: 66 | curs.execute(insert_query, rows) 67 | print("Total rows added: {}".format(total_rows)) 68 | 69 | if __name__ == "__main__": 70 | main() -------------------------------------------------------------------------------- /pca_pipeline/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ericwayman/luigi_gdb_pipeline_demo/30398d95a6259b547c820ca93cbf1bde711609c2/pca_pipeline/__init__.py -------------------------------------------------------------------------------- /pca_pipeline/luigi.cfg: -------------------------------------------------------------------------------- 1 | [DatabaseConfig] 2 | base_table=sample_data 3 | feature_input_table=sample.hourly_counts 4 | pca_input_base_table=sample.pca_input 5 | pca_output_base_table=sample.pca_output 6 | outlier_base_table=sample.outliers 7 | 8 | [PathConfig] 9 | 10 | [ModelConfig] 11 | user_col=user_name 12 | percentage_val=.99 13 | threshold=.1 14 | -------------------------------------------------------------------------------- /pca_pipeline/pipeline.py: -------------------------------------------------------------------------------- 1 | import os 2 | import luigi 3 | from datetime import date 4 | from psycopg2.extensions import register_adapter 5 | from utils import PSQLConn,create_hourly_counts_table,create_pca_input_table,find_principal_components,extract_large_pca_components 6 | from utils import QuotedIdentifier 7 | from utils import initialize_user_defined_functions, test_for_nulls 8 | 9 | cred = PSQLConn(os.getenv("GPDB_DATABASE"), 10 | os.getenv("GPDB_USER"), 11 | os.getenv("GPDB_PASSWORD"), 12 | os.getenv("GPDB_HOST"), 13 | os.getenv("GPDB_PORT") 14 | ) 
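# cred only stores the GPDB connection settings (read from the same GPDB_*
# environment variables used by generate_sample_data); each task calls
# cred.connect() to open its own autocommit connection.
# register_adapter below makes psycopg2 pass QuotedIdentifier objects straight
# through: QuotedIdentifier.getquoted() (see utils.py) returns the identifier
# only if it matches a whitelist regex, which is how table names are spliced
# safely into the dynamic SQL built with cursor.mogrify().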
15 | 16 | register_adapter(QuotedIdentifier, lambda x: x) 17 | 18 | class DatabaseConfig(luigi.Config): 19 | base_table=luigi.Parameter() 20 | feature_input_table=luigi.Parameter() 21 | pca_input_base_table=luigi.Parameter() 22 | pca_output_base_table=luigi.Parameter() 23 | outlier_base_table=luigi.Parameter() 24 | 25 | class ModelConfig(luigi.Config): 26 | user_col=luigi.Parameter() 27 | percentage_val=luigi.FloatParameter() 28 | threshold = luigi.FloatParameter() 29 | 30 | #some default params 31 | TARGET_PATH=os.path.join(os.path.dirname(__file__),'target/{feature}_{date}'.format( 32 | feature=ModelConfig().user_col, 33 | date=date.today()) 34 | ) 35 | 36 | #config classes 37 | class PathConfig(luigi.Config): 38 | target_path=luigi.Parameter(default=TARGET_PATH) 39 | 40 | def table_names_dict(id): 41 | pca_tables = { 42 | 'hour':id, 43 | 'pca_input':DatabaseConfig().pca_input_base_table+'_{}'.format(id), 44 | 'pca_output':DatabaseConfig().pca_output_base_table+'_{}'.format(id), 45 | 'outlier_output':DatabaseConfig().outlier_base_table+'_{}'.format(id) 46 | } 47 | return pca_tables 48 | 49 | #tasks 50 | class InitializeUserDefinedFunctions(luigi.Task): 51 | """Define user defined functions""" 52 | date = luigi.DateParameter() 53 | 54 | def run(self): 55 | conn = cred.connect() 56 | with conn.cursor() as curs: 57 | curs.execute(initialize_user_defined_functions()) 58 | 59 | with self.output().open('w') as out_file: 60 | out_file.write("successfully created functions") 61 | 62 | def output(self): 63 | return luigi.LocalTarget(os.path.join(TARGET_PATH,"initialize_user_defined_functions")) 64 | 65 | class CreateHourlyCountsTable(luigi.Task): 66 | """ 67 | Task to compute table of hourly record counts for each user on each day 68 | """ 69 | date = luigi.DateParameter() 70 | 71 | def requires(self): 72 | return [InitializeUserDefinedFunctions(date=self.date)] 73 | 74 | def run(self): 75 | conn = cred.connect() 76 | test_for_nulls(conn,table_name=DatabaseConfig().base_table) 77 | create_hourly_counts_table(conn, 78 | input_table=DatabaseConfig().base_table, 79 | output_table=DatabaseConfig().feature_input_table, 80 | user_column=ModelConfig().user_col 81 | ) 82 | with conn.cursor() as curs: 83 | query = curs.mogrify("SELECT * FROM %s LIMIT 1",(QuotedIdentifier(DatabaseConfig().feature_input_table),)) 84 | curs.execute(query) 85 | rows = curs.fetchall() 86 | test_for_nulls(conn,table_name=DatabaseConfig().feature_input_table) 87 | with self.output().open('w') as out_file: 88 | for row in rows: 89 | out_file.write(str(row)) 90 | 91 | def output(self): 92 | return luigi.LocalTarget(os.path.join(TARGET_PATH,'create_hourly_counts_table')) 93 | 94 | class CreatePCAInputTable(luigi.Task): 95 | """ 96 | Task to create a PCA input table for data with given feature_col and (day of week, hour of day) pair from the NumFlowsForFeatureTable 97 | """ 98 | date = luigi.DateParameter() 99 | hour_id = luigi.IntParameter() 100 | 101 | def requires(self): 102 | return [CreateHourlyCountsTable(date=self.date)] 103 | 104 | def run(self): 105 | conn = cred.connect() 106 | output_table = table_names_dict(self.hour_id)['pca_input'] 107 | create_pca_input_table(conn, 108 | input_table=DatabaseConfig().feature_input_table, 109 | output_table=output_table, 110 | user_column=ModelConfig().user_col, 111 | id=self.hour_id 112 | ) 113 | test_for_nulls(conn,table_name=output_table) 114 | with conn.cursor() as curs: 115 | query = curs.mogrify("SELECT * FROM %s LIMIT 1",(QuotedIdentifier(output_table),)) 116 | 
curs.execute(query) 117 | rows = curs.fetchall() 118 | 119 | with self.output().open('w') as out_file: 120 | for row in rows: 121 | out_file.write(str(row)) 122 | 123 | def output(self): 124 | return luigi.LocalTarget(os.path.join(TARGET_PATH,"pca_input_table_for_hour_{}".format(self.hour_id))) 125 | 126 | 127 | class RunPCATask(luigi.Task): 128 | """ 129 | Task to run PCA on tables PCA input tables generated by CreatePCAInputTable task 130 | """ 131 | date = luigi.DateParameter() 132 | hour_id = luigi.IntParameter() 133 | 134 | def requires(self): 135 | return [CreatePCAInputTable(date=self.date,hour_id=self.hour_id)] 136 | 137 | def run(self): 138 | conn = cred.connect() 139 | tables = table_names_dict(self.hour_id) 140 | input_table = tables['pca_input'] 141 | output_table = tables['pca_output'] 142 | find_principal_components(conn, 143 | input_table=input_table, 144 | output_table=output_table, 145 | percentage_val=ModelConfig().percentage_val 146 | ) 147 | test_for_nulls(conn,table_name=output_table) 148 | with conn.cursor() as curs: 149 | query = curs.mogrify("SELECT COUNT(*) FROM %s LIMIT 1",(QuotedIdentifier(output_table),)) 150 | curs.execute(query) 151 | rows = curs.fetchall() 152 | 153 | with self.output().open('w') as out_file: 154 | for row in rows: 155 | out_file.write(str(row)) 156 | 157 | def output(self): 158 | return luigi.LocalTarget(os.path.join(TARGET_PATH,"pca_output_table_for_hour_{}".format(self.hour_id))) 159 | 160 | class ComputeOutlierFeaturesFromPCA(luigi.Task): 161 | """ 162 | Task to compute outlier features by looking at entries in principal components with value above a given threshold 163 | """ 164 | date = luigi.DateParameter() 165 | hour_id = luigi.IntParameter() 166 | 167 | def requires(self): 168 | return [RunPCATask(date=self.date,hour_id=self.hour_id)] 169 | 170 | def run(self): 171 | conn = cred.connect() 172 | tables = table_names_dict(self.hour_id) 173 | pca_input_table = tables['pca_input'] 174 | pca_table = tables['pca_output'] 175 | output_table = tables['outlier_output'] 176 | extract_large_pca_components(conn, 177 | output_table=output_table, 178 | base_feature_table=pca_input_table, 179 | pca_table=pca_table, 180 | user_column=ModelConfig().user_col, 181 | threshold=ModelConfig().threshold 182 | ) 183 | with conn.cursor() as curs: 184 | query = curs.mogrify("SELECT * FROM %s ORDER BY outlier_score DESC LIMIT 100",(QuotedIdentifier(output_table),)) 185 | curs.execute(query) 186 | rows = curs.fetchall() 187 | 188 | with self.output().open('w') as out_file: 189 | for row in rows: 190 | out_file.write(str(row)) 191 | 192 | def output(self): 193 | return luigi.LocalTarget(os.path.join( 194 | TARGET_PATH, 195 | "outliers/pca_outliers_hour_{hour}_threshold_{threshold}".format(hour=self.hour_id,threshold=ModelConfig().threshold) 196 | )) 197 | 198 | #pipeline classes 199 | class PipelineTask(luigi.WrapperTask): 200 | """Wrap up all the tasks for the pipeline into a single task 201 | So we can run this pipeline by calling this dummy task""" 202 | date = luigi.DateParameter(default=date.today()) 203 | def requires(self): 204 | base_tasks = [ 205 | InitializeUserDefinedFunctions(date=self.date), 206 | CreateHourlyCountsTable(date=self.date) 207 | ] 208 | 209 | ids = list(range(0,24)) 210 | 211 | pca_input_tasks = [CreatePCAInputTable(date=self.date,hour_id=id) for id in ids] 212 | pca_output_tasks = [RunPCATask(date=self.date,hour_id=id) for id in ids] 213 | outlier_tasks = [ComputeOutlierFeaturesFromPCA(date=self.date,hour_id=id) for id in ids] 214 | 
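# One CreatePCAInputTable, RunPCATask and ComputeOutlierFeaturesFromPCA task is
# created per hour of the day (hour_id 0-23); requiring them all here lets Luigi
# build and schedule the complete dependency graph from this single wrapper task.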
tasks = base_tasks + pca_input_tasks + pca_output_tasks + outlier_tasks 215 | return tasks 216 | 217 | def run(self): 218 | with self.output().open('w') as out_file: 219 | out_file.write("successly ran pipeline on {}".format(self.date)) 220 | 221 | def output(self): 222 | return luigi.LocalTarget(os.path.join(TARGET_PATH,"dummy")) 223 | -------------------------------------------------------------------------------- /pca_pipeline/sql/functions.sql: -------------------------------------------------------------------------------- 1 | -------TEST FUNCTION------------ 2 | DROP FUNCTION IF EXISTS test(); 3 | CREATE FUNCTION test() 4 | RETURNS text 5 | AS 6 | $BODY$ 7 | BEGIN 8 | return 'success'; 9 | END; 10 | $BODY$ LANGUAGE plpgsql VOLATILE; 11 | --------------------------------- 12 | -- CREATE INPUT TABLE WITH hourly counts for each user 13 | --user_column = user_name 14 | DROP FUNCTION IF EXISTS create_hourly_counts_table(text,text,text); 15 | CREATE OR REPLACE FUNCTION create_hourly_counts_table(input_table text,output_table text, user_column text) 16 | RETURNS VOID 17 | AS 18 | $BODY$ 19 | BEGIN 20 | EXECUTE 'DROP TABLE IF EXISTS '||output_table; 21 | EXECUTE 'CREATE TABLE '||output_table||' AS 22 | SELECT '||user_column||', 23 | date_trunc(''day'',to_timestamp(start_time)) AS day, 24 | EXTRACT(HOUR FROM to_timestamp(start_time)) AS hour_of_day, 25 | COUNT(*) AS hourly_count 26 | FROM '||input_table||' 27 | GROUP BY 1, 2, 3 28 | DISTRIBUTED RANDOMLY'; 29 | END; 30 | $BODY$ LANGUAGE plpgsql VOLATILE; 31 | 32 | --generate a PCA input table for a given hour of the day. 33 | --name_id is to assign the usernames a row number, 34 | DROP FUNCTION IF EXISTS create_pca_input_table(text,text,text,int); 35 | CREATE OR REPLACE FUNCTION create_pca_input_table(input_table text, output_table text,user_column text,hour_of_day int) 36 | RETURNS void 37 | AS 38 | $BODY$ 39 | BEGIN 40 | EXECUTE 'DROP TABLE IF EXISTS '||output_table; 41 | EXECUTE 'CREATE TABLE '||output_table||' AS 42 | ( 43 | WITH a AS 44 | ( 45 | SELECT * FROM '||input_table||' where hour_of_day = '||hour_of_day||' 46 | ), 47 | 48 | b AS 49 | ( 50 | select day, day_id 51 | FROM 52 | ( 53 | select day, rank() over (ORDER BY day) AS day_id 54 | FROM a 55 | group by 1 56 | ) t1 57 | group by 1, 2 58 | ), 59 | 60 | c AS 61 | ( 62 | select '||user_column||', rank() over (ORDER BY '||user_column||') AS name_id 63 | FROM a 64 | group by 1 65 | ) 66 | 67 | select b.day_id, c.name_id, t1.* 68 | FROM a t1 69 | INNER JOIN b ON t1.day = b.day 70 | INNER JOIN c ON t1.'||user_column||' = c.'||user_column||' 71 | ) 72 | DISTRIBUTED RANDOMLY'; 73 | END; 74 | $BODY$ LANGUAGE plpgsql VOLATILE; 75 | 76 | --find principal components 77 | DROP FUNCTION IF EXISTS find_principal_components( 78 | text,text,text,text,text,float8 79 | ); 80 | CREATE OR REPLACE FUNCTION find_principal_components( 81 | input_table text, 82 | output_table text, 83 | percentage_val float8) 84 | RETURNS void AS 85 | $BODY$ 86 | DECLARE 87 | num_time_int integer; 88 | num_users integer; 89 | 90 | BEGIN 91 | 92 | EXECUTE 'select count(distinct(day_id)) FROM '||input_table into num_time_int; 93 | EXECUTE 'select count(distinct(name_id)) FROM '||input_table into num_users; 94 | 95 | EXECUTE 'DROP TABLE IF EXISTS '||output_table; 96 | EXECUTE 'DROP TABLE IF EXISTS '||output_table||'_mean'; 97 | EXECUTE 'SELECT madlib.pca_sparse_train( '''||input_table||''', 98 | '''||output_table||''', 99 | ''day_id'', 100 | ''name_id'', 101 | ''hourly_count'', 102 | '||num_time_int||', 103 | 
'||num_users||', 104 | '||percentage_val||' 105 | )'; 106 | 107 | END; 108 | $BODY$ 109 | LANGUAGE plpgsql VOLATILE STRICT; 110 | 111 | DROP FUNCTION IF EXISTS extract_large_pca_components(text,text,text,text,float); 112 | CREATE OR REPLACE FUNCTION extract_large_pca_components( 113 | output_table text, 114 | base_feature_table text, 115 | pca_table text, 116 | user_column text, 117 | treshold float) 118 | RETURNS VOID 119 | AS 120 | $BODY$ 121 | BEGIN 122 | EXECUTE 'DROP TABLE IF EXISTS '||output_table; 123 | EXECUTE 'CREATE TABLE '||output_table||' AS 124 | SELECT 125 | a.'||user_column||', 126 | b.outlier_score 127 | FROM 128 | ( 129 | SELECT 130 | '||user_column||', 131 | name_id 132 | FROM 133 | '||base_feature_table||' 134 | GROUP BY 1,2 135 | ) a 136 | INNER JOIN 137 | ( 138 | SELECT 139 | featureno, 140 | MAX(pcval) AS outlier_score 141 | FROM 142 | ( 143 | SELECT 144 | featureno, 145 | pcval 146 | FROM 147 | ( 148 | SELECT 149 | row_id, 150 | generate_series(1,array_upper(principal_components,1)) as featureno, 151 | unnest(principal_components) as pcval 152 | FROM '||pca_table||' 153 | ) t1 154 | where pcval > abs('||treshold||') 155 | )t2 156 | GROUP BY 1 157 | ) b 158 | ON (a.name_id=b.featureno) 159 | DISTRIBUTED RANDOMLY'; 160 | END; 161 | $BODY$ LANGUAGE plpgsql VOLATILE; 162 | -------------------------------------------------------------------------------- /pca_pipeline/utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | import psycopg2 as pg 3 | from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT 4 | from psycopg2.extensions import register_adapter 5 | import os 6 | 7 | class NotSqlIdentifierError(Exception): 8 | pass 9 | 10 | valid_pattern = r'^[a-zA-Z0-9_\.\$]*$' 11 | 12 | class QuotedIdentifier(object): 13 | def __init__(self, obj_str): 14 | self.obj_str = obj_str 15 | 16 | def getquoted(self): 17 | if re.match(valid_pattern, self.obj_str): 18 | return self.obj_str 19 | else: 20 | raise NotSqlIdentifierError(repr(self.obj_str)) 21 | 22 | register_adapter(QuotedIdentifier, lambda x: x) 23 | 24 | class PSQLConn(object): 25 | """Stores the connection to psql.""" 26 | def __init__(self, db, user, password, host,port=5432): 27 | self.db = db 28 | self.user = user 29 | self.password = password 30 | self.host = host 31 | self.port = port 32 | 33 | def connect(self): 34 | connection = pg.connect( 35 | host=self.host, 36 | database=self.db, 37 | user=self.user, 38 | password=self.password, 39 | port = self.port) 40 | connection.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT) 41 | return connection 42 | 43 | 44 | #helper function to open functions.sql file 45 | src_path = os.path.dirname(os.path.abspath(__file__)) 46 | sql_path = os.path.join(src_path,'sql/') 47 | 48 | def initialize_user_defined_functions(): 49 | user_defined_functions_file = os.path.join(sql_path,'functions.sql') 50 | with open(user_defined_functions_file) as f: 51 | cmd = f.read() 52 | return cmd 53 | 54 | def create_hourly_counts_table(conn,input_table, output_table, user_column): 55 | with conn.cursor() as curs: 56 | query = curs.mogrify(""" 57 | SELECT create_hourly_counts_table('%s','%s','%s') 58 | """,(QuotedIdentifier(input_table),QuotedIdentifier(output_table),QuotedIdentifier(user_column))) 59 | curs.execute(query) 60 | 61 | def create_pca_input_table(conn,input_table, output_table,user_column,id): 62 | with conn.cursor() as curs: 63 | query = curs.mogrify(""" 64 | SELECT create_pca_input_table('%s','%s','%s',%s) 65 | 
""",(QuotedIdentifier(input_table), 66 | QuotedIdentifier(output_table), 67 | QuotedIdentifier(user_column), 68 | id 69 | ) 70 | ) 71 | curs.execute(query) 72 | 73 | def find_principal_components(conn,input_table,output_table,percentage_val): 74 | with conn.cursor() as curs: 75 | query = curs.mogrify(""" 76 | SELECT find_principal_components('%s','%s','%s') 77 | """,(QuotedIdentifier(input_table), 78 | QuotedIdentifier(output_table), 79 | percentage_val 80 | ) 81 | ) 82 | curs.execute(query) 83 | 84 | def extract_large_pca_components(conn,output_table,base_feature_table,pca_table,user_column,threshold): 85 | with conn.cursor() as curs: 86 | query=curs.mogrify(""" 87 | SELECT extract_large_pca_components('%s','%s','%s','%s',%s) 88 | """,(QuotedIdentifier(output_table), 89 | QuotedIdentifier(base_feature_table), 90 | QuotedIdentifier(pca_table), 91 | QuotedIdentifier(user_column), 92 | threshold 93 | ) 94 | ) 95 | curs.execute(query) 96 | 97 | def test_for_nulls(conn,table_name): 98 | """ 99 | check all columns of table_name to make sure no nulls are present 100 | """ 101 | with conn.cursor() as curs: 102 | pieces = table_name.split('.') 103 | if len(pieces) == 1: 104 | column_query = curs.mogrify(""" 105 | SELECT column_name from information_schema.columns where table_name = '%s' 106 | """,(QuotedIdentifier(pieces[0]),) 107 | ) 108 | else: 109 | column_query = curs.mogrify(""" 110 | SELECT column_name from information_schema.columns where table_name = '%s' AND table_schema = '%s' 111 | """, (QuotedIdentifier(pieces[1]), 112 | QuotedIdentifier(pieces[0]) 113 | ) 114 | ) 115 | curs.execute(column_query) 116 | columns = curs.fetchall() 117 | columns = [c[0] for c in columns] 118 | for col in columns: 119 | query = curs.mogrify(""" 120 | SELECT COUNT(*) - COUNT(%s) FROM %s; 121 | """, 122 | (QuotedIdentifier(col), 123 | QuotedIdentifier(table_name) 124 | ) 125 | ) 126 | curs.execute(query) 127 | result = curs.fetchone() 128 | num_nulls = result[0] 129 | assert num_nulls == 0 130 | --------------------------------------------------------------------------------