├── .gitignore
├── README.md
├── environment.yml
├── generate_sample_data
│   └── write_sample_data_to_db.py
└── pca_pipeline
    ├── __init__.py
    ├── luigi.cfg
    ├── pipeline.py
    ├── sql
    │   └── functions.sql
    └── utils.py

/.gitignore:
--------------------------------------------------------------------------------
*.pyc
pca_pipeline/target/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# luigi_gdb_pipeline_demo
An example illustrating how to use Luigi to manage a data science workflow in Greenplum Database (GPDB).
For more info see: http://engineering.pivotal.io/post/luigi-data-science/

## Use Case Description

The use case for this demo is a simple unsupervised approach for identifying anomalous or highly variant user network behavior based on activity patterns. The main idea is to create 24 matrices, one for each hour of the day. The rows of each matrix are indexed by network users and the columns by days, so entry (i,j) of the matrix for hour k counts the number of events from user i during hour k of day j. Creating a separate matrix for each hour accounts for the differing volumes of traffic throughout the day. We then run principal component analysis (PCA) on each of these matrices and examine the large entries of the principal component directions, which correspond to ‘erratic’ users who drive most of the variance in the data. For example, the sample data generator gives user 'a' a spike of extra activity during hour 12 of one day, so that user should stand out in the principal direction for the hour-12 matrix. In later iterations we can extend the pipeline to use PCA for dimension reduction, clustering and other more sophisticated methods.

We break our workflow into five types of Luigi tasks:

1. Initialize the schemas and user-defined functions needed by the pipeline.
2. Create a table counting the number of flows for each user during each hour of the data set.
3. For each of the 24 hours, use the output of step 2 to create a table holding a sparse-matrix representation of that hour's flow-count matrix.
4. Run the MADlib PCA function to compute the principal components of each of the 24 flow-count matrices. The MADlib PCA algorithm runs in parallel across the GPDB segments.
5. For each of the 24 principal directions, identify users with large entries and write a list of these users, along with an outlier score (the magnitude of the entry), to a file for investigation.

## Steps to run

1. Install the Pivotal Greenplum Sandbox Virtual Machine, available for free download from the [Pivotal Network](https://network.pivotal.io/products/pivotal-gpdb#/releases/567/file_groups/337).
   Once the Sandbox is started with VirtualBox or VMware Fusion, starting the GPDB instance is a two-step process:
   1. Log in as gpadmin with the password provided at startup.
   2. Run ./start_all.sh to start the GPDB instance.
2. Create an ssh tunnel to make the GPDB instance available to a psycopg2 connection on localhost at port 9912:
   * $ ssh gpadmin@<VM IP address> -L 9912:127.0.0.1:5432
3. Set environment variables so the scripts can find GPDB:
   * export GPDB_DATABASE=gpadmin
   * export GPDB_USER=gpadmin
   * export GPDB_PASSWORD=<gpadmin password>
   * export GPDB_HOST=localhost
   * export GPDB_PORT=9912

4. Run [write_sample_data_to_db.py](https://github.com/ericwayman/luigi_gdb_pipeline_demo/blob/master/generate_sample_data/write_sample_data_to_db.py) to populate the database with fake data.

5. Run the Luigi pipeline from inside pca_pipeline/:
   1. $ luigid --background
   2. $ python -m luigi --module pipeline PipelineTask

We can view the status of the pipeline and its dependency graph by opening a browser to http://localhost:8082.
After running the pipeline, the output files will be available in pca_pipeline/target/ by default.

--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
name: luigi
channels: !!python/tuple
- !!python/unicode
  'defaults'
dependencies:
- appnope=0.1.0=py35_0
- cached-property=1.3.0=py35_0
- decorator=4.0.10=py35_0
- entrypoints=0.2.2=py35_0
- freetype=2.5.5=1
- h5py=2.6.0=np111py35_2
- hdf5=1.8.17=1
- ipykernel=4.3.1=py35_0
- ipython=5.0.0=py35_0
- ipython_genutils=0.1.0=py35_0
- ipywidgets=4.1.1=py35_0
- jinja2=2.8=py35_1
- jsonschema=2.5.1=py35_0
- jupyter=1.0.0=py35_3
- jupyter_client=4.3.0=py35_0
- jupyter_console=5.0.0=py35_0
- jupyter_core=4.1.0=py35_0
- libpng=1.6.22=0
- lockfile=0.12.2=py35_0
- markupsafe=0.23=py35_2
- mistune=0.7.2=py35_1
- mkl=11.3.3=0
- nbconvert=4.2.0=py35_0
- nbformat=4.0.1=py35_0
- nose=1.3.7=py35_1
- notebook=4.2.2=py35_0
- numpy=1.11.1=py35_0
- openssl=1.0.2h=1
- pandas=0.18.1=np111py35_0
- path.py=8.2.1=py35_0
- pexpect=4.0.1=py35_0
- pickleshare=0.7.3=py35_0
- pip=8.1.2=py35_0
- prompt_toolkit=1.0.3=py35_0
- ptyprocess=0.5.1=py35_0
- pygments=2.1.3=py35_0
- pyparsing=2.1.4=py35_0
- pyqt=4.11.4=py35_4
- python=3.5.2=0
- python-daemon=2.1.1=py35_0
- python-dateutil=2.5.3=py35_0
- python.app=1.2=py35_4
- pytz=2016.6.1=py35_0
- pyzmq=15.4.0=py35_0
- qt=4.8.7=4
- qtconsole=4.2.1=py35_0
- readline=6.2=2
- scikit-learn=0.17.1=np111py35_2
- scipy=0.18.0=np111py35_0
- setuptools=25.1.6=py35_0
- simplegeneric=0.8.1=py35_1
- sip=4.18=py35_0
- six=1.10.0=py35_0
- sqlalchemy=1.1.4=py35_0
- sqlite=3.13.0=0
- terminado=0.6=py35_0
- tk=8.5.18=0
- tornado=4.4.1=py35_0
- traitlets=4.2.2=py35_0
- wcwidth=0.1.7=py35_0
- wheel=0.29.0=py35_0
- xz=5.2.2=0
- zlib=1.2.8=3
- pip:
  - appdirs==1.4.0
  - docutils==0.13.1
  - ipython-genutils==0.1.0
  - jupyter-client==4.3.0
  - jupyter-console==5.0.0
  - jupyter-core==4.1.0
  - keras==1.0.7
  - luigi==2.5.0
  - packaging==16.8
  - prompt-toolkit==1.0.3
  - psycopg2==2.6.2
  - pyyaml==3.11
  - theano==0.9.0.dev2
prefix: //anaconda/envs/luigi

--------------------------------------------------------------------------------
/generate_sample_data/write_sample_data_to_db.py:
--------------------------------------------------------------------------------
#generate table with users and time stamps
import time
import datetime
from random import sample,randint,seed
import psycopg2 as pg
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT
import os

#gpdb connection
connection = pg.connect(
    host=os.getenv("GPDB_HOST"),
    database=os.getenv("GPDB_DATABASE"),
    user=os.getenv("GPDB_USER"),
    password=os.getenv("GPDB_PASSWORD"),
    port = os.getenv("GPDB_PORT"))
connection.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
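# The connection parameters come from the GPDB_* environment variables described
# in step 3 of the README. Autocommit is enabled so that the DROP/CREATE TABLE
# and INSERT statements issued below take effect immediately without an explicit
# connection.commit().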
#parameters 18 | USERS = list('abcdefghijklm') 19 | MAX_DATE = "15/01/2017" 20 | MIN_DATE = "01/01/2017" 21 | SEED = 1 22 | 23 | #helper functions 24 | def unix_time_from_string(date_str,date_format="%d/%m/%Y"): 25 | "converts string of form dd/mm/yyyy to unix time stamp" 26 | ut = int(time.mktime(datetime.datetime.strptime(date_str, date_format).timetuple())) 27 | return ut 28 | 29 | def generate_unix_time_stamps(num_dates,min_date,max_date,date_format="%d/%m/%Y"): 30 | """generate num_dates by hour in min_date,max_date range""" 31 | min_seconds = unix_time_from_string(min_date,date_format) 32 | max_seconds = unix_time_from_string(max_date,date_format) 33 | unix_times = sample(range(min_seconds,max_seconds),num_dates) 34 | return unix_times 35 | 36 | 37 | 38 | def main(): 39 | seed(SEED) 40 | with connection.cursor() as curs: 41 | curs.execute( 42 | """DROP TABLE IF EXISTS sample_data; 43 | CREATE TABLE sample_data 44 | ( 45 | user_name varchar, 46 | start_time bigint 47 | );""" 48 | ) 49 | total_rows=0 50 | print(USERS) 51 | for user in USERS: 52 | NUM_FLOWS = randint(800,2400) 53 | date_list = generate_unix_time_stamps(NUM_FLOWS,min_date=MIN_DATE,max_date=MAX_DATE) 54 | #add spike for user 'a' in hour 12. 55 | if user in ['a']: 56 | spike_volume = 300 57 | spikes = generate_unix_time_stamps(spike_volume,min_date='12/01/2017 12',max_date='12/01/2017 13',date_format="%d/%m/%Y %H") 58 | date_list.extend(spikes) 59 | NUM_FLOWS = NUM_FLOWS+spike_volume 60 | rows = [(user,str(date)) for date in date_list] 61 | total_rows = total_rows+NUM_FLOWS 62 | print("user: {u}, num_flows: {n}".format(u=user,n=NUM_FLOWS)) 63 | records_list_template = ','.join(['%s'] * len(rows)) 64 | insert_query = 'INSERT INTO sample_data (user_name, start_time) VALUES {}'.format(records_list_template) 65 | with connection.cursor() as curs: 66 | curs.execute(insert_query, rows) 67 | print("Total rows added: {}".format(total_rows)) 68 | 69 | if __name__ == "__main__": 70 | main() -------------------------------------------------------------------------------- /pca_pipeline/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ericwayman/luigi_gdb_pipeline_demo/30398d95a6259b547c820ca93cbf1bde711609c2/pca_pipeline/__init__.py -------------------------------------------------------------------------------- /pca_pipeline/luigi.cfg: -------------------------------------------------------------------------------- 1 | [DatabaseConfig] 2 | base_table=sample_data 3 | feature_input_table=sample.hourly_counts 4 | pca_input_base_table=sample.pca_input 5 | pca_output_base_table=sample.pca_output 6 | outlier_base_table=sample.outliers 7 | 8 | [PathConfig] 9 | 10 | [ModelConfig] 11 | user_col=user_name 12 | percentage_val=.99 13 | threshold=.1 14 | -------------------------------------------------------------------------------- /pca_pipeline/pipeline.py: -------------------------------------------------------------------------------- 1 | import os 2 | import luigi 3 | from datetime import date 4 | from psycopg2.extensions import register_adapter 5 | from utils import PSQLConn,create_hourly_counts_table,create_pca_input_table,find_principal_components,extract_large_pca_components 6 | from utils import QuotedIdentifier 7 | from utils import initialize_user_defined_functions, test_for_nulls 8 | 9 | cred = PSQLConn(os.getenv("GPDB_DATABASE"), 10 | os.getenv("GPDB_USER"), 11 | os.getenv("GPDB_PASSWORD"), 12 | os.getenv("GPDB_HOST"), 13 | os.getenv("GPDB_PORT") 14 | ) 
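# cred only stores the GPDB connection settings (read from the same GPDB_*
# environment variables used by generate_sample_data); each task calls
# cred.connect() to open its own autocommit connection.
# register_adapter below makes psycopg2 pass QuotedIdentifier objects straight
# through: QuotedIdentifier.getquoted() (see utils.py) returns the identifier
# only if it matches a whitelist regex, which is how table names are spliced
# safely into the dynamic SQL built with cursor.mogrify().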
15 | 16 | register_adapter(QuotedIdentifier, lambda x: x) 17 | 18 | class DatabaseConfig(luigi.Config): 19 | base_table=luigi.Parameter() 20 | feature_input_table=luigi.Parameter() 21 | pca_input_base_table=luigi.Parameter() 22 | pca_output_base_table=luigi.Parameter() 23 | outlier_base_table=luigi.Parameter() 24 | 25 | class ModelConfig(luigi.Config): 26 | user_col=luigi.Parameter() 27 | percentage_val=luigi.FloatParameter() 28 | threshold = luigi.FloatParameter() 29 | 30 | #some default params 31 | TARGET_PATH=os.path.join(os.path.dirname(__file__),'target/{feature}_{date}'.format( 32 | feature=ModelConfig().user_col, 33 | date=date.today()) 34 | ) 35 | 36 | #config classes 37 | class PathConfig(luigi.Config): 38 | target_path=luigi.Parameter(default=TARGET_PATH) 39 | 40 | def table_names_dict(id): 41 | pca_tables = { 42 | 'hour':id, 43 | 'pca_input':DatabaseConfig().pca_input_base_table+'_{}'.format(id), 44 | 'pca_output':DatabaseConfig().pca_output_base_table+'_{}'.format(id), 45 | 'outlier_output':DatabaseConfig().outlier_base_table+'_{}'.format(id) 46 | } 47 | return pca_tables 48 | 49 | #tasks 50 | class InitializeUserDefinedFunctions(luigi.Task): 51 | """Define user defined functions""" 52 | date = luigi.DateParameter() 53 | 54 | def run(self): 55 | conn = cred.connect() 56 | with conn.cursor() as curs: 57 | curs.execute(initialize_user_defined_functions()) 58 | 59 | with self.output().open('w') as out_file: 60 | out_file.write("successfully created functions") 61 | 62 | def output(self): 63 | return luigi.LocalTarget(os.path.join(TARGET_PATH,"initialize_user_defined_functions")) 64 | 65 | class CreateHourlyCountsTable(luigi.Task): 66 | """ 67 | Task to compute table of hourly record counts for each user on each day 68 | """ 69 | date = luigi.DateParameter() 70 | 71 | def requires(self): 72 | return [InitializeUserDefinedFunctions(date=self.date)] 73 | 74 | def run(self): 75 | conn = cred.connect() 76 | test_for_nulls(conn,table_name=DatabaseConfig().base_table) 77 | create_hourly_counts_table(conn, 78 | input_table=DatabaseConfig().base_table, 79 | output_table=DatabaseConfig().feature_input_table, 80 | user_column=ModelConfig().user_col 81 | ) 82 | with conn.cursor() as curs: 83 | query = curs.mogrify("SELECT * FROM %s LIMIT 1",(QuotedIdentifier(DatabaseConfig().feature_input_table),)) 84 | curs.execute(query) 85 | rows = curs.fetchall() 86 | test_for_nulls(conn,table_name=DatabaseConfig().feature_input_table) 87 | with self.output().open('w') as out_file: 88 | for row in rows: 89 | out_file.write(str(row)) 90 | 91 | def output(self): 92 | return luigi.LocalTarget(os.path.join(TARGET_PATH,'create_hourly_counts_table')) 93 | 94 | class CreatePCAInputTable(luigi.Task): 95 | """ 96 | Task to create a PCA input table for data with given feature_col and (day of week, hour of day) pair from the NumFlowsForFeatureTable 97 | """ 98 | date = luigi.DateParameter() 99 | hour_id = luigi.IntParameter() 100 | 101 | def requires(self): 102 | return [CreateHourlyCountsTable(date=self.date)] 103 | 104 | def run(self): 105 | conn = cred.connect() 106 | output_table = table_names_dict(self.hour_id)['pca_input'] 107 | create_pca_input_table(conn, 108 | input_table=DatabaseConfig().feature_input_table, 109 | output_table=output_table, 110 | user_column=ModelConfig().user_col, 111 | id=self.hour_id 112 | ) 113 | test_for_nulls(conn,table_name=output_table) 114 | with conn.cursor() as curs: 115 | query = curs.mogrify("SELECT * FROM %s LIMIT 1",(QuotedIdentifier(output_table),)) 116 | 
curs.execute(query) 117 | rows = curs.fetchall() 118 | 119 | with self.output().open('w') as out_file: 120 | for row in rows: 121 | out_file.write(str(row)) 122 | 123 | def output(self): 124 | return luigi.LocalTarget(os.path.join(TARGET_PATH,"pca_input_table_for_hour_{}".format(self.hour_id))) 125 | 126 | 127 | class RunPCATask(luigi.Task): 128 | """ 129 | Task to run PCA on tables PCA input tables generated by CreatePCAInputTable task 130 | """ 131 | date = luigi.DateParameter() 132 | hour_id = luigi.IntParameter() 133 | 134 | def requires(self): 135 | return [CreatePCAInputTable(date=self.date,hour_id=self.hour_id)] 136 | 137 | def run(self): 138 | conn = cred.connect() 139 | tables = table_names_dict(self.hour_id) 140 | input_table = tables['pca_input'] 141 | output_table = tables['pca_output'] 142 | find_principal_components(conn, 143 | input_table=input_table, 144 | output_table=output_table, 145 | percentage_val=ModelConfig().percentage_val 146 | ) 147 | test_for_nulls(conn,table_name=output_table) 148 | with conn.cursor() as curs: 149 | query = curs.mogrify("SELECT COUNT(*) FROM %s LIMIT 1",(QuotedIdentifier(output_table),)) 150 | curs.execute(query) 151 | rows = curs.fetchall() 152 | 153 | with self.output().open('w') as out_file: 154 | for row in rows: 155 | out_file.write(str(row)) 156 | 157 | def output(self): 158 | return luigi.LocalTarget(os.path.join(TARGET_PATH,"pca_output_table_for_hour_{}".format(self.hour_id))) 159 | 160 | class ComputeOutlierFeaturesFromPCA(luigi.Task): 161 | """ 162 | Task to compute outlier features by looking at entries in principal components with value above a given threshold 163 | """ 164 | date = luigi.DateParameter() 165 | hour_id = luigi.IntParameter() 166 | 167 | def requires(self): 168 | return [RunPCATask(date=self.date,hour_id=self.hour_id)] 169 | 170 | def run(self): 171 | conn = cred.connect() 172 | tables = table_names_dict(self.hour_id) 173 | pca_input_table = tables['pca_input'] 174 | pca_table = tables['pca_output'] 175 | output_table = tables['outlier_output'] 176 | extract_large_pca_components(conn, 177 | output_table=output_table, 178 | base_feature_table=pca_input_table, 179 | pca_table=pca_table, 180 | user_column=ModelConfig().user_col, 181 | threshold=ModelConfig().threshold 182 | ) 183 | with conn.cursor() as curs: 184 | query = curs.mogrify("SELECT * FROM %s ORDER BY outlier_score DESC LIMIT 100",(QuotedIdentifier(output_table),)) 185 | curs.execute(query) 186 | rows = curs.fetchall() 187 | 188 | with self.output().open('w') as out_file: 189 | for row in rows: 190 | out_file.write(str(row)) 191 | 192 | def output(self): 193 | return luigi.LocalTarget(os.path.join( 194 | TARGET_PATH, 195 | "outliers/pca_outliers_hour_{hour}_threshold_{threshold}".format(hour=self.hour_id,threshold=ModelConfig().threshold) 196 | )) 197 | 198 | #pipeline classes 199 | class PipelineTask(luigi.WrapperTask): 200 | """Wrap up all the tasks for the pipeline into a single task 201 | So we can run this pipeline by calling this dummy task""" 202 | date = luigi.DateParameter(default=date.today()) 203 | def requires(self): 204 | base_tasks = [ 205 | InitializeUserDefinedFunctions(date=self.date), 206 | CreateHourlyCountsTable(date=self.date) 207 | ] 208 | 209 | ids = list(range(0,24)) 210 | 211 | pca_input_tasks = [CreatePCAInputTable(date=self.date,hour_id=id) for id in ids] 212 | pca_output_tasks = [RunPCATask(date=self.date,hour_id=id) for id in ids] 213 | outlier_tasks = [ComputeOutlierFeaturesFromPCA(date=self.date,hour_id=id) for id in ids] 214 | 
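# One CreatePCAInputTable, RunPCATask and ComputeOutlierFeaturesFromPCA task is
# created per hour of the day (hour_id 0-23); requiring them all here lets Luigi
# build and schedule the complete dependency graph from this single wrapper task.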
tasks = base_tasks + pca_input_tasks + pca_output_tasks + outlier_tasks 215 | return tasks 216 | 217 | def run(self): 218 | with self.output().open('w') as out_file: 219 | out_file.write("successly ran pipeline on {}".format(self.date)) 220 | 221 | def output(self): 222 | return luigi.LocalTarget(os.path.join(TARGET_PATH,"dummy")) 223 | -------------------------------------------------------------------------------- /pca_pipeline/sql/functions.sql: -------------------------------------------------------------------------------- 1 | -------TEST FUNCTION------------ 2 | DROP FUNCTION IF EXISTS test(); 3 | CREATE FUNCTION test() 4 | RETURNS text 5 | AS 6 | $BODY$ 7 | BEGIN 8 | return 'success'; 9 | END; 10 | $BODY$ LANGUAGE plpgsql VOLATILE; 11 | --------------------------------- 12 | -- CREATE INPUT TABLE WITH hourly counts for each user 13 | --user_column = user_name 14 | DROP FUNCTION IF EXISTS create_hourly_counts_table(text,text,text); 15 | CREATE OR REPLACE FUNCTION create_hourly_counts_table(input_table text,output_table text, user_column text) 16 | RETURNS VOID 17 | AS 18 | $BODY$ 19 | BEGIN 20 | EXECUTE 'DROP TABLE IF EXISTS '||output_table; 21 | EXECUTE 'CREATE TABLE '||output_table||' AS 22 | SELECT '||user_column||', 23 | date_trunc(''day'',to_timestamp(start_time)) AS day, 24 | EXTRACT(HOUR FROM to_timestamp(start_time)) AS hour_of_day, 25 | COUNT(*) AS hourly_count 26 | FROM '||input_table||' 27 | GROUP BY 1, 2, 3 28 | DISTRIBUTED RANDOMLY'; 29 | END; 30 | $BODY$ LANGUAGE plpgsql VOLATILE; 31 | 32 | --generate a PCA input table for a given hour of the day. 33 | --name_id is to assign the usernames a row number, 34 | DROP FUNCTION IF EXISTS create_pca_input_table(text,text,text,int); 35 | CREATE OR REPLACE FUNCTION create_pca_input_table(input_table text, output_table text,user_column text,hour_of_day int) 36 | RETURNS void 37 | AS 38 | $BODY$ 39 | BEGIN 40 | EXECUTE 'DROP TABLE IF EXISTS '||output_table; 41 | EXECUTE 'CREATE TABLE '||output_table||' AS 42 | ( 43 | WITH a AS 44 | ( 45 | SELECT * FROM '||input_table||' where hour_of_day = '||hour_of_day||' 46 | ), 47 | 48 | b AS 49 | ( 50 | select day, day_id 51 | FROM 52 | ( 53 | select day, rank() over (ORDER BY day) AS day_id 54 | FROM a 55 | group by 1 56 | ) t1 57 | group by 1, 2 58 | ), 59 | 60 | c AS 61 | ( 62 | select '||user_column||', rank() over (ORDER BY '||user_column||') AS name_id 63 | FROM a 64 | group by 1 65 | ) 66 | 67 | select b.day_id, c.name_id, t1.* 68 | FROM a t1 69 | INNER JOIN b ON t1.day = b.day 70 | INNER JOIN c ON t1.'||user_column||' = c.'||user_column||' 71 | ) 72 | DISTRIBUTED RANDOMLY'; 73 | END; 74 | $BODY$ LANGUAGE plpgsql VOLATILE; 75 | 76 | --find principal components 77 | DROP FUNCTION IF EXISTS find_principal_components( 78 | text,text,text,text,text,float8 79 | ); 80 | CREATE OR REPLACE FUNCTION find_principal_components( 81 | input_table text, 82 | output_table text, 83 | percentage_val float8) 84 | RETURNS void AS 85 | $BODY$ 86 | DECLARE 87 | num_time_int integer; 88 | num_users integer; 89 | 90 | BEGIN 91 | 92 | EXECUTE 'select count(distinct(day_id)) FROM '||input_table into num_time_int; 93 | EXECUTE 'select count(distinct(name_id)) FROM '||input_table into num_users; 94 | 95 | EXECUTE 'DROP TABLE IF EXISTS '||output_table; 96 | EXECUTE 'DROP TABLE IF EXISTS '||output_table||'_mean'; 97 | EXECUTE 'SELECT madlib.pca_sparse_train( '''||input_table||''', 98 | '''||output_table||''', 99 | ''day_id'', 100 | ''name_id'', 101 | ''hourly_count'', 102 | '||num_time_int||', 103 | 
'||num_users||', 104 | '||percentage_val||' 105 | )'; 106 | 107 | END; 108 | $BODY$ 109 | LANGUAGE plpgsql VOLATILE STRICT; 110 | 111 | DROP FUNCTION IF EXISTS extract_large_pca_components(text,text,text,text,float); 112 | CREATE OR REPLACE FUNCTION extract_large_pca_components( 113 | output_table text, 114 | base_feature_table text, 115 | pca_table text, 116 | user_column text, 117 | treshold float) 118 | RETURNS VOID 119 | AS 120 | $BODY$ 121 | BEGIN 122 | EXECUTE 'DROP TABLE IF EXISTS '||output_table; 123 | EXECUTE 'CREATE TABLE '||output_table||' AS 124 | SELECT 125 | a.'||user_column||', 126 | b.outlier_score 127 | FROM 128 | ( 129 | SELECT 130 | '||user_column||', 131 | name_id 132 | FROM 133 | '||base_feature_table||' 134 | GROUP BY 1,2 135 | ) a 136 | INNER JOIN 137 | ( 138 | SELECT 139 | featureno, 140 | MAX(pcval) AS outlier_score 141 | FROM 142 | ( 143 | SELECT 144 | featureno, 145 | pcval 146 | FROM 147 | ( 148 | SELECT 149 | row_id, 150 | generate_series(1,array_upper(principal_components,1)) as featureno, 151 | unnest(principal_components) as pcval 152 | FROM '||pca_table||' 153 | ) t1 154 | where pcval > abs('||treshold||') 155 | )t2 156 | GROUP BY 1 157 | ) b 158 | ON (a.name_id=b.featureno) 159 | DISTRIBUTED RANDOMLY'; 160 | END; 161 | $BODY$ LANGUAGE plpgsql VOLATILE; 162 | -------------------------------------------------------------------------------- /pca_pipeline/utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | import psycopg2 as pg 3 | from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT 4 | from psycopg2.extensions import register_adapter 5 | import os 6 | 7 | class NotSqlIdentifierError(Exception): 8 | pass 9 | 10 | valid_pattern = r'^[a-zA-Z0-9_\.\$]*$' 11 | 12 | class QuotedIdentifier(object): 13 | def __init__(self, obj_str): 14 | self.obj_str = obj_str 15 | 16 | def getquoted(self): 17 | if re.match(valid_pattern, self.obj_str): 18 | return self.obj_str 19 | else: 20 | raise NotSqlIdentifierError(repr(self.obj_str)) 21 | 22 | register_adapter(QuotedIdentifier, lambda x: x) 23 | 24 | class PSQLConn(object): 25 | """Stores the connection to psql.""" 26 | def __init__(self, db, user, password, host,port=5432): 27 | self.db = db 28 | self.user = user 29 | self.password = password 30 | self.host = host 31 | self.port = port 32 | 33 | def connect(self): 34 | connection = pg.connect( 35 | host=self.host, 36 | database=self.db, 37 | user=self.user, 38 | password=self.password, 39 | port = self.port) 40 | connection.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT) 41 | return connection 42 | 43 | 44 | #helper function to open functions.sql file 45 | src_path = os.path.dirname(os.path.abspath(__file__)) 46 | sql_path = os.path.join(src_path,'sql/') 47 | 48 | def initialize_user_defined_functions(): 49 | user_defined_functions_file = os.path.join(sql_path,'functions.sql') 50 | with open(user_defined_functions_file) as f: 51 | cmd = f.read() 52 | return cmd 53 | 54 | def create_hourly_counts_table(conn,input_table, output_table, user_column): 55 | with conn.cursor() as curs: 56 | query = curs.mogrify(""" 57 | SELECT create_hourly_counts_table('%s','%s','%s') 58 | """,(QuotedIdentifier(input_table),QuotedIdentifier(output_table),QuotedIdentifier(user_column))) 59 | curs.execute(query) 60 | 61 | def create_pca_input_table(conn,input_table, output_table,user_column,id): 62 | with conn.cursor() as curs: 63 | query = curs.mogrify(""" 64 | SELECT create_pca_input_table('%s','%s','%s',%s) 65 | 
""",(QuotedIdentifier(input_table), 66 | QuotedIdentifier(output_table), 67 | QuotedIdentifier(user_column), 68 | id 69 | ) 70 | ) 71 | curs.execute(query) 72 | 73 | def find_principal_components(conn,input_table,output_table,percentage_val): 74 | with conn.cursor() as curs: 75 | query = curs.mogrify(""" 76 | SELECT find_principal_components('%s','%s','%s') 77 | """,(QuotedIdentifier(input_table), 78 | QuotedIdentifier(output_table), 79 | percentage_val 80 | ) 81 | ) 82 | curs.execute(query) 83 | 84 | def extract_large_pca_components(conn,output_table,base_feature_table,pca_table,user_column,threshold): 85 | with conn.cursor() as curs: 86 | query=curs.mogrify(""" 87 | SELECT extract_large_pca_components('%s','%s','%s','%s',%s) 88 | """,(QuotedIdentifier(output_table), 89 | QuotedIdentifier(base_feature_table), 90 | QuotedIdentifier(pca_table), 91 | QuotedIdentifier(user_column), 92 | threshold 93 | ) 94 | ) 95 | curs.execute(query) 96 | 97 | def test_for_nulls(conn,table_name): 98 | """ 99 | check all columns of table_name to make sure no nulls are present 100 | """ 101 | with conn.cursor() as curs: 102 | pieces = table_name.split('.') 103 | if len(pieces) == 1: 104 | column_query = curs.mogrify(""" 105 | SELECT column_name from information_schema.columns where table_name = '%s' 106 | """,(QuotedIdentifier(pieces[0]),) 107 | ) 108 | else: 109 | column_query = curs.mogrify(""" 110 | SELECT column_name from information_schema.columns where table_name = '%s' AND table_schema = '%s' 111 | """, (QuotedIdentifier(pieces[1]), 112 | QuotedIdentifier(pieces[0]) 113 | ) 114 | ) 115 | curs.execute(column_query) 116 | columns = curs.fetchall() 117 | columns = [c[0] for c in columns] 118 | for col in columns: 119 | query = curs.mogrify(""" 120 | SELECT COUNT(*) - COUNT(%s) FROM %s; 121 | """, 122 | (QuotedIdentifier(col), 123 | QuotedIdentifier(table_name) 124 | ) 125 | ) 126 | curs.execute(query) 127 | result = curs.fetchone() 128 | num_nulls = result[0] 129 | assert num_nulls == 0 130 | --------------------------------------------------------------------------------