├── README.md ├── data └── test_data.dat ├── datapartitioning.py ├── parallel_join_sort.py └── queryprocessor.py /README.md: -------------------------------------------------------------------------------- 1 | # distributed-database 2 | The primary goal of this project is to implement some of the key concepts in distributed and parallel database systems, for example operations such as fragmentation, parallel sort, and range queries. This project was done as part of CSE 512 Distributed and Parallel Database Systems taught by [Mohamed Sarwat](http://faculty.engineering.asu.edu/sarwat/) 3 | 4 | These concepts are built on top of the open source relational database [postgres](https://www.postgresql.org/). I have used [python](https://www.python.org/) for programming and psycopg as the database driver for postgres. You can find a getting started guide for psycopg [here](http://prashant47.github.io/2017/Sep/20/psycopg_postgresql_adapter_for_python.html). 5 | 6 | The project covers 3 main concepts: 7 | 1. Data fragmentation across partitions (sharding). 8 | 2. A query processor that accesses data from the partitioned table. 9 | 3. Parallel sort and parallel join algorithms. 10 | 11 | ## Data Fragmentation 12 | In centralized database systems, all the data resides on a single node, whereas in distributed and parallel database systems the data is partitioned across multiple nodes. Here a ratings table is fragmented in two ways: range partitioning on the rating value and round-robin partitioning. 13 |
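To make the two fragmentation schemes concrete, here is a minimal, self-contained sketch of how a single row would be assigned to a fragment under each scheme. It is illustrative only and is not the project code (datapartitioning.py performs the same assignment with SQL inside postgres); the function names are made up, but the table-name prefixes and the boundary conventions mirror datapartitioning.py.

```python
# Illustrative sketch of the two fragmentation schemes; not the project code.
RANGE_TABLE_PREFIX = 'range_part'    # same prefixes as datapartitioning.py
RROBIN_TABLE_PREFIX = 'rrobin_part'

def range_fragment(rating, numberofpartitions, lower=0.0, upper=5.0):
    """Return the range partition that owns a rating in [lower, upper]."""
    interval = (upper - lower) / numberofpartitions
    for i in range(numberofpartitions):
        lo = lower + i * interval
        hi = lo + interval
        # The first partition is closed on both ends, the rest are half-open (lo, hi].
        if (i == 0 and lo <= rating <= hi) or (i > 0 and lo < rating <= hi):
            return RANGE_TABLE_PREFIX + str(i)
    raise ValueError('rating outside [%s, %s]' % (lower, upper))

def rrobin_fragment(rownumber, numberofpartitions):
    """Return the round-robin partition for the n-th loaded row (1-based)."""
    return RROBIN_TABLE_PREFIX + str((rownumber - 1) % numberofpartitions)

if __name__ == '__main__':
    print(range_fragment(3.5, 5))    # -> range_part3, since 3.5 falls in (3, 4]
    print(rrobin_fragment(6, 5))     # -> rrobin_part0, rows 1, 6, 11, ... share a partition
```

Range fragmentation keeps nearby values together, so a range query only has to touch a few partitions; round-robin spreads rows evenly, which balances load but gives queries no pruning opportunity.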
14 | ## Query Processor 15 | This step involves building a simplified query processor that accesses data from the partitioned tables. As part of this, two queries were implemented: RangeQuery() and PointQuery(). 16 | 17 | RangeQuery() takes a range of attribute values as input and returns the tuples that fall within the given range from the fragmented partitions created in the first step. 18 | 19 | PointQuery() takes a specific attribute value as input and returns all the tuples having that value from the fragmented partitions.
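To illustrate the routing idea, the sketch below shows how a query processor can decide which range partitions to scan for a given range or point, given each partition's boundaries. It is a hypothetical helper, not the code in queryprocessor.py (which looks the boundaries up in a metadata table); for round-robin partitions there is nothing to prune, so queryprocessor.py simply scans every round-robin partition.

```python
# Illustrative helper (not the code in queryprocessor.py): choose which range
# partitions can contain matching tuples for a range or point query.

def partitions_for_range(bounds, qmin, qmax):
    """bounds is a list of (lo, hi) per partition; return indices overlapping [qmin, qmax]."""
    return [i for i, (lo, hi) in enumerate(bounds) if hi >= qmin and lo <= qmax]

def partition_for_point(bounds, value):
    """A point query needs only the first partition whose range covers the value."""
    hits = partitions_for_range(bounds, value, value)
    return hits[0] if hits else None

if __name__ == '__main__':
    # five partitions over the rating domain [0, 5]
    bounds = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0)]
    print(partitions_for_range(bounds, 2.5, 3.2))   # -> [2, 3]
    print(partition_for_point(bounds, 3.5))         # -> 3
```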
20 | 21 | 22 | ## Parallel Sort & Join 23 | This task involves implementing generic parallel sort and parallel join algorithms. Both follow the same pattern: range-partition the input across a fixed number of worker threads, let each thread sort or join its own fragment, and then merge the per-fragment results into an output table. 24 | 25 | 26 | 27 |
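The sketch below shows the parallel sort pattern on an in-memory list: range-partition the values, sort each bucket in its own thread, then concatenate the buckets in order. It is illustrative only; parallel_join_sort.py applies the same steps to postgres tables (one temporary table per thread, each filled with an ORDER BY), and the parallel join works analogously by joining matching range partitions pairwise.

```python
# Illustrative in-memory version of the range-partitioned parallel sort; the
# project does the same with postgres tables and a SQL ORDER BY per partition.
import threading

TOTAL_THREADS = 5   # same number of workers as parallel_join_sort.py

def parallel_sort(values):
    lo, hi = min(values), max(values)
    interval = (hi - lo) / float(TOTAL_THREADS)

    # Step 1: range-partition the input into TOTAL_THREADS contiguous value ranges.
    buckets = [[] for _ in range(TOTAL_THREADS)]
    for v in values:
        i = min(int((v - lo) / interval) if interval else 0, TOTAL_THREADS - 1)
        buckets[i].append(v)

    # Step 2: sort every bucket in its own thread.
    threads = [threading.Thread(target=bucket.sort) for bucket in buckets]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Step 3: the buckets hold disjoint value ranges, so concatenating them in
    # bucket order yields the fully sorted result.
    return [v for bucket in buckets for v in bucket]

if __name__ == '__main__':
    print(parallel_sort([3.5, 0.5, 4.0, 2.5, 1.0, 5.0, 2.0]))
    # -> [0.5, 1.0, 2.0, 2.5, 3.5, 4.0, 5.0]
```

In the project itself the heavy lifting per partition is done by postgres, so the Python threads mostly overlap database calls rather than CPU-bound sorting.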
28 | ## Contribution 29 | In case you like this utility or have fun working with this project, feel free to contribute. For contributing you just need a working knowledge of python and postgres, and a bit about distributed database concepts. 30 | 31 | Some initial ideas would be adding a few more queries to the query processor. 32 | 33 | 34 | ## Issues 35 | 36 | If you find any issue, bug, error or any unhandled exception, feel free to [report one](https://github.com/Prashant47/distributed-database/issues/new) 37 | -------------------------------------------------------------------------------- /data/test_data.dat: -------------------------------------------------------------------------------- 1 | 1::122::5::838985046 2 | 1::185::4.5::838983525 3 | 1::231::4::838983392 4 | 1::292::3.5::838983421 5 | 1::316::3::838983392 6 | 1::329::2.5::838983392 7 | 1::355::2::838984474 8 | 1::356::1.5::838983653 9 | 1::362::1::838984885 10 | 1::364::0.5::838983707 11 | 1::370::0::838984596 12 | 1::377::3.5::838983834 13 | 1::420::5::838983834 14 | 1::466::4::838984679 15 | 1::480::5::838983653 16 | 1::520::2.5::838984679 17 | 1::539::5::838984068 18 | 1::586::3.5::838984068 19 | 1::588::5::838983339 20 | 1::589::1.5::838983778 -------------------------------------------------------------------------------- /datapartitioning.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2.7 2 | # 3 | # This project was done as part of CSE 512 Fall 2017 4 | # 5 | 6 | __author__ = "Prashant Gonarkar" 7 | __version__ = "v0.1" 8 | __email__ = "pgonarka@asu.edu" 9 | 10 | import psycopg2 11 | import csv 12 | import os 13 | 14 | DATABASE_NAME = 'dds_assgn2' 15 | USER_ID_COLNAME = 'userId' 16 | MOVIE_ID_COLNAME = 'movieId' 17 | RATING_COLNAME = 'rating' 18 | RANGE_TABLE_PREFIX = 'range_part' 19 | RROBIN_TABLE_PREFIX = 'rrobin_part' 20 | RATINGS_TABLE = 'ratings' 21 | 22 | 23 | def getopenconnection(user='postgres', password='1234', dbname='dds_assgn1'): 24 | return psycopg2.connect("dbname='" + dbname + "' user='" + user + "' host='localhost' password='" + password + "'") 25 | 26 | 27 | def loadratings(ratingstablename, ratingsfilepath, openconnection): 28 | 29 | cur = openconnection.cursor() 30 | createtable(ratingstablename,cur) 31 | 32 | # For inserting data the COPY method of postgres is used, as it is the fastest among the available approaches. 33 | # But the input data uses a multi-character delimiter (::), which postgres COPY doesn't support, 34 | # hence the input data is preprocessed and converted into tab-separated data 35 | 36 | 37 | # Preprocessing the data before importing into postgres 38 | try: 39 | with open(ratingsfilepath) as finput, open('/tmp/_output_zxyvk.csv','w') as foutput: 40 | csv_output = csv.writer(foutput,delimiter='\t') 41 | for lines in finput: 42 | line = lines.rstrip('\n') 43 | splits = line.split("::") 44 | #print("value of splits: ",splits) 45 | csv_output.writerow(splits[0:3]) 46 | except Exception as ex: 47 | print("File processing error: ",ex) 48 | 49 | with open('/tmp/_output_zxyvk.csv') as datafile: 50 | try: 51 | cur.copy_from(datafile,ratingstablename ) 52 | openconnection.commit() 53 | cur.close() 54 | except Exception as ex: 55 | print("Failed to copy file in database: ",ex) 56 | 57 | # removing the temporary intermediate file created during data preprocessing 58 | os.remove('/tmp/_output_zxyvk.csv') 59 | return 60 | 61 | def rangepartition(ratingstablename, numberofpartitions, openconnection): 62 | 63 | # Normally we would find max and min to get the range, but here the range is already given.
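    # For example, with the [0, 5] rating range below and numberofpartitions = 5, the
    # partitioninterval is 5.0 / 5 = 1.0, so the partition tables cover [0,1], (1,2],
    # (2,3], (3,4] and (4,5]; only the first partition is closed on its lower bound,
    # so that ratings equal to 0 are not dropped.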
64 | # As given range is given as [0-5] 65 | ratinglowerbound = 0.0 66 | ratingupperbound = 5.0 67 | partitioninterval = abs(ratingupperbound-ratinglowerbound) / numberofpartitions 68 | 69 | cur = openconnection.cursor() 70 | for i in range( 0, numberofpartitions ): 71 | partitiontablename = RANGE_TABLE_PREFIX + repr(i) 72 | createtable(partitiontablename,cur) 73 | 74 | # upper and lower bounds for created partition table 75 | lowerbound = i * partitioninterval 76 | upperbound = lowerbound + partitioninterval 77 | 78 | # inserting values according to range in table 79 | if lowerbound == ratinglowerbound: 80 | query = " INSERT INTO {0} SELECT * FROM {1} WHERE {2} >= {3} and {2} <= {4}".format( partitiontablename, 81 | ratingstablename, 82 | RATING_COLNAME, 83 | lowerbound, 84 | upperbound ) 85 | else: 86 | query = " INSERT INTO {0} SELECT * FROM {1} WHERE {2} > {3} and {2} <= {4}".format( partitiontablename, 87 | ratingstablename, 88 | RATING_COLNAME, 89 | lowerbound, 90 | upperbound ) 91 | cur.execute(query) 92 | openconnection.commit() 93 | print("Created partition table: ",partitiontablename) 94 | pass 95 | 96 | def roundrobinpartition(ratingstablename, numberofpartitions, openconnection): 97 | 98 | # create round robin partitions 99 | modvalue = 0 100 | cur = openconnection.cursor() 101 | for i in range( 0, numberofpartitions ): 102 | partitiontablename = RROBIN_TABLE_PREFIX + repr(i) 103 | createtable(partitiontablename,cur) 104 | print("partitiontablename: ",partitiontablename) 105 | 106 | # modvalue acts as partition selector in query 107 | # e.g for modvalue 1 with no of partitions 4, the query will select rows number 1, 5, 9, .. 108 | 109 | if (i != (numberofpartitions - 1)): 110 | modvalue = i + 1; 111 | else: 112 | modvalue = 0; 113 | print("mod value ",modvalue) 114 | 115 | try: 116 | query = "INSERT INTO {0} " \ 117 | "SELECT {1},{2},{3} " \ 118 | "FROM (SELECT ROW_NUMBER() OVER() as row_number,* FROM {4}) as foo " \ 119 | "WHERE MOD(row_number,{5}) = cast ('{6}' as bigint) ".format(partitiontablename,USER_ID_COLNAME, MOVIE_ID_COLNAME, 120 | RATING_COLNAME, ratingstablename, numberofpartitions, modvalue) 121 | 122 | cur.execute(query) 123 | openconnection.commit() 124 | except Exception as ex: 125 | print(ex) 126 | pass 127 | 128 | 129 | def roundrobininsert(ratingstablename, userid, itemid, rating, openconnection): 130 | 131 | # Round Robin insert approach: start comparing count of adjacent two tables 132 | # if all tables have same count then insert into first table, and if next table 133 | # has less count than previous table then insert into next table 134 | # 135 | 136 | cur = openconnection.cursor() 137 | # calculate number of partitions 138 | partitioncount = tablecount(cur, RROBIN_TABLE_PREFIX) 139 | print ('partition count ', partitioncount) 140 | 141 | partitiontoinsert = 0 142 | previouscount = countrowsintable(cur, RROBIN_TABLE_PREFIX + repr(0) ) 143 | 144 | for i in range(1,partitioncount): 145 | nextcount = countrowsintable(cur, RROBIN_TABLE_PREFIX + repr(i) ) 146 | if ( nextcount < previouscount ): 147 | partitiontoinsert = i 148 | break 149 | 150 | # inserting in ratings table 151 | query = " INSERT INTO {0} VALUES ({1}, {2}, {3})".format( ratingstablename, 152 | userid, itemid, rating ) 153 | cur.execute(query) 154 | 155 | # inserting in appropriate round robin partition 156 | query = " INSERT INTO {0} VALUES ({1}, {2}, {3})".format( RROBIN_TABLE_PREFIX+repr(partitiontoinsert), 157 | userid, itemid, rating ) 158 | cur.execute(query) 159 | 
openconnection.commit() 160 | print("Inserted value in partition: ",RROBIN_TABLE_PREFIX+repr(partitiontoinsert)) 161 | pass 162 | 163 | 164 | def rangeinsert(ratingstablename, userid, itemid, rating, openconnection): 165 | 166 | cur = openconnection.cursor() 167 | # calculate number of partitions 168 | partitioncount = tablecount(cur, RANGE_TABLE_PREFIX) 169 | print ('partition count ', partitioncount ) 170 | 171 | # As given range is given as [0-5] 172 | ratinglowerbound = 0.0 173 | ratingupperbound = 5.0 174 | partitioninterval = abs(ratingupperbound-ratinglowerbound) / partitioncount 175 | 176 | for i in range( 0, partitioncount): 177 | # upper and lower bounds for created partition table 178 | lowerbound = i * partitioninterval 179 | upperbound = lowerbound + partitioninterval 180 | 181 | if lowerbound == ratinglowerbound: 182 | if (rating >= lowerbound) and (rating <= upperbound): 183 | break 184 | elif (rating > lowerbound) and (rating <= upperbound): 185 | break 186 | partitiontoinsert = i 187 | 188 | # inserting in ratings table 189 | query = " INSERT INTO {0} VALUES ({1}, {2}, {3})".format( ratingstablename, 190 | userid, itemid, rating ) 191 | cur.execute(query) 192 | 193 | # inserting in appropriate range partition table 194 | query = " INSERT INTO {0} VALUES ({1}, {2}, {3})".format( RANGE_TABLE_PREFIX + repr(partitiontoinsert), 195 | userid, itemid, rating ) 196 | cur.execute(query) 197 | openconnection.commit() 198 | print("Inserted value in partition: ", RANGE_TABLE_PREFIX + repr(partitiontoinsert)) 199 | pass 200 | 201 | def deletepartitionsandexit(openconnection): 202 | 203 | cur = openconnection.cursor() 204 | # delete range paritions 205 | partitioncount = tablecount(cur, RANGE_TABLE_PREFIX) 206 | print ('partition count %s', partitioncount) 207 | for i in range(0, partitioncount): 208 | partitionname = RANGE_TABLE_PREFIX + repr(i) 209 | cur.execute('DROP TABLE IF EXISTS {0} CASCADE'.format(partitionname)) 210 | openconnection.commit() 211 | 212 | # delete round robin paritions 213 | partitioncount = tablecount(cur, RROBIN_TABLE_PREFIX) 214 | print ('partition count %s', partitioncount) 215 | for i in range(0, partitioncount): 216 | partitionname = RROBIN_TABLE_PREFIX + repr(i) 217 | cur.execute('DROP TABLE IF EXISTS {0} CASCADE'.format(partitionname)) 218 | openconnection.commit() 219 | 220 | # delete ratings parition 221 | cur.execute('DROP TABLE IF EXISTS {0} CASCADE'.format(RATINGS_TABLE)) 222 | openconnection.commit() 223 | 224 | def tablecount(cur, tableprefix): 225 | query = "SELECT COUNT(table_name) FROM information_schema.tables WHERE table_schema = 'public' AND table_name LIKE '{0}%';".format(tableprefix) 226 | cur.execute(query) 227 | partitioncount = int(cur.fetchone()[0]) 228 | return partitioncount 229 | 230 | def countrowsintable( cur, tablename): 231 | 232 | query = "SELECT count(*) FROM {0}".format(tablename) 233 | cur.execute(query) 234 | count = int(cur.fetchone()[0]) 235 | return count 236 | 237 | def createtable(tablename,cursor): 238 | 239 | try: 240 | query = "CREATE TABLE {0} ( {1} integer, {2} integer,{3} real);".format(tablename, 241 | USER_ID_COLNAME,MOVIE_ID_COLNAME,RATING_COLNAME) 242 | cursor.execute(query) 243 | except Exception as ex: 244 | print("Failed to create table: ",ex) 245 | print("Created table: ",tablename) 246 | 247 | 248 | def create_db(dbname): 249 | """ 250 | We create a DB by connecting to the default user and database of Postgres 251 | The function first checks if an existing database exists for a given name, else creates 
it. 252 | :return:None 253 | """ 254 | # Connect to the default database 255 | con = getopenconnection(dbname='postgres') 256 | con.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT) 257 | cur = con.cursor() 258 | 259 | # Check if an existing database with the same name exists 260 | cur.execute('SELECT COUNT(*) FROM pg_catalog.pg_database WHERE datname=\'%s\'' % (dbname,)) 261 | count = cur.fetchone()[0] 262 | if count == 0: 263 | cur.execute('CREATE DATABASE %s' % (dbname,)) # Create the database 264 | else: 265 | print 'A database named {0} already exists'.format(dbname) 266 | 267 | # Clean up 268 | cur.close() 269 | con.close() 270 | 271 | 272 | # Middleware 273 | def before_db_creation_middleware(): 274 | # Use it if you want to 275 | pass 276 | 277 | 278 | def after_db_creation_middleware(databasename): 279 | # Use it if you want to 280 | pass 281 | 282 | 283 | def before_test_script_starts_middleware(openconnection, databasename): 284 | # Use it if you want to 285 | pass 286 | 287 | 288 | def after_test_script_ends_middleware(openconnection, databasename): 289 | # Use it if you want to 290 | pass 291 | 292 | 293 | if __name__ == '__main__': 294 | try: 295 | 296 | # Use this function to do any set up before creating the DB, if any 297 | before_db_creation_middleware() 298 | 299 | create_db(DATABASE_NAME) 300 | 301 | # Use this function to do any set up after creating the DB, if any 302 | after_db_creation_middleware(DATABASE_NAME) 303 | 304 | with getopenconnection() as con: 305 | # Use this function to do any set up before I starting calling your functions to test, if you want to 306 | before_test_script_starts_middleware(con, DATABASE_NAME) 307 | 308 | # Here is where I will start calling your functions to test them. For example, 309 | loadratings('ratings.dat', con) 310 | # ################################################################################### 311 | # Anything in this area will not be executed as I will call your functions directly 312 | # so please add whatever code you want to add in main, in the middleware functions provided "only" 313 | # ################################################################################### 314 | 315 | # Use this function to do any set up after I finish testing, if you want to 316 | after_test_script_ends_middleware(con, DATABASE_NAME) 317 | 318 | except Exception as detail: 319 | print "OOPS! This is the error ==> ", detail 320 | -------------------------------------------------------------------------------- /parallel_join_sort.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2.7 2 | # 3 | # This project done as part of CSE 512 Fall 2017 4 | # 5 | __author__ = "Prashant Gonarkar" 6 | __version__ = "v0.1" 7 | __email__ = "pgonarka@asu.edu" 8 | 9 | import psycopg2 10 | import os 11 | import sys 12 | import threading 13 | 14 | TOTAL_THREADS = 5 15 | RANGE_PARTITION = "rangeparition" 16 | JOIN_RANGE_PARTITION = "joinrangepartition" 17 | TABLE1_RANGE_PARTITION = "table1_rangeparition" 18 | TABLE2_RANGE_PARTITION = "table2_rangeparition" 19 | 20 | ##################### This needs to changed based on what kind of table we want to sort. 
################## 21 | ##################### To know how to change this, see Assignment 3 Instructions carefully ################# 22 | FIRST_TABLE_NAME = 'table1' 23 | SECOND_TABLE_NAME = 'table2' 24 | SORT_COLUMN_NAME_FIRST_TABLE = 'column1' 25 | SORT_COLUMN_NAME_SECOND_TABLE = 'column2' 26 | JOIN_COLUMN_NAME_FIRST_TABLE = 'column1' 27 | JOIN_COLUMN_NAME_SECOND_TABLE = 'column2' 28 | ########################################################################################################## 29 | 30 | 31 | # Donot close the connection inside this file i.e. do not perform openconnection.close() 32 | def ParallelSort (InputTable, SortingColumnName, OutputTable, openconnection): 33 | 34 | cur = openconnection.cursor() 35 | 36 | #STEP-1 Range Partition 37 | # Find min in SortingColumn 38 | query = "select min({0}) from {1} ".format(SortingColumnName, InputTable) 39 | cur.execute(query) 40 | lowerbound = cur.fetchone()[0] 41 | 42 | # Find max in SortingColumn 43 | query = "select max({0}) from {1} ".format(SortingColumnName, InputTable) 44 | cur.execute(query) 45 | upperbound = cur.fetchone()[0] 46 | 47 | partitioninterval = abs(upperbound - lowerbound) / float(TOTAL_THREADS) 48 | 49 | #print(upperbound,lowerbound,partitioninterval) 50 | 51 | # Creating tables for range partition step 52 | # There would be one table per thread to work upon 53 | for i in range(TOTAL_THREADS): 54 | outputtablename = RANGE_PARTITION + repr(i); 55 | copytable(InputTable, outputtablename , cur) 56 | 57 | #STEP-2 Parallel sorting 58 | 59 | # Creating output table 60 | copytable(InputTable, OutputTable, cur) 61 | 62 | # thread pool for performing parallel sort 63 | threadspool = range(TOTAL_THREADS); 64 | 65 | for i in range(TOTAL_THREADS): 66 | if i == 0: 67 | start = lowerbound 68 | end = lowerbound + partitioninterval 69 | else: 70 | start = end 71 | end = end + partitioninterval 72 | rangetable = RANGE_PARTITION + repr(i) 73 | #print(i,lowerbound,upperbound) 74 | threadspool[i] = threading.Thread(target = parallelsorting, args = (InputTable, rangetable, 75 | SortingColumnName, start, end, openconnection)) 76 | 77 | threadspool[i].start() 78 | 79 | # wait to finish all threads 80 | for i in range(TOTAL_THREADS): 81 | threadspool[i].join() 82 | 83 | # Combine the result in output table 84 | for i in range(TOTAL_THREADS): 85 | tablename = RANGE_PARTITION + repr(i) 86 | query = "INSERT INTO {0} SELECT * FROM {1}".format( OutputTable, tablename) 87 | cur.execute(query) 88 | 89 | # delete all temp stuff 90 | for i in range(TOTAL_THREADS): 91 | tablename = RANGE_PARTITION + repr(i) 92 | cur.execute('DROP TABLE IF EXISTS {0} CASCADE'.format(tablename)) 93 | 94 | openconnection.commit() 95 | 96 | 97 | 98 | def ParallelJoin (InputTable1, InputTable2, Table1JoinColumn, Table2JoinColumn, OutputTable, openconnection): 99 | 100 | cur = openconnection.cursor() 101 | 102 | #STEP-1 RangePartition 103 | 104 | # Find min and max in input table 1 105 | query = "select min({0}) from {1} ".format(Table1JoinColumn, InputTable1) 106 | cur.execute(query) 107 | table1min = cur.fetchone()[0] 108 | 109 | query = "select max({0}) from {1} ".format(Table1JoinColumn, InputTable1) 110 | cur.execute(query) 111 | table1max = cur.fetchone()[0] 112 | 113 | # Find max and max of input table 2 114 | query = "select min({0}) from {1} ".format(Table2JoinColumn, InputTable2) 115 | cur.execute(query) 116 | table2min = cur.fetchone()[0] 117 | 118 | query = "select max({0}) from {1} ".format(Table2JoinColumn, InputTable2) 119 | cur.execute(query) 120 | 
table2max = cur.fetchone()[0] 121 | 122 | 123 | allmin = min(table1min,table2min) 124 | allmax = max(table1max,table2max) 125 | 126 | partitioninterval = abs(allmax - allmin) / float(TOTAL_THREADS) 127 | #print(table1min,table1max,table2min,table2max,partitioninterval) 128 | rangepartitioning(InputTable1, Table1JoinColumn, partitioninterval, allmin, allmax, TABLE1_RANGE_PARTITION , cur) 129 | rangepartitioning(InputTable2, Table2JoinColumn, partitioninterval, allmin, allmax, TABLE2_RANGE_PARTITION, cur) 130 | 131 | 132 | #STEP-2 Parallel Join 133 | 134 | #Create temp range join tables for each thread 135 | 136 | for i in range(TOTAL_THREADS): 137 | outputtablename = JOIN_RANGE_PARTITION + repr(i) 138 | createjoinrangetable(InputTable1, InputTable2, outputtablename, cur) 139 | 140 | # thread pool for performing parallel sort 141 | threadspool = range(TOTAL_THREADS); 142 | 143 | for i in range(TOTAL_THREADS): 144 | inputtable1 = TABLE1_RANGE_PARTITION + repr(i) 145 | inputtable2 = TABLE2_RANGE_PARTITION + repr(i) 146 | outputtable = JOIN_RANGE_PARTITION + repr(i) 147 | 148 | threadspool[i] = threading.Thread(target = paralleljoin, args = (inputtable1, inputtable2, Table1JoinColumn, Table2JoinColumn, 149 | outputtable, openconnection)) 150 | 151 | threadspool[i].start() 152 | 153 | # wait to finish all threads 154 | for i in range(TOTAL_THREADS): 155 | threadspool[i].join() 156 | 157 | # Create output table 158 | createjoinrangetable(InputTable1, InputTable2, OutputTable, cur) 159 | 160 | # insert all results in the output table 161 | for i in range(TOTAL_THREADS): 162 | tablename = JOIN_RANGE_PARTITION + repr(i) 163 | query = "INSERT INTO {0} SELECT * FROM {1}".format(OutputTable,tablename) 164 | cur.execute(query) 165 | 166 | # Delete all intermediate partitions 167 | for i in range(TOTAL_THREADS): 168 | table1 = TABLE1_RANGE_PARTITION + repr(i) 169 | table2 = TABLE2_RANGE_PARTITION + repr(i) 170 | table3 = JOIN_RANGE_PARTITION + repr(i) 171 | cur.execute('DROP TABLE IF EXISTS {0} CASCADE'.format(table1)) 172 | cur.execute('DROP TABLE IF EXISTS {0} CASCADE'.format(table2)) 173 | cur.execute('DROP TABLE IF EXISTS {0} CASCADE'.format(table3)) 174 | 175 | openconnection.commit() 176 | 177 | 178 | ####Support functions##### 179 | 180 | def copytable(sourcetable,destinationtable,cur): 181 | 182 | query = "CREATE TABLE {0} AS SELECT * FROM {1} WHERE 1=2".format(destinationtable,sourcetable) 183 | cur.execute(query); 184 | 185 | def parallelsorting(InputTable, rangetable, SortingColumnName, lowerbound, upperbound, openconnection): 186 | 187 | cur = openconnection.cursor(); 188 | if rangetable == 'rangeparition0': 189 | query = "INSERT INTO {0} SELECT * FROM {1} WHERE {2} >= {3} AND {2} <= {4} ORDER BY {2} ASC".format(rangetable,InputTable, 190 | SortingColumnName,lowerbound,upperbound) 191 | else: 192 | query = "INSERT INTO {0} SELECT * FROM {1} WHERE {2} > {3} AND {2} <= {4} ORDER BY {2} ASC".format(rangetable,InputTable, 193 | SortingColumnName,lowerbound,upperbound) 194 | cur.execute(query) 195 | 196 | def rangepartitioning(inputtable, tablejoincolumn, partitioninterval, allmin, allmax, tableprefix, cur): 197 | 198 | for i in range(TOTAL_THREADS): 199 | tablename = tableprefix + repr(i) 200 | 201 | if i == 0: 202 | lowerbound = allmin 203 | upperbound = lowerbound + partitioninterval 204 | #print(tablename,lowerbound,upperbound) 205 | query = "CREATE TABLE {0} AS SELECT * FROM {1} WHERE {2} >= {3} AND {2} <= {4};".format(tablename,inputtable, 206 | tablejoincolumn,lowerbound,upperbound) 207 | 
else: 208 | lowerbound = upperbound 209 | upperbound = lowerbound + partitioninterval 210 | #print(tablename,lowerbound,upperbound) 211 | query = "CREATE TABLE {0} AS SELECT * FROM {1} WHERE {2} > {3} AND {2} <= {4};".format(tablename,inputtable, 212 | tablejoincolumn,lowerbound,upperbound) 213 | 214 | cur.execute(query) 215 | 216 | def createjoinrangetable(inputtable1, inputtable2, outputtablename, cur): 217 | query = "CREATE TABLE {0} AS SELECT * FROM {1},{2} WHERE 1=2".format(outputtablename, inputtable1, inputtable2) 218 | cur.execute(query); 219 | 220 | def paralleljoin(inputtable1, inputtable2, Table1JoinColumn, Table2JoinColumn, outputtable, openconnection): 221 | cur = openconnection.cursor() 222 | query = "insert into {0} select * from {1} INNER JOIN {2} ON {1}.{3} = {2}.{4}".format(outputtable, 223 | inputtable1,inputtable2,Table1JoinColumn,Table2JoinColumn) 224 | cur.execute(query) 225 | 226 | 227 | ################### DO NOT CHANGE ANYTHING BELOW THIS ############################# 228 | 229 | 230 | # Donot change this function 231 | def getOpenConnection(user='postgres', password='1234', dbname='ddsassignment3'): 232 | return psycopg2.connect("dbname='" + dbname + "' user='" + user + "' host='localhost' password='" + password + "'") 233 | 234 | # Donot change this function 235 | def createDB(dbname='ddsassignment3'): 236 | """ 237 | We create a DB by connecting to the default user and database of Postgres 238 | The function first checks if an existing database exists for a given name, else creates it. 239 | :return:None 240 | """ 241 | # Connect to the default database 242 | con = getOpenConnection(dbname='postgres') 243 | con.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT) 244 | cur = con.cursor() 245 | 246 | # Check if an existing database with the same name exists 247 | cur.execute('SELECT COUNT(*) FROM pg_catalog.pg_database WHERE datname=\'%s\'' % (dbname,)) 248 | count = cur.fetchone()[0] 249 | if count == 0: 250 | cur.execute('CREATE DATABASE %s' % (dbname,)) # Create the database 251 | else: 252 | print 'A database named {0} already exists'.format(dbname) 253 | 254 | # Clean up 255 | cur.close() 256 | con.commit() 257 | con.close() 258 | 259 | # Donot change this function 260 | def deleteTables(ratingstablename, openconnection): 261 | try: 262 | cursor = openconnection.cursor() 263 | if ratingstablename.upper() == 'ALL': 264 | cursor.execute("SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'") 265 | tables = cursor.fetchall() 266 | for table_name in tables: 267 | cursor.execute('DROP TABLE %s CASCADE' % (table_name[0])) 268 | else: 269 | cursor.execute('DROP TABLE %s CASCADE' % (ratingstablename)) 270 | openconnection.commit() 271 | except psycopg2.DatabaseError, e: 272 | if openconnection: 273 | openconnection.rollback() 274 | print 'Error %s' % e 275 | sys.exit(1) 276 | except IOError, e: 277 | if openconnection: 278 | openconnection.rollback() 279 | print 'Error %s' % e 280 | sys.exit(1) 281 | finally: 282 | if cursor: 283 | cursor.close() 284 | 285 | # Donot change this function 286 | def saveTable(ratingstablename, fileName, openconnection): 287 | try: 288 | cursor = openconnection.cursor() 289 | cursor.execute("Select * from %s" %(ratingstablename)) 290 | data = cursor.fetchall() 291 | openFile = open(fileName, "w") 292 | for row in data: 293 | for d in row: 294 | openFile.write(`d`+",") 295 | openFile.write('\n') 296 | openFile.close() 297 | except psycopg2.DatabaseError, e: 298 | if openconnection: 299 | 
openconnection.rollback() 300 | print 'Error %s' % e 301 | sys.exit(1) 302 | except IOError, e: 303 | if openconnection: 304 | openconnection.rollback() 305 | print 'Error %s' % e 306 | sys.exit(1) 307 | finally: 308 | if cursor: 309 | cursor.close() 310 | 311 | if __name__ == '__main__': 312 | try: 313 | # Creating Database ddsassignment3 314 | print "Creating Database named as ddsassignment3" 315 | createDB(); 316 | 317 | # Getting connection to the database 318 | print "Getting connection from the ddsassignment3 database" 319 | con = getOpenConnection(); 320 | 321 | # Calling ParallelSort 322 | print "Performing Parallel Sort" 323 | ParallelSort(FIRST_TABLE_NAME, SORT_COLUMN_NAME_FIRST_TABLE, 'parallelSortOutputTable', con); 324 | 325 | # Calling ParallelJoin 326 | print "Performing Parallel Join" 327 | ParallelJoin(FIRST_TABLE_NAME, SECOND_TABLE_NAME, JOIN_COLUMN_NAME_FIRST_TABLE, JOIN_COLUMN_NAME_SECOND_TABLE, 'parallelJoinOutputTable', con); 328 | 329 | # Saving parallelSortOutputTable and parallelJoinOutputTable on two files 330 | saveTable('parallelSortOutputTable', 'parallelSortOutputTable.txt', con); 331 | saveTable('parallelJoinOutputTable', 'parallelJoinOutputTable.txt', con); 332 | 333 | # Deleting parallelSortOutputTable and parallelJoinOutputTable 334 | deleteTables('parallelSortOutputTable', con); 335 | deleteTables('parallelJoinOutputTable', con); 336 | 337 | if con: 338 | con.close() 339 | 340 | except Exception as detail: 341 | print "Something bad has happened!!! This is the error ==> ", detail 342 | -------------------------------------------------------------------------------- /queryprocessor.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python2.7 2 | # 3 | # This project done as part of CSE 512 Fall 2017 4 | # 5 | 6 | __author__ = "Prashant Gonarkar" 7 | __version__ = "v0.1" 8 | __email__ = "pgonarka@asu.edu" 9 | 10 | import psycopg2 11 | import os 12 | import sys 13 | 14 | RANGE_RATINGS_METADATA = 'rangeratingsmetadata' 15 | ROUND_ROBIN_RATINGS_METADATA = 'roundrobinratingsmetadata' 16 | 17 | RANGE_PARTITION_PREFIX = 'rangeratingspart' 18 | ROUND_ROBIN_PARTITION_PREFIX = 'roundrobinratingspart' 19 | 20 | RANGE_PARTITION_OUTPUT_NAME = 'RangeRatingsPart' 21 | ROUND_ROBIN_PARTITION_OUTPUT_NAME = 'RoundRobinRatingsPart' 22 | 23 | RANGE_QUERY_OUTPUT_FILE = 'RangeQueryOut.txt' 24 | POINT_QUERY_OUTPUT_FILE = 'PointQueryOut.txt' 25 | 26 | # Donot close the connection inside this file i.e. 
do not perform openconnection.close() 27 | def RangeQuery(ratingsTableName, ratingMinValue, ratingMaxValue, openconnection): 28 | 29 | # 30 | # Range query on range partitions 31 | 32 | try: 33 | cur = openconnection.cursor() 34 | 35 | # finding min boundary range of partition from metadata for given ratingMinValue 36 | query = "select max(minrating) from {0} where minrating <= {1}".format(RANGE_RATINGS_METADATA,ratingMinValue) 37 | cur.execute(query) 38 | minpartboundary = cur.fetchone()[0] 39 | 40 | # finding min boundary range of partition from metadata for given ratingMinValue 41 | query = "select min(maxrating) from {0} where maxrating >= {1}".format(RANGE_RATINGS_METADATA,ratingMaxValue) 42 | cur.execute(query) 43 | maxpartboundary = cur.fetchone()[0] 44 | 45 | # calculating the paratitions from metadata table where tuples of given ranges lies 46 | query = "select partitionnum from {0} where maxrating >= {1} and maxrating <= {2}".format(RANGE_RATINGS_METADATA,minpartboundary,maxpartboundary) 47 | cur.execute(query) 48 | rows = cur.fetchall() 49 | 50 | if os.path.exists(RANGE_QUERY_OUTPUT_FILE): 51 | os.remove(RANGE_QUERY_OUTPUT_FILE) 52 | 53 | for i in rows: 54 | partitionname = RANGE_PARTITION_OUTPUT_NAME + repr(i[0]) 55 | query = "select * from {0} where rating >= {1} and rating <= {2}".format(partitionname, ratingMinValue, ratingMaxValue) 56 | cur.execute(query) 57 | rows2 = cur.fetchall() 58 | with open(RANGE_QUERY_OUTPUT_FILE,'a+') as f: 59 | for j in rows2: 60 | f.write("%s," % partitionname) 61 | f.write("%s," % str(j[0])) 62 | f.write("%s," % str(j[1])) 63 | f.write("%s\n" % str(j[2])) 64 | # 65 | # Range query on round robin paritions 66 | 67 | # get no of round robin partitions 68 | query = "select partitionnum from {0} ".format(ROUND_ROBIN_RATINGS_METADATA) 69 | cur.execute(query) 70 | rrpartitioncount = int(cur.fetchone()[0]) 71 | 72 | for i in range(rrpartitioncount): 73 | partitionname = ROUND_ROBIN_PARTITION_OUTPUT_NAME + repr(i) 74 | query = "select * from {0} where rating >= {1} and rating <= {2}".format(partitionname, ratingMinValue, ratingMaxValue) 75 | cur.execute(query) 76 | rows2 = cur.fetchall() 77 | with open(RANGE_QUERY_OUTPUT_FILE,'a+') as f: 78 | for j in rows2: 79 | f.write("%s," % partitionname) 80 | f.write("%s," % str(j[0])) 81 | f.write("%s," % str(j[1])) 82 | f.write("%s\n" % str(j[2])) 83 | 84 | except Exception as ex: 85 | print("Exception while processing RangeQuery: ",ex) 86 | 87 | 88 | def PointQuery(ratingsTableName, ratingValue, openconnection): 89 | 90 | # Point query for range partition 91 | try: 92 | 93 | cur = openconnection.cursor() 94 | if ratingValue == 0: 95 | rangepartitionnum = 0 96 | else: 97 | query = "select partitionnum from {0} where minrating < {1} and maxrating >= {1}".format(RANGE_RATINGS_METADATA,ratingValue) 98 | cur.execute(query) 99 | rangepartitionnum = cur.fetchone()[0] 100 | 101 | partitionname = RANGE_PARTITION_OUTPUT_NAME + repr(rangepartitionnum) 102 | query = "select * from {0} where rating = {1} ".format(partitionname, ratingValue ) 103 | cur.execute(query) 104 | rows2 = cur.fetchall() 105 | 106 | if os.path.exists(POINT_QUERY_OUTPUT_FILE): 107 | os.remove(POINT_QUERY_OUTPUT_FILE) 108 | 109 | with open(POINT_QUERY_OUTPUT_FILE,'a+') as f: 110 | for j in rows2: 111 | f.write("%s," % partitionname) 112 | f.write("%s," % str(j[0])) 113 | f.write("%s," % str(j[1])) 114 | f.write("%s\n" % str(j[2])) 115 | 116 | # 117 | # Point query for round robin partition 118 | query = "select partitionnum from {0} 
".format(ROUND_ROBIN_RATINGS_METADATA) 119 | cur.execute(query) 120 | rrpartitioncount = int(cur.fetchone()[0]) 121 | 122 | for i in range(rrpartitioncount): 123 | partitionname = ROUND_ROBIN_PARTITION_OUTPUT_NAME + repr(i) 124 | query = "select * from {0} where rating = {1} ".format(partitionname, ratingValue ) 125 | cur.execute(query) 126 | rows2 = cur.fetchall() 127 | with open(POINT_QUERY_OUTPUT_FILE,'a+') as f: 128 | for j in rows2: 129 | f.write("%s," % partitionname) 130 | f.write("%s," % str(j[0])) 131 | f.write("%s," % str(j[1])) 132 | f.write("%s\n" % str(j[2])) 133 | 134 | except Exception as ex: 135 | print("Exception while processing RangeQuery: ",ex) 136 | --------------------------------------------------------------------------------