├── README.md
├── anonymizer
│   ├── data-sampler.py
│   ├── log-anonymizer.py
│   ├── run-sampler.sh
│   └── run.sh
├── clusterer
│   ├── generate-cluster-coverage.py
│   ├── logical_clustering_utility
│   │   ├── buildVectors.py
│   │   └── schemaParser.py
│   ├── online_clustering.py
│   ├── online_logical_clustering.py
│   ├── run_generate_cluster_coverage.sh
│   └── run_sensitivity.sh
├── forecaster
│   ├── Utilities.py
│   ├── calc_mse.py
│   ├── exp_multi_online_continuous.py
│   ├── generate_ensemble_hybrid.py
│   ├── models
│   │   ├── FNN_Model.py
│   │   ├── PSRNN_Model.py
│   │   └── RNN_Model.py
│   ├── plot-prediction-median-error.py
│   ├── plot-sensitivity.py
│   ├── run.sh
│   ├── run_logical.sh
│   ├── run_sample.sh
│   ├── run_sensitivity.sh
│   └── spectral
│       └── Two_Stage_Regression.py
├── planner-simulator
│   ├── planner_simulator.py
│   └── schemaParser.py
├── pre-processor
│   ├── csv-combiner.py
│   └── templatizer.py
├── run.sh
└── workload-simulator
    ├── preprocessing.py
    ├── run.sh
    └── workload-simulator.py

/README.md:
--------------------------------------------------------------------------------
# QueryBot 5000
**QueryBot 5000 (QB5000)** is a robust forecasting framework that allows a DBMS to predict the expected arrival rate of queries in the future based on historical data. This is the source code for our [SIGMOD paper](http://www.cs.cmu.edu/~malin199/publications/2018.forecasting.sigmod.pdf): **_Query-based Workload Forecasting for Self-Driving Database Management Systems_**.

## Run forecasting on a sample of the BusTracker workload:
    ./run.sh
This script runs workload forecasting on a sample subset of the **BusTracker** workload. The prediction specified in the script uses a 1-hour interval and a 3-day horizon. The arrival rates predicted by the different models for each cluster are written to the _prediction-results_ folder. All query templates of the workload are listed in _templates.txt_.

The default experimental setting runs on the CPU. If you have a GPU, you can change [this parameter](https://github.com/malin1993ml/QueryBot5000/blob/master/forecaster/exp_multi_online_continuous.py#L101) to _True_ to enable GPU training.

### Dependencies
    python>=3.5
    scikit-learn>=0.18.1
    sortedcontainers>=1.5.7
    statsmodels>=0.8.0
    scipy>=0.19.0
    numpy>=1.14.2
    matplotlib>=2.0.2
    pytorch>=0.2.0_1 (install the GPU version if you want to train on a GPU)

## Framework Pipeline:

### Anonymization
We first anonymize all the queries from the real-world traces used in our experiments for privacy purposes. The components below use the anonymization results from this step as their input.

    cd anonymizer
    ./log-anonymizer.py --help

### Pre-processor
This component extracts **templates** from the anonymized queries and records the arrival-rate history of each template. A minimal sketch of the idea follows below.

    cd pre-processor
    ./templatizer.py --help
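For intuition, here is a simplified illustration of templatization (a hedged sketch with hypothetical example queries, not the actual logic in _templatizer.py_): queries that differ only in their constants are mapped to the same template by replacing literals with placeholders.

    import re

    def templatize(sql):
        """Map a query to its template by stripping out literal values."""
        sql = re.sub(r"'[^']*'", "?", sql)           # string literals -> placeholder
        sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)   # numeric literals -> placeholder
        return re.sub(r"\s+", " ", sql).strip().lower()

    # Both queries collapse to the same template:
    # "select * from stops where route_id = ? and name = ?"
    templatize("SELECT * FROM stops WHERE route_id = 61 AND name = 'Forbes'")
    templatize("SELECT * FROM stops WHERE route_id = 71 AND name = 'Murray'")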
### Clusterer
This component groups query templates with similar arrival-rate patterns into **clusters**.

    cd clusterer
    ./online_clustering.py --help

_generate-cluster-coverage.py_ generates the time series for the largest _MAX_CLUSTER_NUM_ clusters on each day, which are used in the forecasting evaluation.
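The core of the online clustering step can be pictured with the following minimal sketch (hypothetical names and threshold, not the actual implementation in _online_clustering.py_): each template's arrival-rate history is treated as a vector, and a template either joins the most similar existing cluster or starts a new one.

    import numpy as np

    SIMILARITY_THRESHOLD = 0.8   # assumed value, for illustration only

    def assign(template_vector, centers):
        """Assign one template's arrival-rate vector to a cluster center."""
        v = np.asarray(template_vector, dtype=float)
        best, best_sim = None, -1.0
        for i, center in enumerate(centers):    # centers: list of np.ndarray vectors
            sim = np.dot(v, center) / (np.linalg.norm(v) * np.linalg.norm(center) + 1e-9)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= SIMILARITY_THRESHOLD:
            centers[best] = centers[best] + v   # fold the template into that cluster
            return best
        centers.append(v)                       # not similar to anything: new cluster
        return len(centers) - 1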
### Forecaster
This component uses a combination of linear regression, recurrent neural networks, and kernel regression to predict the arrival-rate pattern of each query cluster over different prediction **horizons** and **intervals**.

    cd forecaster
    ./exp_multi_online_continuous.py --help

### Workload Simulator
This simulator populates a synthetic database with a given schema file, removes all the secondary indexes, replays the query trace of the workload, and builds appropriate indexes based on the real-time workload forecasting results.

    cd workload-simulator
    ./workload-simulator.py --help

## Inquiry about Data
Due to legal and privacy constraints, we unfortunately cannot publish the full datasets used in the experiments for the publication (especially the two student-related **Admissions** and **MOOC** workloads). We have, however, published a subset (a 2% random sample) of the **BusTracker** [workload trace](https://drive.google.com/file/d/1imVPNXk8mGU0v9OOhdp0d9wFDuYqARwZ/view?usp=sharing) along with its [schema file](https://drive.google.com/file/d/1d4z3SAwIOmv_PJTlsUfPCNHxZu2r-g_O/view?usp=sharing).

We use [this script](https://github.com/malin1993ml/QueryBot5000/blob/master/anonymizer/run-sampler.sh) to generate the sample subset of the original workload trace.

## NOTE
This repo does not contain an end-to-end running framework. We built the components separately and pass their results through a workload simulator that connects to MySQL/PostgreSQL for experimental purposes. We are integrating the full framework into the [Peloton](http://pelotondb.io/) self-driving DBMS. Please check out our [source code](https://github.com/cmu-db/peloton/tree/master/src/include/brain) there for more details.

## License
Copyright 2018, Carnegie Mellon University

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

--------------------------------------------------------------------------------
/anonymizer/data-sampler.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3.5

import sys
import glob
import collections
import time
import csv
import os
import datetime
import gzip
import re
import argparse
from multiprocessing import Process

csv.field_size_limit(sys.maxsize)

# Keep one out of every SAMPLE_STEP queries from the input trace.
SAMPLE_STEP = 50

OUTPUT = csv.writer(sys.stdout, quoting=csv.QUOTE_ALL)


def ProcessData(path, num_logs):
    data = []
    processed_queries = 0
    templated_workload = dict()

    min_timestamp = datetime.datetime.max
    max_timestamp = datetime.datetime.min

    #try:
    f = gzip.open(path, mode='rt')
    reader = csv.reader(f, delimiter=',')

    for i, query_info in enumerate(reader):
        processed_queries += 1

        if num_logs is not None and processed_queries > num_logs:
            break

        if i % SAMPLE_STEP == 0:
            OUTPUT.writerow(query_info)


# ==============================================
# main
# ==============================================
if __name__ == '__main__':
    aparser = argparse.ArgumentParser(description='Sample SQL query logs')
    aparser.add_argument('input', help='Input file')
    aparser.add_argument('--max_log', type=int, help='Maximum number of logs to process in a '
                         'data file. Process the whole file if not provided')
    args = vars(aparser.parse_args())

    ProcessData(args['input'], args['max_log'])


--------------------------------------------------------------------------------
/anonymizer/log-anonymizer.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3.5

import sys
import re
import gzip
import csv
import sqlparse
import hashlib
import string
import logging
import argparse
import zipfile

from pprint import pprint

global ANONYMIZE

# ==============================================
# LOGGING CONFIGURATION
# ==============================================

LOG = logging.getLogger(__name__)
LOG_handler = logging.StreamHandler()
LOG_formatter = logging.Formatter(
    fmt='%(asctime)s[%(funcName)s:%(lineno)03d]%(levelname)-5s:%(message)s',
    datefmt='%m-%d-%Y %H:%M:%S')
LOG_handler.setFormatter(LOG_formatter)
LOG.addHandler(LOG_handler)
LOG.setLevel(logging.INFO)

CMD_TYPES = [
    "Connect",
    "Quit",
    "Init DB",
    "Query",
    "Field List",
    "Statistics"
]

# SQL commands that we just want to simply ignore
IGNORED_CMDS = []

CLEAN_CMDS = [
    re.compile(r"(WHERE|ON)[\s]{2,}", re.IGNORECASE)
]

# T