├── application.py
├── Images
│   ├── info.png
│   ├── lowest.png
│   ├── XGboost.jpg
│   ├── highest.png
│   ├── predict.jpg
│   ├── application.jpg
│   ├── earthquakes.png
│   ├── DecisionTree.jpg
│   ├── RandomForest.png
│   ├── class_distrib.png
│   └── featureengineer.png
├── Webapp
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── __pycache__
│   │   ├── main.cpython-36.pyc
│   │   └── __init__.cpython-36.pyc
│   ├── templates
│   │   └── index.html
│   └── main.py
├── Data
│   ├── Earthquakedata.db
│   └── Earthquakedata_predict.db
├── requirements.txt
└── README.md

--------------------------------------------------------------------------------
/application.py:
--------------------------------------------------------------------------------
from Webapp import app
app.run(debug=True)

--------------------------------------------------------------------------------
/Webapp/__init__.py:
--------------------------------------------------------------------------------
from flask import Flask

app = Flask(__name__)

from Webapp import main

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
click==7.1.2
Flask==1.1.2
gunicorn==20.0.4
itsdangerous==1.1.0
Jinja2==2.11.2
joblib==0.16.0
MarkupSafe==1.1.1
numpy==1.19.1
pandas==1.1.0
python-dateutil==2.8.1
pytz==2020.1
scikit-learn==0.23.1
scipy==1.5.2
six==1.15.0
sklearn==0.0
SQLAlchemy==1.3.18
threadpoolctl==2.1.0
Werkzeug==1.0.1
xgboost==1.1.1

--------------------------------------------------------------------------------
/Webapp/templates/index.html:
--------------------------------------------------------------------------------
[template markup largely lost in extraction; page title: "Worldwide Earthquake Forecaster"]

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
* The dataset contains many null values, but most of the affected features describe measurement 'error' and are not used for prediction. Since feature selection keeps only a subset of columns in the final dataframe, I chose to simply **drop or ignore the null values**.

* Beyond the features already in the dataset, I did some feature engineering based on the following considerations:

  * Set a fixed rolling window over past values to build inputs for predicting the future.
  * Created 6 new features from rolling-window averages of depth and magnitude.
  * Defined the target `mag_outcome` as the magnitude outcome shifted forward by the rolling-window length (e.g. 7 days).

  **New features include**: avg_depth and magnitude_avg over 22-, 15- and 7-day rolling windows, used for training.

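The rolling-window features and shifted target described above can be sketched with pandas. The column names and window sizes follow the description, but the data itself is synthetic and purely illustrative:

```python
import numpy as np
import pandas as pd

# Toy daily earthquake aggregates (synthetic values, illustration only).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"depth": rng.uniform(1, 70, 60), "mag": rng.uniform(0.5, 5.5, 60)},
    index=pd.date_range("2020-06-01", periods=60, freq="D"),
)

# Rolling-average features over 22-, 15- and 7-day windows, mirroring the
# avg_depth / magnitude_avg features described above.
for win in (22, 15, 7):
    df[f"depth_avg_{win}"] = df["depth"].rolling(win).mean()
    df[f"mag_avg_{win}"] = df["mag"].rolling(win).mean()

# Target: the next 7 days' rolling magnitude average, shifted back so each
# row is paired with its future outcome, then binarized at magnitude 2.5.
df["mag_outcome"] = df["mag"].rolling(7).mean().shift(-7)
df["quake"] = (df["mag_outcome"] > 2.5).astype(int)

# Drop rows whose windows are not yet (or no longer) filled.
df = df.dropna()
print(df.shape)  # → (32, 10)
```

Each row now pairs the past-window averages with the outcome of the *next* 7 days, which is what lets a classifier forecast forward.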
* After feature engineering and handling the null values, the dataset has an imbalanced class distribution.

* Accuracy is not the metric to use on an imbalanced dataset; we have seen that it is misleading. There are several ways to get a more truthful picture with imbalanced classes, such as collecting more data, changing the metrics, resampling the data, cross-validation, etc.
For this project I treat the imbalance by choosing metrics accordingly:
  1. Confusion matrix: a breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (which classes the incorrect predictions were assigned to).
  2. Recall: a measure of a classifier's completeness.
  3. ROC curves: accuracy is decomposed into sensitivity and specificity, and models can be chosen based on the threshold balance between these values.

* Moreover, these metrics not only guard against the confirmation bias an imbalanced class distribution invites, but also fit the nature of the problem: in earthquake prediction, false negatives must be penalized more heavily.
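All three metrics are available in scikit-learn; a minimal sketch on made-up labels and scores (not project results), showing where the false-negative count comes from:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score

# Illustrative imbalanced problem: 90 negatives, 10 positives.
y_true = np.array([0] * 90 + [1] * 10)
y_score = np.concatenate([np.linspace(0.05, 0.45, 90),   # negatives score low
                          np.linspace(0.30, 0.95, 10)])  # positives score high
y_pred = (y_score > 0.5).astype(int)

# Confusion matrix flattens to (tn, fp, fn, tp); fn is the count we care about.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("false negatives:", fn)
print("recall:", recall_score(y_true, y_pred))
print("roc_auc:", roc_auc_score(y_true, y_score))  # uses scores, not labels
```

Note that `roc_auc_score` takes the continuous scores while recall and the confusion matrix need thresholded labels; moving the 0.5 threshold trades false negatives against false positives.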

**Let's analyse the top 20 places with the highest and lowest mean magnitude**

Top 20 places with the lowest mean quake magnitude over the past 30 days.


Top 20 places with the highest mean quake magnitude over the past 30 days.

* Finally, the `mag_outcome` target we built from the 7-day forward rolling window is converted to a class of 1 or 0 depending on whether the magnitude outcome is > 2.5.

The rest is best explained in the project walkthrough notebooks `Data/ETL_USGS_EarthQuake.ipynb` or `Data/ETL_USGS_EarthQuake.html`.
Finally, the cleaned data for prediction is stored in the database file `Data/Earthquakedata.db` using a SQL engine.

**Note**: the cleaned data is stored in a database only for the project walkthrough; for realtime analysis, the Flask app in `Webapp/main.py` extracts the data on the fly without storing it. This makes sure users get realtime data on any day the web app is requested.
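The on-the-fly extraction reads a USGS earthquake feed; a minimal sketch of the parsing step, using an inline CSV shaped like the USGS feed instead of the network call (the sample rows and the `load_quakes` helper are made up for illustration):

```python
import io

import pandas as pd

# A few rows shaped like the USGS CSV feed (illustrative values only);
# the real app would read the live feed URL instead of this inline sample.
FEED_CSV = """time,latitude,longitude,depth,mag,place
2020-08-01T00:12:00.000Z,35.7,-117.5,7.1,2.8,"Ridgecrest, CA"
2020-08-01T03:40:00.000Z,61.3,-150.0,40.2,3.4,"Anchorage, AK"
2020-08-01T09:05:00.000Z,19.2,-155.4,2.5,2.1,"Island of Hawaii"
"""

def load_quakes(source):
    """Parse a USGS-style CSV feed, keeping only the columns the
    feature engineering above relies on (nothing is written to disk)."""
    df = pd.read_csv(source, parse_dates=["time"])
    return df[["time", "latitude", "longitude", "depth", "mag", "place"]]

quakes = load_quakes(io.StringIO(FEED_CSV))
print(quakes.shape)  # → (3, 6)
```

Because nothing is persisted, every request sees whatever the feed contains at that moment.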
### Model implementation and methodology

After removing null values and engineering features as discussed above, I applied boosting algorithms to the classification problem:

1. AdaBoost classifier with a DecisionTreeClassifier base estimator

2. AdaBoost classifier with a RandomForestClassifier base estimator

3. Finally, the XGBoost algorithm.

For all the above algorithms:
* DecisionTreeClassifier

  max_depth = [2,6,7] and n_estimators = [200,500,700], with GridSearchCV to pick the best estimator. Nodes are expanded until all leaves are pure or contain fewer than min_samples_split = 2 samples, which suits classification with the varied feature types in the dataset.

* RandomForestClassifier

  The same parameters were used for the random forest, to compare the algorithms under GridSearchCV, along with one more hyperparameter, `max_features` = ['auto','sqrt','log2'], which selects features based on log(features), sqrt(features), etc.

* XGBoost classifier

  I did not use GridSearchCV here since it took very long to train; instead I reused the best max_depth found above (6), with `learning_rate=0.03` and `gbtree` as the booster.

Model selection was based on evaluating `roc_auc` score and `recall`, together with hyperparameter tuning.
A much more detailed walkthrough is given in `models/Earthquake-predictor-ML-workflow.ipynb` or `models/Earthquake-predictor-ML-workflow.html`.
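The grid search over an AdaBoost classifier with a decision-tree base estimator can be sketched as follows; the data is synthetic and the grid is deliberately smaller than the project's [2,6,7] × [200,500,700] so the sketch runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the engineered features.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.8, 0.2],
                           random_state=42)

ada = AdaBoostClassifier(DecisionTreeClassifier())
# The nested-parameter prefix is "base_estimator" on older scikit-learn
# (e.g. the pinned 0.23.1) but "estimator" on newer releases.
prefix = "estimator" if "estimator" in ada.get_params() else "base_estimator"

# Reduced grid for speed; the project used max_depth=[2,6,7],
# n_estimators=[200,500,700].
param_grid = {f"{prefix}__max_depth": [2, 6], "n_estimators": [50, 100]}

search = GridSearchCV(ada, param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring on `roc_auc` rather than accuracy keeps the search aligned with the imbalance-aware metrics chosen earlier.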

The `max_depth` hyperparameter, along with `n_estimators`, was important because it controls how deep each tree can grow: the deeper the tree, the more splits it has and the more information it captures, which matters since the earthquake data covers only the past 30 days and includes rolling-window magnitude features.

The `max_features` hyperparameter controls how many features are considered for classification. With magnitude and depth features over 22-, 15- and 7-day windows, it determines how many of them the model attends to; GridSearchCV chooses among `sqrt(num_features)`, `log(num_features)` and `auto`.

### Improvement and evaluation

* I used GridSearchCV for model improvement and hyperparameter tuning on the AdaBoost classifiers with `DecisionTreeClassifier` and `RandomForestClassifier` base estimators.
* Using the same hyperparameters I trained XGBoost. As mentioned above, the evaluation metrics are `roc_auc` score and `recall`.

**DecisionTreeClassifier AdaBoost**



1. With the **AdaBoost decision tree classifier** and hyperparameter tuning, we get an area under the curve (AUC) of 0.8867.
2. The higher the AUC score, the better the model, since it is better at distinguishing the positive and negative classes.
3. Note from the **confusion matrix** that `False negatives = 42` and `Recall = 0.7789`. We need these values in addition to the AUC score; we will come back to them when comparing the other models below.

GridSearchCV gave the best estimator with `max_depth = 6` and `n_estimators = 500`.

Model selection is based on the metric scores after comparing all the algorithms.
**RandomForestClassifier AdaBoost**



1. The AUC score for the **AdaBoost RandomForest classifier** is 0.916, slightly higher than the decision tree classifier.
2. Moreover, the **confusion matrix** shows `False negatives = 38` and `Recall = 0.8`, slightly higher than the decision tree's recall. Thus it performs better than the decision tree AdaBoost.

The random forest's best estimator has `max_depth = 7` and `max_features = sqrt(features)`.

**XGBoost model**



1. I also tested the XGBoost model with parameters similar to those found above, since GridSearchCV was taking a lot of time for XGBoost.

2. With `n_estimators = 500` and `learning_rate = 0.03`, XGBoost gives a significantly higher AUC score of almost 0.98, with `False negatives = 37`, similar to the Random Forest AdaBoost, but with more true positives and fewer false positives. Its `Recall = 0.805` is similar to the AdaBoost random forest, yet XGBoost is really good at separating the positive and negative classes, with `roc_auc_score = 0.98193`.
As the ROC curve shows, the XGBoost AUC score (0.9819) is higher than both the AdaBoost decision tree and the random forest.

* Since the XGBoost model has higher `recall` and `auc_score` than the other algorithms, it can be considered more robust: it handles the class imbalance well and keeps false negatives low, which is what matters most for our task.
Hence we use XGBoost for prediction on live data and for deployment in the application.

-> For more insight see `models/Earthquake-predictor-ML-workflow.ipynb` or `models/Earthquake-predictor-ML-workflow.html`.

### Prediction and Web-application

* Select specific features such as `date`, `place`, `long` and `lat`, and report the predicted earthquake probability at that place and date as the `quake` probability.
* Use only the final 7-day rolling-period rows of the prediction dataframe, where the outcome value is NaN, since we need to predict the next 7-day period.
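The prediction step above can be sketched as follows; for a self-contained example, `GradientBoostingClassifier` stands in for the deployed XGBoost model, and the dataframe contents, feature matrix, and place names are all made up:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Toy training data standing in for the engineered rolling-window features;
# GradientBoostingClassifier stands in for the deployed XGBoost model.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 6))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
model = GradientBoostingClassifier().fit(X_train, y_train)

# Rows whose outcome is still NaN — the final 7-day rolling period,
# i.e. the places/dates we actually want to forecast.
predict_df = pd.DataFrame({
    "date": pd.to_datetime(["2020-08-10", "2020-08-10"]),
    "place": ["Alaska", "Hawaii"],
    "lat": [61.3, 19.2],
    "long": [-150.0, -155.4],
})
X_new = rng.normal(size=(len(predict_df), 6))  # their engineered features

# P(class == 1) becomes the `quake` probability shown in the app.
predict_df["quake"] = model.predict_proba(X_new)[:, 1]
print(predict_df[["date", "place", "quake"]])
```

Reporting `predict_proba` instead of a hard 0/1 label lets the app surface a graded probability per place and date.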

**Prediction for a particular day**



**Web App**

1. Now it's time to deploy the model as a web application with Flask. I chose to host it on https://www.pythonanywhere.com/, a free cloud hosting platform for Flask web applications.

2. The main idea of the application is to predict, or forecast, earthquake sites all over the world on a given day.

3. The user can change the date with a slider and look at the predicted places all over the world where an earthquake is likely to happen. [App](http://srichaditya3098.pythonanywhere.com/).

4. The application uses the Google Maps [API](https://developers.google.com/maps/documentation), so the coordinates we get from the model's predictions need to be converted to the API's format. This is done in `Webapp/main.py`.
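A stripped-down sketch of the Flask side, assuming the same `Flask(__name__)` setup as `Webapp/__init__.py`; the inline template and the hard-coded coordinate list are placeholders, since the real `Webapp/main.py` renders `templates/index.html` and feeds live predictions to Google Maps:

```python
from flask import Flask, render_template_string

app = Flask(__name__)

# Placeholder for templates/index.html; the real template embeds a Google
# Maps view and a date slider.
PAGE = "<h1>Worldwide Earthquake Forecaster</h1><p>{{ n }} predicted sites</p>"

@app.route("/")
def index():
    # (lat, lng) pairs that would come from the model's predictions.
    sites = [(61.3, -150.0), (19.2, -155.4)]
    return render_template_string(PAGE, n=len(sites))

if __name__ == "__main__":
    app.run(debug=True)
```

The hosting platform runs the app through its own WSGI entry point, so the `app.run(debug=True)` call only matters for local development.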


### Improvement and conclusion

Though the XGBoost model gave a higher `roc_auc` and better `recall`, any piece of work has scope for improvement: here we could also use an `RNN or LSTM` for the time-series, or rather event-series, forecasting, since LSTMs have memory cells that help in remembering and handling such data well. Moreover, for XGBoost I simply reused hyperparameters from the already-tuned AdaBoost models; XGBoost's own hyperparameters could be tuned with GridSearchCV or RandomizedSearchCV to find the best parameters.

**Some final thoughts**

1. So far the model looks good, with XGBoost chosen for the web-app predictions thanks to its higher AUC and recall scores; I explained under the XGBoost results section why AUC and recall were chosen.

2. Our main aim is to predict whether an earthquake will happen at a given day and place. So we definitely do **not want a model with a high false-negative count, since predicting no earthquake when one actually happens is more dangerous than predicting an earthquake that does not happen**. We can allow more false positives than false negatives.

3. Comparing the roc_auc scores, confusion matrices and recall scores, all the algorithms gave similar results with slightly different recall scores, but XGBoost, with `FN = 37` and a higher `auc_score` of 0.98, performs better overall. Hence I chose XGBoost for the web-application deployment; it is also faster than AdaBoost.

With all of the above implemented, the web application was successfully deployed, and the full project walkthrough can be accessed from the `Data` and `models` directories.


--------------------------------------------------------------------------------