├── application.py
├── Images
│   ├── info.png
│   ├── lowest.png
│   ├── XGboost.jpg
│   ├── highest.png
│   ├── predict.jpg
│   ├── application.jpg
│   ├── earthquakes.png
│   ├── DecisionTree.jpg
│   ├── RandomForest.png
│   ├── class_distrib.png
│   └── featureengineer.png
├── Webapp
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── __pycache__
│   │   ├── main.cpython-36.pyc
│   │   └── __init__.cpython-36.pyc
│   ├── templates
│   │   └── index.html
│   └── main.py
├── Data
│   ├── Earthquakedata.db
│   └── Earthquakedata_predict.db
├── requirements.txt
└── README.md

--------------------------------------------------------------------------------
/application.py:
--------------------------------------------------------------------------------
from Webapp import app
app.run(debug=True)

--------------------------------------------------------------------------------
/Webapp/__init__.py:
--------------------------------------------------------------------------------
from flask import Flask

app = Flask(__name__)

from Webapp import main

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
click==7.1.2
Flask==1.1.2
gunicorn==20.0.4
itsdangerous==1.1.0
Jinja2==2.11.2
joblib==0.16.0
MarkupSafe==1.1.1
numpy==1.19.1
pandas==1.1.0
python-dateutil==2.8.1
pytz==2020.1
scikit-learn==0.23.1
scipy==1.5.2
six==1.15.0
sklearn==0.0
SQLAlchemy==1.3.18
threadpoolctl==2.1.0
Werkzeug==1.0.1
xgboost==1.1.1

--------------------------------------------------------------------------------
/Webapp/templates/index.html:
--------------------------------------------------------------------------------
[template markup largely lost in extraction; page title: "Worldwide Earthquake Forecaster"]

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
* The dataset contains many null values, but most of the affected features describe measurement 'error' and are not used for prediction. Since feature selection keeps only a subset of columns in the final dataframe, I chose to simply **drop or ignore the null values**.

* Beyond the features already in the dataset, I did some feature engineering based on the following considerations:

  * Set a fixed rolling window over past values to build inputs for predicting the future.
  * Created 6 new features from rolling-window averages of depth and magnitude.
  * Defined the target `mag_outcome` as the magnitude outcome shifted forward by the rolling-window length (e.g. 7 days).

  **New features include**: avg_depth and magnitude_avg over 22-, 15- and 7-day rolling windows, used for training.

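The rolling-window features and shifted target described above can be sketched with pandas. The column names and window sizes follow the description, but the data itself is synthetic and purely illustrative:

```python
import numpy as np
import pandas as pd

# Toy daily earthquake aggregates (synthetic values, illustration only).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"depth": rng.uniform(1, 70, 60), "mag": rng.uniform(0.5, 5.5, 60)},
    index=pd.date_range("2020-06-01", periods=60, freq="D"),
)

# Rolling-average features over 22-, 15- and 7-day windows, mirroring the
# avg_depth / magnitude_avg features described above.
for win in (22, 15, 7):
    df[f"depth_avg_{win}"] = df["depth"].rolling(win).mean()
    df[f"mag_avg_{win}"] = df["mag"].rolling(win).mean()

# Target: the next 7 days' rolling magnitude average, shifted back so each
# row is paired with its future outcome, then binarized at magnitude 2.5.
df["mag_outcome"] = df["mag"].rolling(7).mean().shift(-7)
df["quake"] = (df["mag_outcome"] > 2.5).astype(int)

# Drop rows whose windows are not yet (or no longer) filled.
df = df.dropna()
print(df.shape)  # → (32, 10)
```

Each row now pairs the past-window averages with the outcome of the *next* 7 days, which is what lets a classifier forecast forward.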
* After feature engineering and handling the null values, the dataset has an imbalanced class distribution.

* Accuracy is not the metric to use on an imbalanced dataset; we have seen that it is misleading. There are several ways to get a more truthful picture with imbalanced classes, such as collecting more data, changing the metrics, resampling the data, cross-validation, etc.
For this project I treat the imbalance by choosing metrics accordingly:
  1. Confusion matrix: a breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (which classes the incorrect predictions were assigned to).
  2. Recall: a measure of a classifier's completeness.
  3. ROC curves: accuracy is decomposed into sensitivity and specificity, and models can be chosen based on the threshold balance between these values.

* Moreover, these metrics not only guard against the confirmation bias an imbalanced class distribution invites, but also fit the nature of the problem: in earthquake prediction, false negatives must be penalized more heavily.
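All three metrics are available in scikit-learn; a minimal sketch on made-up labels and scores (not project results), showing where the false-negative count comes from:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score

# Illustrative imbalanced problem: 90 negatives, 10 positives.
y_true = np.array([0] * 90 + [1] * 10)
y_score = np.concatenate([np.linspace(0.05, 0.45, 90),   # negatives score low
                          np.linspace(0.30, 0.95, 10)])  # positives score high
y_pred = (y_score > 0.5).astype(int)

# Confusion matrix flattens to (tn, fp, fn, tp); fn is the count we care about.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("false negatives:", fn)
print("recall:", recall_score(y_true, y_pred))
print("roc_auc:", roc_auc_score(y_true, y_score))  # uses scores, not labels
```

Note that `roc_auc_score` takes the continuous scores while recall and the confusion matrix need thresholded labels; moving the 0.5 threshold trades false negatives against false positives.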

**Let's analyse the top 20 places with the highest and lowest mean magnitude**

Top 20 places with the lowest mean quake magnitude over the past 30 days.


Top 20 places with the highest mean quake magnitude over the past 30 days.

* Finally, the `mag_outcome` target we built from the 7-day forward rolling window is converted to a class of 1 or 0 depending on whether the magnitude outcome is > 2.5.

The rest is best explained in the project walkthrough notebooks `Data/ETL_USGS_EarthQuake.ipynb` or `Data/ETL_USGS_EarthQuake.html`.
Finally, the cleaned data for prediction is stored in the database file `Data/Earthquakedata.db` using a SQL engine.

**Note**: the cleaned data is stored in a database only for the project walkthrough; for realtime analysis, the Flask app in `Webapp/main.py` extracts the data on the fly without storing it. This makes sure users get realtime data on any day the web app is requested.
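The on-the-fly extraction reads a USGS earthquake feed; a minimal sketch of the parsing step, using an inline CSV shaped like the USGS feed instead of the network call (the sample rows and the `load_quakes` helper are made up for illustration):

```python
import io

import pandas as pd

# A few rows shaped like the USGS CSV feed (illustrative values only);
# the real app would read the live feed URL instead of this inline sample.
FEED_CSV = """time,latitude,longitude,depth,mag,place
2020-08-01T00:12:00.000Z,35.7,-117.5,7.1,2.8,"Ridgecrest, CA"
2020-08-01T03:40:00.000Z,61.3,-150.0,40.2,3.4,"Anchorage, AK"
2020-08-01T09:05:00.000Z,19.2,-155.4,2.5,2.1,"Island of Hawaii"
"""

def load_quakes(source):
    """Parse a USGS-style CSV feed, keeping only the columns the
    feature engineering above relies on (nothing is written to disk)."""
    df = pd.read_csv(source, parse_dates=["time"])
    return df[["time", "latitude", "longitude", "depth", "mag", "place"]]

quakes = load_quakes(io.StringIO(FEED_CSV))
print(quakes.shape)  # → (3, 6)
```

Because nothing is persisted, every request sees whatever the feed contains at that moment.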
### Model implementation and methodology

After removing null values and engineering features as discussed above, I applied boosting algorithms to the classification problem:

1. AdaBoost classifier with a DecisionTreeClassifier base estimator

2. AdaBoost classifier with a RandomForestClassifier base estimator

3. Finally, the XGBoost algorithm.

For all the above algorithms:
* DecisionTreeClassifier

  max_depth = [2,6,7] and n_estimators = [200,500,700], with GridSearchCV to pick the best estimator. Nodes are expanded until all leaves are pure or contain fewer than min_samples_split = 2 samples, which suits classification with the varied feature types in the dataset.

* RandomForestClassifier

  The same parameters were used for the random forest, to compare the algorithms under GridSearchCV, along with one more hyperparameter, `max_features` = ['auto','sqrt','log2'], which selects features based on log(features), sqrt(features), etc.

* XGBoost classifier

  I did not use GridSearchCV here since it took very long to train; instead I reused the best max_depth found above (6), with `learning_rate=0.03` and `gbtree` as the booster.

Model selection was based on evaluating `roc_auc` score and `recall`, together with hyperparameter tuning.
A much more detailed walkthrough is given in `models/Earthquake-predictor-ML-workflow.ipynb` or `models/Earthquake-predictor-ML-workflow.html`.
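The grid search over an AdaBoost classifier with a decision-tree base estimator can be sketched as follows; the data is synthetic and the grid is deliberately smaller than the project's [2,6,7] × [200,500,700] so the sketch runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the engineered features.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.8, 0.2],
                           random_state=42)

ada = AdaBoostClassifier(DecisionTreeClassifier())
# The nested-parameter prefix is "base_estimator" on older scikit-learn
# (e.g. the pinned 0.23.1) but "estimator" on newer releases.
prefix = "estimator" if "estimator" in ada.get_params() else "base_estimator"

# Reduced grid for speed; the project used max_depth=[2,6,7],
# n_estimators=[200,500,700].
param_grid = {f"{prefix}__max_depth": [2, 6], "n_estimators": [50, 100]}

search = GridSearchCV(ada, param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring on `roc_auc` rather than accuracy keeps the search aligned with the imbalance-aware metrics chosen earlier.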

The `max_depth` hyperparameter, along with `n_estimators`, was important because it controls how deep each tree can grow: the deeper the tree, the more splits it has and the more information it captures, which matters since the earthquake data covers only the past 30 days and includes rolling-window magnitude features.

The `max_features` hyperparameter controls how many features are considered for classification. With magnitude and depth features over 22-, 15- and 7-day windows, it determines how many of them the model attends to; GridSearchCV chooses among `sqrt(num_features)`, `log(num_features)` and `auto`.

### Improvement and evaluation

* I used GridSearchCV for model improvement and hyperparameter tuning on the AdaBoost classifiers with `DecisionTreeClassifier` and `RandomForestClassifier` base estimators.
* Using the same hyperparameters I trained XGBoost. As mentioned above, the evaluation metrics are `roc_auc` score and `recall`.

**DecisionTreeClassifier AdaBoost**



1. With the **AdaBoost decision tree classifier** and hyperparameter tuning, we get an area under the curve (AUC) of 0.8867.
2. The higher the AUC score, the better the model, since it is better at distinguishing the positive and negative classes.
3. Note from the **confusion matrix** that `False negatives = 42` and `Recall = 0.7789`. We need these values in addition to the AUC score; we will come back to them when comparing the other models below.

GridSearchCV gave the best estimator with `max_depth = 6` and `n_estimators = 500`.

Model selection is based on the metric scores after comparing all the algorithms.
**RandomForestClassifier AdaBoost**



1. The AUC score for the **AdaBoost RandomForest classifier** is 0.916, slightly higher than the decision tree classifier.
2. Moreover, the **confusion matrix** shows `False negatives = 38` and `Recall = 0.8`, slightly higher than the decision tree's recall. Thus it performs better than the decision tree AdaBoost.

The random forest's best estimator has `max_depth = 7` and `max_features = sqrt(features)`.

**XGBoost model**



1. I also tested the XGBoost model with parameters similar to those found above, since GridSearchCV was taking a lot of time for XGBoost.

2. With `n_estimators = 500` and `learning_rate = 0.03`, XGBoost gives a significantly higher AUC score of almost 0.98, with `False negatives = 37`, similar to the Random Forest AdaBoost, but with more true positives and fewer false positives. Its `Recall = 0.805` is similar to the AdaBoost random forest, yet XGBoost is really good at separating the positive and negative classes, with `roc_auc_score = 0.98193`.
As the ROC curve shows, the XGBoost AUC score (0.9819) is higher than both the AdaBoost decision tree and the random forest.

* Since the XGBoost model has higher `recall` and `auc_score` than the other algorithms, it can be considered more robust: it handles the class imbalance well and keeps false negatives low, which is what matters most for our task.
Hence we use XGBoost for prediction on live data and for deployment in the application.

-> For more insight see `models/Earthquake-predictor-ML-workflow.ipynb` or `models/Earthquake-predictor-ML-workflow.html`.

### Prediction and Web-application

* Select specific features such as `date`, `place`, `long` and `lat`, and report the predicted earthquake probability at that place and date as the `quake` probability.
* Use only the final 7-day rolling-period rows of the prediction dataframe, where the outcome value is NaN, since we need to predict the next 7-day period.
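The prediction step above can be sketched as follows; for a self-contained example, `GradientBoostingClassifier` stands in for the deployed XGBoost model, and the dataframe contents, feature matrix, and place names are all made up:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Toy training data standing in for the engineered rolling-window features;
# GradientBoostingClassifier stands in for the deployed XGBoost model.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 6))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
model = GradientBoostingClassifier().fit(X_train, y_train)

# Rows whose outcome is still NaN — the final 7-day rolling period,
# i.e. the places/dates we actually want to forecast.
predict_df = pd.DataFrame({
    "date": pd.to_datetime(["2020-08-10", "2020-08-10"]),
    "place": ["Alaska", "Hawaii"],
    "lat": [61.3, 19.2],
    "long": [-150.0, -155.4],
})
X_new = rng.normal(size=(len(predict_df), 6))  # their engineered features

# P(class == 1) becomes the `quake` probability shown in the app.
predict_df["quake"] = model.predict_proba(X_new)[:, 1]
print(predict_df[["date", "place", "quake"]])
```

Reporting `predict_proba` instead of a hard 0/1 label lets the app surface a graded probability per place and date.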

**Prediction for a particular day**



**Web App**

1. Now it's time to deploy the model as a web application with Flask. I chose to host it on https://www.pythonanywhere.com/, a free cloud hosting platform for Flask web applications.

2. The main idea of the application is to predict, or forecast, earthquake sites all over the world on a given day.

3. The user can change the date with a slider and look at the predicted places all over the world where an earthquake is likely to happen. [App](http://srichaditya3098.pythonanywhere.com/).

4. The application uses the Google Maps [API](https://developers.google.com/maps/documentation), so the coordinates we get from the model's predictions need to be converted to the API's format. This is done in `Webapp/main.py`.
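A stripped-down sketch of the Flask side, assuming the same `Flask(__name__)` setup as `Webapp/__init__.py`; the inline template and the hard-coded coordinate list are placeholders, since the real `Webapp/main.py` renders `templates/index.html` and feeds live predictions to Google Maps:

```python
from flask import Flask, render_template_string

app = Flask(__name__)

# Placeholder for templates/index.html; the real template embeds a Google
# Maps view and a date slider.
PAGE = "<h1>Worldwide Earthquake Forecaster</h1><p>{{ n }} predicted sites</p>"

@app.route("/")
def index():
    # (lat, lng) pairs that would come from the model's predictions.
    sites = [(61.3, -150.0), (19.2, -155.4)]
    return render_template_string(PAGE, n=len(sites))

if __name__ == "__main__":
    app.run(debug=True)
```

The hosting platform runs the app through its own WSGI entry point, so the `app.run(debug=True)` call only matters for local development.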


### Improvement and conclusion

Though the XGBoost model gave a higher `roc_auc` and better `recall`, any piece of work has scope for improvement: here we could also use an `RNN or LSTM` for the time-series, or rather event-series, forecasting, since LSTMs have memory cells that help in remembering and handling such data well. Moreover, for XGBoost I simply reused hyperparameters from the already-tuned AdaBoost models; XGBoost's own hyperparameters could be tuned with GridSearchCV or RandomizedSearchCV to find the best parameters.

**Some final thoughts**

1. So far the model looks good, with XGBoost chosen for the web-app predictions thanks to its higher AUC and recall scores; I explained under the XGBoost results section why AUC and recall were chosen.

2. Our main aim is to predict whether an earthquake will happen at a given day and place. So we definitely do **not want a model with a high false-negative count, since predicting no earthquake when one actually happens is more dangerous than predicting an earthquake that does not happen**. We can allow more false positives than false negatives.

3. Comparing the roc_auc scores, confusion matrices and recall scores, all the algorithms gave similar results with slightly different recall scores, but XGBoost, with `FN = 37` and a higher `auc_score` of 0.98, performs better overall. Hence I chose XGBoost for the web-application deployment; it is also faster than AdaBoost.

With all of the above implemented, the web application was successfully deployed, and the full project walkthrough can be accessed from the `Data` and `models` directories.


--------------------------------------------------------------------------------