├── README.md
├── images
└── images_folder
    ├── NYSE_prices.jpg
    ├── README.md
    ├── adv_area_income_age.jpg
    ├── bitcoin_prices.png
    ├── breast_cancer.jpg
    ├── classification_report_logistic_model.jpg
    ├── classification_report_rf.jpg
    ├── correlations.jpg
    ├── currencies_boxplot.png
    ├── daily_time_site_age.jpg
    ├── daily_time_site_daily_int_usage.jpg
    ├── empirical_click_rate.jpg
    ├── equatorial_coords.jpg
    ├── error_test_set_new_model.jpg
    ├── error_train_model_new.jpg
    ├── feature_imp1.jpg
    ├── feature_imp_2.jpg
    ├── feature_importance_galaxies.jpg
    ├── fig1.jpg
    ├── fig2.jpg
    ├── fraud_confusion_matrix_naive_abyes.jpg
    ├── fraud_feature_imp_naive_bayes.jpg
    ├── fraud_rate.jpg
    ├── fraud_rate_naive_bayes.jpg
    ├── hybrid1.jpg
    ├── hybrid10.jpg
    ├── hybrid11.jpg
    ├── hybrid12.jpg
    ├── hybrid13.jpg
    ├── hybrid14.jpg
    ├── hybrid15.jpg
    ├── hybrid2.jpg
    ├── hybrid3.jpg
    ├── hybrid4.jpg
    ├── hybrid5.jpg
    ├── hybrid6.jpg
    ├── hybrid7.jpg
    ├── hybrid8.jpg
    ├── hybrid9.jpg
    ├── hyp_test1.jpg
    ├── hyp_testing2.jpg
    ├── linear_model2.jpg
    ├── linear_model3.jpg
    ├── linear_model_coeff.jpg
    ├── linear_model_evidence_regression.jpg
    ├── linear_model_gaussian_residuals.jpg
    ├── linear_model_yearly_amount_spent_vs_time_website.jpg
    ├── loss_function.jpg
    ├── model_test_dataset.jpg
    ├── model_training_set.jpg
    ├── pieplot_currencies.png
    ├── prophet_model.jpg
    ├── prophet_model2.jpg
    ├── prophets_results_reality.jpg
    ├── returns_new_model_train_set.jpg
    ├── returns_test_set_new_model.jpg
    ├── roc_fraud.jpg
    ├── sVC_score_grid_search.jpg
    ├── scatter_plot_volume_close_log.png
    ├── sentiment_analysis_tweets2.jpg
    ├── training_test_dataset.jpg
    ├── uber_dbscan.jpg
    ├── uber_kmeans1.jpg
    ├── uber_kmeans2.jpg
    ├── uber_kmeans3.jpg
    └── uber_kmeans4.jpg


/README.md:
--------------------------------------------------------------------------------
  1 | # Table of contents
  2 | - [Project 1. Sentiment-analysis-of-Tweets-and-prediction-of-bitcoin-prices](#P1)
  3 | - [Project 2. NYSE Prediction of Prices](#P2)
  4 | - [Project 3. Electronic Signature of Loans](#P3)
  5 | - [Project 4. Hybrid Mutual Fund Analysis](#P4)
  6 | - [Project 5. Linear Regression in E-commerce](#P5)
  7 | - [Project 6. Advertisement with ML](#P6)
  8 | - [Project 7. Detecting fraud with ML](#P7)
  9 | - [Project 8. NLP MiniProject](#P8)
 10 | - [Project 9. Classification of galaxies, stars and quasars](#P9)
 11 | - [Project 10. Miniproject DL: Breast cancer detection](#P10)
 12 | - [Project 11. DL project: Yolo object detection](#P11)
 13 | - [Project 12. Tinder recommendation system](#P12)
 14 | - [Project 13. Clustering for Uber Pickups](#P13)
 15 | - [Project 14. NLP-Project-with-Deep-Learning: Construction of a Translator machine](#P14)
 16 | - [My first data science project: Anomalies of temperatures in RJ and rainfalls of SP](#P15)
 17 |   
 18 | # [Project 1. Sentiment-analysis-of-Tweets-and-prediction-of-bitcoin-prices](https://github.com/Andreoliveira85/Sentiment-analysis-of-Tweets-and-prediction-of-bitcoin-prices) <a name="P1"></a>
 19 | **Final project of the Fullstack Data Science program at Jedha in Paris.**
 20 | 
 21 | ## Project overview:
 22 | 
 23 | * First, we scraped 13 institutional and private Twitter accounts using the Twitter APIs, covering April 2018 to mid-September 2020.
 24 | * We used historical datasets of Bitcoin and other top coins in the crypto space in the same time period.
 25 | * We performed an initial exploratory data analysis on the coins' historical datasets in order to understand the dominance of BTC relative to other coins and the evolution of prices in recent years.
 26 | * We merged the Twitter datasets and ran a sentiment analysis on those approximately 60000 tweets. The main reactions to BTC were positive during that period.
 27 | * Using the polarity index built on the tweets and the close prices of BTC on the previous day, we built an LSTM model with several dense layers and a stop on the final bias. The neural network performed well in predicting prices both on the train set (April 2018-April 2020), with a mean squared error of approximately 2.25%, and on the test set (May 2020-14 Sept 2020), with a mean squared error of 3.89%.
 28 | * We gathered the predictions from the LSTM model and used them as regressors for the FB Prophet algorithm in order to predict prices from the end of the dataset (15 Sept 2020) until 30 November 2020. We used this method as a validation/test of the robustness of the predictions output by our LSTM model. The predicted prices were close to the real prices gathered on Google Finance for this new time period, but only for about two weeks (until the beginning of October). This can be explained empirically by the large noise component of the time series data, which reflects the unexpected bubble "crescendo" trend of BTC during the last three months.
 29 | * As a final note, we reflect on the difficulty of predicting 1-day forward returns for BTC prices (multi-step model) due to the high volatility of the series, which, when managed, creates a high bias in the architecture of the model. We will come back to this point in a future project.
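
The feature construction described in the overview (a tweet polarity index paired with the previous day's BTC close) can be sketched as follows. This is a minimal illustration in plain Python, not the project's actual pipeline: the function name `build_lagged_samples` and the toy values are hypothetical, and it assumes that both features are taken from the previous day.

```python
from typing import List, Tuple


def build_lagged_samples(
    closes: List[float], polarity: List[float]
) -> Tuple[List[Tuple[float, float]], List[float]]:
    """Pair (previous-day close, previous-day polarity) with today's close.

    Both inputs are assumed to be aligned day by day. The first day has no
    predecessor, so it is only used as a feature for the second day.
    """
    if len(closes) != len(polarity):
        raise ValueError("closes and polarity must have the same length")
    features = [(closes[t - 1], polarity[t - 1]) for t in range(1, len(closes))]
    targets = [closes[t] for t in range(1, len(closes))]
    return features, targets


# Hypothetical toy data: four days of closing prices and polarity scores.
toy_closes = [100.0, 102.0, 101.0, 105.0]
toy_polarity = [0.10, 0.30, -0.05, 0.20]

X, y = build_lagged_samples(toy_closes, toy_polarity)
print(X)  # [(100.0, 0.1), (102.0, 0.3), (101.0, -0.05)]
print(y)  # [102.0, 101.0, 105.0]
```

Samples built this way keep their chronological order, which is what allows a date-based train/test split like the one used in the project.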
 30 | 
 31 | 
 32 | ## Code and resources used:
 33 | **Python version:** 3.7
 34 | **Packages/Libraries:** pandas, numpy, plotly, seaborn, matplotlib, tweepy, prophet, tensorflow, keras, scikit-learn, textblob, wordcloud.
 35 | **Datasets involved:** Kaggle historical datasets on prices of BTC and other top coins, datasets of tweets built from several scraped Twitter accounts.
 36 | 
 37 | ### Sidenotes: 
 38 | The datasets used are described in the slides of the [final presentation](https://github.com/Andreoliveira85/Sentiment-analysis-of-Tweets-and-prediction-of-bitcoin-prices/blob/main/JEDHA_FINAL_PROJECT_FULLSTACK_FINAL_PRESENTATION-1-14.pdf). We could not upload the datasets here due to their size. We invite the visitor to check the graphics and visualizations created in the notebooks and displayed in the final presentation.
 39 | 
 40 | * The following chart shows the distribution of trading volume among the main cryptocurrencies. It reveals the predominance of Tether over BTC. Tether is widely used in East Asia as a means of transaction due to its parity close to 1 USD.
 41 | 
 42 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/pieplot_currencies.png)
 43 | 
 44 | * The following boxplot indicates the volume of the coins. 
 45 | 
 46 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/currencies_boxplot.png)
 47 | 
 48 | * The following scatter plot (log scale to overcome the high skewness of the cloud around BTC) shows the predominance of BTC followed by the other coins when considering the variables close price and volume in the crypto space.
 49 | 
 50 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/scatter_plot_volume_close_log.png)
 51 | 
 52 | * The following graph shows the evolution of BTC close prices over the last years. We note the meteoric growth from 2016 to 2017 followed by a steep decline. During 2020 we observe a rapidly increasing trend of the coin, and around the first quarter we can see the influence of the COVID pandemic crisis on prices.
 53 | 
 54 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/bitcoin_prices.png)
 55 | 
 56 | * The following plot shows the sentiment analysis that we performed on the tweets of the 13 institutional and private accounts.
 57 | The index that we built from the classification of the tweets tends to be positive on BTC.
 58 | 
 59 | ![my_image](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/sentiment_analysis_tweets2.jpg) 
 60 |  
 61 |  
 62 | * Split between train and test set:
 63 | 
 64 | 
 65 | * split <img src="https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/training_test_dataset.jpg" />
 66 | 
 67 | 
 68 | 
 69 | * The results of the LSTM model on the train set (April 2018-April 2020 approx.). The model learned quite well.
 70 | 
 71 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/model_training_set.jpg)
 72 | 
 73 | * The results of our model on the test set (April 2020- mid Sept 2020). The predictions follow the trends. There is a time period (August-September) where the predictions deviate far from the actual prices, but then the actual prices and the predictions start to converge again in the last two weeks of the dataset, almost coalescing at the end.
 74 | 
 75 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/model_test_dataset.jpg)
 76 | 
 77 | * The loss function (Mean squared error) on train (blue) and test (orange) set (per epoch). Scale 10^-3
 78 | 
 79 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/loss_function.jpg)
 80 | 
 81 | * About the 1 day forward returns of the BTC:
 82 | 
 83 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/returns_new_model_train_set.jpg)
 84 | 
 85 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/returns_test_set_new_model.jpg)
 86 | 
 87 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/fig2.jpg)
 88 | 
 89 | 
 90 | 
 91 | * The predictions built by our model were used as regressors to feed the FB Prophet algorithm. The seasonality effects of the predictions based on the LSTM predictors are shown below. Monthly and longer trends should be disregarded since the coin is not mature enough for such extrapolations. It is nevertheless interesting to note the weekly pattern of the price going down on Thursdays and up during the weekends.
 92 | 
 93 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/prophet_model2.jpg)
 94 | 
 95 | * The results of the Prophetization of our predictions:
 96 | 
 97 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/prophet_model.jpg)
 98 | 
 99 | * The predictions of the FB Prophet model fed by the outputs of the LSTM net we built, compared with the actual prices from 15 Sept 2020 until 30 November 2020 (fetched from Google Finance)
100 | 
101 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/prophets_results_reality.jpg)
102 | 
103 | **Conclusion:** when the predictions of our LSTM model are used as regressors in the FB Prophet algorithm, the resulting predictions are close to the actual closing value of BTC according to Google Finance for a short period (15 Sept until the beginning of October). After that, the actual value rises much faster than the predictions. This can be understood empirically as a consequence of the large noise component in this time series and the recent bubble effect registered with BTC in the crypto space, which is steeper than the sharp rise in prices in 2017. According to this model this would be a good time to sell.
104 | 
105 | 
106 | # [Project 2. NYSE Prediction of Prices](https://github.com/Andreoliveira85/NYSE-prediction-prices)  <a name="P2"></a>
107 | Deep Learning project
108 | 
109 | ## Project overview:
110 | 
111 | We use two datasets of historical stock market data from the New York Stock Exchange for the year 2016. We performed an exploratory data analysis of those stocks and created a neural network that mixes GRU and LSTM models to predict prices. Generally speaking, the predictions and the real values follow the same upward or downward trends, and around certain periods (before the 50th day of that year and around day 130) they almost coalesce.
112 | 
113 | * Performance of the algorithm:
114 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/NYSE_prices.jpg)
115 | 
116 | ## Code and resources used:
117 | **Python version:** 3.7
118 | **Packages/libraries:** numpy, pandas, matplotlib, scikit-learn, math, keras.
119 | **Datasets:** available at Kaggle
120 | 
121 | 
122 | 
123 | # [Project 3. Electronic Signature of Loans](https://github.com/Andreoliveira85/electronic-signature-loans)  <a name="P3"></a>
124 | ## Project description: 
125 | 
126 | Comparison of several machine learning algorithms for predicting (as a classification problem) the electronic signature of contracts. The dataset is a collection of financial information on clients of an anonymous private firm. The ML algorithms used were an ANN, support vector machines and ensemble learning techniques: random forest classifiers, gradient boosting and AdaBoost. The best-scoring algorithms were our neural network and the random forest classifier.
127 | 
128 | ## Code and resources used:
129 | **Python version:** 3.7
130 | **Packages/libraries:** numpy, pandas, matplotlib, scikit-learn, math, keras.
131 | **Datasets:** dataset from anonymous private company available in the repo.
132 | 
133 | 
134 | 
135 | # [Project 4. Hybrid Mutual Fund Analysis](https://github.com/Andreoliveira85/Hybrid-Mutual-fund-Analysis)   <a name="P4"></a>
136 | ## Project overview: 
137 | ### Aim (Exploratory Data Analysis Project):
138 |   * Analyse various parameters of the Hybrid Mutual Fund dataset and distinguish between good and bad schemes.
139 |   
140 |   * A hybrid fund is an investment fund that is characterized by diversification among two or more asset classes. These funds typically invest in a mix of stocks and bonds. The term hybrid indicates that the fund strategy includes investment in multiple asset classes. These funds offer investors an option for investing in multiple asset classes through a single fund. These funds can offer varying levels of risk tolerance ranging from conservative to moderate and aggressive. We carry out a thorough exploratory visual analysis of a dataset of funds traded by a firm in order to identify good and bad schemes according to several criteria. The exploration is done through multivariate analysis of the different variables of the dataset.
141 | 
142 | ## Preview of some visualizations:
143 | 
144 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/hybrid1.jpg)
145 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/hybrid2.jpg)
146 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/hybrid3.jpg)
147 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/hybrid4.jpg)
148 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/hybrid5.jpg)
149 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/hybrid6.jpg)
150 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/hybrid7.jpg)
151 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/hybrid8.jpg)
152 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/hybrid9.jpg)
153 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/hybrid10.jpg)
154 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/hybrid11.jpg)
155 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/hybrid12.jpg)
156 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/hybrid13.jpg)
157 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/hybrid14.jpg)
158 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/hybrid15.jpg)
159 | 
160 | 
161 | 
162 | ## Code and resources used:
163 | **Python version:** 3.7
164 | **Packages/libraries:** numpy, pandas, matplotlib, seaborn.
165 | **Datasets:** available at the repo
166 | 
167 | 
168 | 
169 | 
170 | # [Project 5. Linear Regression in E-commerce](https://github.com/Andreoliveira85/Project-Linear-regression-in-E-commerce)   <a name="P5"></a>
171 | 
172 | ## Project description:
173 | 
174 | An e-commerce company based in New York City sells clothing online but also offers in-store style and clothing advice sessions.
175 | Customers come in to the store, have sessions/meetings with a personal stylist, then go home and order the clothes they want either on a mobile app
176 | or on the website. The company is trying to decide whether to focus its efforts on the mobile app experience or on the website. We use linear
177 | regression to quantify the linear dependence of the target variable "yearly amount spent", which captures the investment done, on the other variables
178 | of the dataset. The score (adjusted R2) is very high, 98%, on both the train and test sets (with 20% of the data held out for validation). We also apply regularization
179 | to the model (which in this case is not strictly required, since the model does not overfit). Lasso regularization with hyperparameter 1 yields a slight
180 | underfitting (score on the train set: 0.981235530537366 vs score on the test set: 0.9787641440205315), and the best performance for Ridge regularization
181 | is obtained with the vanishing coefficient 0, which reduces it to the classical linear regression model that we built in the first try.
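
To illustrate why a Ridge penalty of 0 falls back to ordinary linear regression, here is a minimal one-feature sketch in plain Python. It is not the project's code (which relies on scikit-learn): the function name `ridge_slope` and the toy data are hypothetical, and the intercept is assumed to be unpenalized.

```python
from typing import Sequence


def ridge_slope(x: Sequence[float], y: Sequence[float], alpha: float) -> float:
    """Slope of a one-feature ridge fit with an unpenalized intercept.

    After centering x and y, the closed form is Sxy / (Sxx + alpha).
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return sxy / (sxx + alpha)


# Hypothetical toy data lying exactly on the line y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

ols_slope = ridge_slope(x, y, alpha=0.0)     # alpha = 0: ordinary least squares
shrunk_slope = ridge_slope(x, y, alpha=5.0)  # alpha > 0: slope shrinks toward 0
print(ols_slope, shrunk_slope)  # 2.0 1.0
```

With `alpha=0.0` the penalty term vanishes and the slope is the least-squares one; any positive `alpha` shrinks it, which only helps when the unpenalized model overfits.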
182 | 
183 | 
184 | ## Some visualizations:
185 | 
186 | * Bivariate EDA
187 | 
188 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/linear_model_yearly_amount_spent_vs_time_website.jpg)
189 | 
190 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/linear_model2.jpg)
191 | 
192 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/linear_model3.jpg)
193 | 
194 | * Empirical Evidence supporting linear regression:
195 | 
196 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/linear_model_evidence_regression.jpg)
197 | 
198 | * Gaussian residuals of the linear model:
199 | 
200 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/linear_model_gaussian_residuals.jpg)
201 | 
202 | * Coefficients of the linear model:
203 | 
204 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/linear_model_coeff.jpg)
205 | 
206 | ## Code and resources used:
207 | **Python version:** 3.7
208 | 
209 | **Libraries/packages used:** pandas, numpy, matplotlib, seaborn, scikit-learn.
210 | 
211 | **Dataset:** available at the repo.
212 | 
213 | 
214 | 
215 | 
216 | # [Project 6. Advertisement with ML](https://github.com/Andreoliveira85/Advertisement-with-ML)   <a name="P6"></a>
217 | ## Project description:
218 | In this project we will be working with a fake advertising data set, indicating whether or not a particular internet user clicked on an Advertisement on a company website. We will try to create a model that will predict whether or not they will click on an ad based off the features of that user (classification problem).
219 | 
220 | This data set contains the following features:
221 | 
222 | * 'Daily Time Spent on Site': consumer time on site in minutes
223 | * 'Age': customer age in years
224 | * 'Area Income': Avg. Income of geographical area of consumer
225 | * 'Daily Internet Usage': Avg. minutes a day consumer is on the internet
226 | * 'Ad Topic Line': Headline of the advertisement
227 | * 'City': City of consumer
228 | * 'Male': Whether or not consumer was male
229 | * 'Country': Country of consumer
230 | * 'Timestamp': Time at which consumer clicked on Ad or closed window
231 | * 'Clicked on Ad': 0 or 1 indicated clicking on Ad
232 | 
233 | ## Methodologies: 
234 | 
235 | We approach the classification problem of predicting the variable 'Clicked on Ad' for the visitors of the website using different classification
236 | algorithms: logistic regression and random forest classifiers. After an initial EDA we construct an ML pipeline where the dataset is split with a 34% test size. The logistic model shows no underfitting or overfitting. The random forest classifier performs better (96% score) on the validation test set
237 | than the logistic model (90%). Using the random forest model we build a hierarchy of feature importances. We conclude that the most influential variable for predicting
238 | clicks on the ad is the daily internet usage of the user visiting the website. Since this is a variable that we cannot control, we decided to rerun the random forest model with this variable removed from the set of explanatory features. The score of the model goes down slightly in this version. Interestingly, however, in the new feature importance hierarchy the variable area income surpasses the variable age (which does not happen in the feature importances of the first model). The third part of this project concerns A/B hypothesis testing, where we analyse different confidence intervals (5% and 40% associated risk levels respectively) for the proportion of clicks by male and female visitors. For the smallest risk level, 5%, nothing can be concluded. For the higher risk level of 40%, however, we observe empirically by random sampling that the proportion of clicks on the ad by females surpasses that by males. This can be seen as a first indication of a future marketing strategy to be implemented. However, the chi-squared test at 5% risk tells us that the variables "click on the ad" and "gender" are independent, which concludes this discussion.
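
The effect of the risk level on the confidence interval can be sketched with a normal-approximation interval for the difference of two click proportions. This is an illustrative sketch only: the counts below are hypothetical (not the project's data), the helper name `diff_proportion_ci` is made up, and `statistics.NormalDist` requires Python 3.8 or later.

```python
from math import sqrt
from statistics import NormalDist  # available from Python 3.8


def diff_proportion_ci(clicks_a, n_a, clicks_b, n_b, risk):
    """Two-sided normal-approximation CI for p_a - p_b at the given risk level."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - risk / 2)  # e.g. about 1.96 for risk = 0.05
    diff = p_a - p_b
    return diff - z * se, diff + z * se


# Hypothetical counts (not the project's data): clicks / visitors per gender.
female_clicks, female_n = 270, 520
male_clicks, male_n = 230, 480

low_5, high_5 = diff_proportion_ci(female_clicks, female_n, male_clicks, male_n, risk=0.05)
low_40, high_40 = diff_proportion_ci(female_clicks, female_n, male_clicks, male_n, risk=0.40)

print(low_5 < 0 < high_5)  # True: at 5% risk the interval contains 0
print(low_40 > 0)          # True: at 40% risk the interval lies above 0
```

A higher risk level gives a narrower interval, so a difference that is inconclusive at 5% can appear at 40%, at the price of a much weaker guarantee.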
239 | 
240 | ## Visualizations:
241 | 
242 | * Multivariate EDA: Age vs area income
243 | 
244 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/adv_area_income_age.jpg)
245 | 
246 | * Multivariate EDA: Age vs daily time spent on site
247 | 
248 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/daily_time_site_age.jpg)
249 | 
250 | * Multivariate EDA: Daily internet usage vs daily time spent on site
251 | 
252 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/daily_time_site_daily_int_usage.jpg)
253 | 
254 | * Classification report of the logistic regression model on the test validation set:
255 | 
256 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/classification_report_logistic_model.jpg)
257 | 
258 | *  Classification report of the random forest classifier on the test validation set:
259 | 
260 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/classification_report_rf.jpg)
261 | 
262 | 
263 | * Feature importance hierarchy using the random forest classifier:
264 | 
265 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/feature_imp1.jpg)
266 | 
267 | *  Feature importance hierarchy from the random forest classifier with the variable "daily internet usage of the user" removed:
268 | 
269 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/feature_imp_2.jpg)
270 | 
271 | * Empirical click rate on the ad per gender:
272 | 
273 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/empirical_click_rate.jpg)
274 | 
275 | * Hypothesis testing for the click rates with 5% risk on the confidence level:
276 | 
277 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/hyp_test1.jpg)
278 | 
279 | * Hypothesis testing for the click rates with 40% risk on the confidence level:
280 | 
281 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/hyp_testing2.jpg)
282 |  
283 | 
284 | 
285 | 
286 |  
287 | 
288 | 
289 | 
290 | 
291 | ## Code and resources used:
292 | **Python version:** 3.7
293 | **Libraries/packages used:** pandas, numpy, matplotlib, seaborn, scikit-learn
294 | **Dataset:** available at the repo
295 | 
296 | 
297 | 
298 | 
299 | # [Project 7. Detecting fraud with ML](https://github.com/Andreoliveira85/Detecting-fraud-with-ML)  <a name="P7"></a>
300 | ## Project description: 
301 | 
302 | We use a dataset from an anonymous private company in order to classify fraudulent clients, taking other variables such as age, gender, browser used to shop, among
303 | others, as explanatory variables.
304 | 
305 | ## Methodologies:
306 | 
307 | After cleaning the dataset we start with an exploratory visual analysis of the data (univariate and multivariate). We register 9.4% of fraudulent clients in the dataset. After a feature engineering pipeline built specifically to handle the difference between the categorical and the numerical explanatory variables (KBinsDiscretizer from scikit-learn), we build a Bernoulli Naive Bayes model with a 30% split between train and test sets. The Naive Bayes model performs well, with a fully populated confusion matrix that detects all the classes of this unbalanced dataset. The Naive Bayes model predicts a fraud rate of 9.36% (the empirical fraud rate is 9.4%); the false negative rate is 4.11% and the false positive rate is 4.74%. The ROC (receiver operating characteristic) score for this model is 75% on the predicted fraud probabilities on the test set. About the hierarchy of feature importance:
308 | time_delta of the purchases, the country with higher fraud percentages, the source and the browser of the purchase, and age are the most important variables according to the Bernoulli Naive Bayes model for understanding and predicting fraud probabilities. As a second approach we use support vector machines to predict fraud. This second model has a score of 0.931111307186659. We use GridSearchCV for hyperparameter tuning and retrieve as optimal parameters {'C': 50, 'gamma': 0.005}. For these optimal parameters the score on the train set is 0.9071356992947494
309 | against a (non-overfitting) score on the test set of 0.9073101866149027.
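
As a reminder of how such rates relate to a confusion matrix, here is a minimal sketch in plain Python. The counts are hypothetical, the helper name `confusion_rates` is made up, and the per-class denominators used for the false positive and false negative rates are the standard textbook ones, which is an assumption: the text above does not say which denominators were used in the project.

```python
def confusion_rates(tp, fp, tn, fn):
    """Basic rates derived from binary confusion-matrix counts (fraud = positive)."""
    total = tp + fp + tn + fn
    return {
        "empirical_fraud_rate": (tp + fn) / total,  # share of actual frauds
        "predicted_fraud_rate": (tp + fp) / total,  # share flagged as fraud
        "false_positive_rate": fp / (fp + tn),      # legit clients flagged as fraud
        "false_negative_rate": fn / (fn + tp),      # frauds that were missed
    }


# Hypothetical counts for 1000 clients (not the project's results).
rates = confusion_rates(tp=50, fp=40, tn=860, fn=50)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

On an unbalanced dataset the predicted and empirical fraud rates can be close even when many individual frauds are missed, so they are best read together with the confusion matrix itself.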
310 | 
311 | * Fraud rate detected in the EDA
312 | 
313 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/fraud_rate.jpg)
314 | 
315 | * Confusion matrix and scores using Naive Bayes model:
316 | 
317 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/fraud_confusion_matrix_naive_abyes.jpg)
318 | 
319 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/fraud_rate_naive_bayes.jpg)
320 | 
321 | * Feature importance hierarchy using Naive Bayes model:
322 | 
323 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/fraud_feature_imp_naive_bayes.jpg)
324 | 
325 | 
326 | * ROC using Naive Bayes model:
327 | 
328 | 
329 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/roc_fraud.jpg)
330 | 
331 | 
332 | * Scores using SVM classifiers with GridSearch:
333 | 
334 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/sVC_score_grid_search.jpg)
335 | 
336 | 
337 | ## Resources and code used:
338 | 
339 | **Python version:** 3.7
340 | **Libraries/packages used:** pandas, numpy, scikit-learn, matplotlib, seaborn
341 | **Dataset:** available at the repo
342 | 
343 | 
344 | 
345 | 
346 | 
347 | 
348 | # [Project 8. NLP MiniProject](https://github.com/Andreoliveira85/NLP-Miniproject) <a name="P8"></a>
349 | ## Project description:
350 | 
351 | In this NLP project we will be attempting to classify Yelp Reviews into 1 star or 5 star categories based off the text content in the reviews.
352 | 
353 | Each observation in this dataset is a review of a particular business by a particular user.
354 | 
355 | The "stars" column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher stars is better.) In other words, it is the rating of the business by the person who wrote the review.
356 | 
357 | The "cool" column is the number of "cool" votes this review received from other Yelp users. 
358 | 
359 | All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.
360 | 
361 | The "useful" and "funny" columns are similar to the "cool" column.
362 | 
363 | ## Methodologies:
364 | 
365 | After bivariate visualization analysis we construct a ML Pipeline based on the vectorization of the text of the reviews and a Multinomial Naive Bayes model to predict probability of bad reviews (1 star) or good reviews (5 stars). The classification report is below:  
366 |             
367 |             
368 |             
369 |                 precision    recall  f1-score   support
370 | 
371 |                1       0.91      0.68      0.78       228
372 |                5       0.93      0.98      0.96       998
373 | 
374 |         accuracy                           0.93      1226
375 |        macro avg       0.92      0.83      0.87      1226
376 |     weighted avg       0.93      0.93      0.92      1226
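
The `macro avg` and `weighted avg` rows of the report can be recomputed from the two per-class rows. The sketch below uses only the rounded values printed in the report; the helper names are hypothetical.

```python
# Per-class rows copied from the classification report:
# label -> (precision, recall, f1-score, support)
report_rows = {
    "1": (0.91, 0.68, 0.78, 228),
    "5": (0.93, 0.98, 0.96, 998),
}


def macro_average(rows, column):
    """Unweighted mean of one metric column across classes."""
    values = [row[column] for row in rows.values()]
    return sum(values) / len(values)


def weighted_average(rows, column):
    """Mean of one metric column weighted by class support (last column)."""
    total_support = sum(row[3] for row in rows.values())
    return sum(row[column] * row[3] for row in rows.values()) / total_support


macro = [round(macro_average(report_rows, c), 2) for c in range(3)]
print(macro)                                       # [0.92, 0.83, 0.87]
print(round(weighted_average(report_rows, 0), 2))  # 0.93 (weighted precision)
```

Because the inputs are already rounded to two decimals, weighted averages recomputed this way can differ from the report in the last digit. The gap between the macro recall (0.83) and the accuracy (0.93) reflects the lower recall on the minority 1-star class.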
377 | 
378 | ## Code and resources used:
379 | 
380 | **Python version:** 3.7
381 | **Libraries/packages:** pandas, numpy, scikit-learn.
382 | **Dataset:** available at the repo
383 | 
384 | 
385 | 
386 | 
387 | 
388 | 
389 | 
390 | # [Project 9. Classification of galaxies, stars and quasars](https://github.com/Andreoliveira85/classification-galaxies-stars-and-quasars)   <a name="P9"></a>
391 | ## Project description: 
392 | 
393 | This project aims to use machine learning classification techniques to classify stellar objects: galaxies, stars and quasars. We are using data from the Sloan Digital Sky Survey. The Sloan Digital Sky Survey is a project which offers public data of space observations. 
394 | 
395 | For this purpose a special 2.5 m diameter telescope was built at the Apache Point Observatory in New Mexico, USA. The telescope uses a camera of 30 CCD-Chips with 2048x2048 image points each. The chips are ordered in 5 rows with 6 chips in each row. Each row observes the space through different optical filters (u, g, r, i, z) at wavelengths of approximately 354, 476, 628, 769, 925 nm.
396 | 
397 | The telescope covers around one quarter of the earth's sky - therefore focuses on the northern part of the sky.
398 | 
399 | **For more information about this project - please visit their website. Our dataset is provided there**
400 | 
401 | http://www.sdss.org/
402 | 
403 | 
404 | 
405 | ## Methodologies:
406 | 
407 | We start by exploring the data visually in order to understand patterns in the quantitative variables gathered from the measurements. To solve the classification problem, after some feature engineering and a train/test split of 33% size, we use the ensemble learning XGBoost algorithm to retrieve the feature importance hierarchy. In a second step, to validate the model, we use GridSearchCV with 2 folds and obtain a score on the test set of 0.9904477611940299 for the first fold and 0.991044776119403 for the second.
408 | 
409 | * Equatorial coordinates of the 3 class objects
410 | 
411 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/equatorial_coords.jpg)
412 | 
413 | * Feature importance of the main explanatory variables
414 | 
415 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/feature_importance_galaxies.jpg)
416 | 
417 | * Correlation between the main variables
418 | 
419 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/correlations.jpg)
420 | 
421 | ## Code and resources used:
422 | **Python version:** 3.7
423 | **Packages/libraries used:** pandas, numpy, matplotlib, seaborn, scikit-learn
424 | **Dataset:** available at http://www.sdss.org/
425 | 
426 | 
427 | 
428 | 
429 | # [Project 10. Miniproject DL: Breast cancer detection](https://github.com/Andreoliveira85/MiniProject-DL-Breast-cancer-detection)   <a name="P10"></a>
430 | 
431 | 
432 | ## Project description:
433 | 
434 | This miniproject uses a CNN model to classify breast cancer cases on the public breast cancer dataset available in the `datasets` module of scikit-learn.
435 | 
436 | ## Methodologies:
437 | 
438 | After the usual ML pipeline with a train/test split (20% test size), we train a CNN that reaches a binary cross-entropy loss of 0.0496 and an accuracy of 0.9805 on the train set, against a loss of 0.0747 and an accuracy of 0.9737 on the test (validation) set. We compare the predictions with the true values on the test set and plot the loss and accuracy metrics on the train and test sets.
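
To make the two reported metrics concrete, here is a small dependency-free sketch of how binary cross-entropy and accuracy are computed from predicted probabilities. The labels and probabilities are made up for illustration and are not outputs of the trained network.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded probability matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Hypothetical labels (1 = positive class) and model outputs.
labels = [1, 0, 1, 1, 0]
probabilities = [0.95, 0.10, 0.80, 0.40, 0.05]

print(binary_cross_entropy(labels, probabilities))
print(accuracy(labels, probabilities))
```

A confident wrong prediction is penalised much more heavily by the loss than by the accuracy, which is why both curves are usually plotted together.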
439 | 
440 | * Performance of the algorithm:
441 | 
442 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/breast_cancer.jpg)
443 | 
444 | ## Resources and code used:
445 | **Python version:** 3.7
446 | **Libraries/packages used:** pandas, numpy, tensorflow, keras, scikit-learn
447 | 
448 | 
449 | 
450 | # [Project 11. DL project: YOLO object detection](https://github.com/Andreoliveira85/NLP-Project-with-Deep-Learning)  <a name="P11"></a>
451 | ## Project description:
452 | ### Object Detection with YoloV3
453 | 
454 | 
455 | The detection of objects in an image is one of the major application areas of Deep Learning.
456 | 
457 | The principle is simple: in addition to training an algorithm to detect and tell what is in an image, it will be trained to tell where the object is in the image:
458 | 
459 | ![Object Detection](https://drive.google.com/uc?export=view&id=1G-mbb6drlUlXsMdg8Xld4p4EGYo8einf)
460 | 
461 | To do so, we will implement an algorithm called YoloV3.
462 | 
463 | However, setting up the whole training process of the algorithm is very difficult. That is why we learn how to use it through this GitHub repository:
464 | 
465 | [Implement YoloV3](https://github.com/zzh8829/yolov3-tf2)
466 | 
467 | So our goal is to:
468 | 
469 |   1. Clone this repository to our local machine
470 |   2. Use it for image detection.
471 |   3. Then try to do the same thing with video detection.
472 |   
473 |   ## Code/resources used:
474 |   **Python version:** 3.7
475 |   **Libraries/packages:** numpy, tensorflow, keras
476 | 
477 | 
478 | # [Project 12. Tinder recommendation system](https://github.com/Andreoliveira85/Tinder-recommendation-system)   <a name="P12"></a>
479 | ## Company's Description 📇
480 | 
481 | <a href="https://tinder.com/?lang=en" target="_blank">Tinder</a> is one of the most famous online dating applications in the world right now. The whole idea is to be able to anonymously swipe right or left on a person to show whether you like them or not. If both people swipe right: *It's a match*! And then you can start chatting! 
482 | 
483 | This whole concept revolutionized the way young people date. Founder <a href="https://www.crunchbase.com/person/sean-rad" target="_blank">Sean Rad</a> believed that *"no matter who you are, you feel more comfortable approaching somebody if you know they want you to approach them."*
484 | 
485 | With over 50 million users (80%+ aged 16 to 34), Tinder's valuation is around $1.4 billion, which makes this start-up one of the most famous unicorns in California as of today. 😮
486 | 
487 | ## Goals 🎯
488 | Our goal is to
489 | 
490 | * Recommend 10 best profiles for a given user 
491 | 
492 | ## Scope of this project 🖼️
493 | 
494 | The dataset is available here:
495 | 
496 | 
497 | 👉👉 <a href="https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Unsupervised_Learning/Tinder_data.zip" target="_blank">Tinder Data</a> 👈👈
498 | 
499 | The files contain 17,359,346 anonymous ratings of 168,791 profiles made by 135,359 LibimSeTi users. The zip file contains data on the gender of each person as well as the ratings. 
500 | 
501 | ### Introduction to Recommendation engines 
502 | 
503 | It is now time to discuss Recommendation Engines. 
504 | There are two types of recommendation engines: 
505 | 
506 | 1. Collaborative Filtering 
507 | 2. Content Based 
508 | 
509 | ![](https://miro.medium.com/max/690/1*G4h4fOX6bCJhdmbsXDL0PA.png)
510 | 
511 | 
512 | ### Collaborative filtering principles 
513 | 
514 | We start this project with Collaborative Filtering. The idea is to recommend a product based on other users' reviews. Let us see the idea behind it in [this explanatory gif](https://www.kdnuggets.com/2019/09/machine-learning-recommender-systems.html#:~:text=Recommender%20systems%20are%20an%20important,to%20follow%20from%20example%20code.) from KDnuggets: 
515 | 
516 | ![](https://miro.medium.com/max/623/1*hQAQ8s0-mHefYH83uDanGA.gif)
517 | 
518 | 
519 | Instead of having "products" to recommend, this time, we will recommend people!
520 | 
521 | ### Build a utility matrix 
522 | 
523 | Our goal is to create a recommendation engine built on a <a href="https://towardsdatascience.com/math-for-data-science-collaborative-filtering-on-utility-matrices-e62fa9badaab" target="_blank">utility matrix</a>. It should look something like this: 
524 | 
525 | <img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/utility_matrix.png"/>
526 | 
527 | ### Machine Learning
528 | 
529 | TruncatedSVD is well suited here due to the sparsity of the utility matrix! 👏 We will apply this algorithm to reduce the dimension and then create a <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html" target="_blank">correlation matrix</a> to see which profiles are correlated and would therefore be a match!  
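
The three steps (utility matrix, dimension reduction, correlation matrix) can be sketched on a toy example. This is only an illustration under stated assumptions: the ratings are invented, the truncation is done with `numpy.linalg.svd` by keeping the leading singular directions rather than with scikit-learn's `TruncatedSVD`, and `profile_correlations` and `recommend` are hypothetical helper names.

```python
import numpy as np

# Hypothetical (rater_id, profile_id, rating) triples; the real files hold millions.
ratings = [
    (0, 0, 9), (0, 1, 8), (0, 2, 2), (0, 4, 7),
    (1, 0, 8), (1, 1, 9), (1, 3, 1), (1, 4, 6),
    (2, 2, 9), (2, 3, 8), (2, 0, 1), (2, 5, 7),
    (3, 2, 8), (3, 3, 9), (3, 1, 2), (3, 5, 9),
    (4, 0, 7), (4, 4, 9), (4, 5, 3), (4, 1, 6),
    (5, 3, 6), (5, 5, 8), (5, 2, 7), (5, 4, 2),
]
n_raters, n_profiles = 6, 6

# Utility matrix: one row per rater, one column per profile, 0 means "not rated".
utility = np.zeros((n_raters, n_profiles))
for rater, profile, rating in ratings:
    utility[rater, profile] = rating

def profile_correlations(utility, n_components):
    """Embed each profile in n_components dimensions, then correlate the embeddings."""
    u, s, _ = np.linalg.svd(utility.T, full_matrices=False)
    embedding = u[:, :n_components] * s[:n_components]  # profiles x n_components
    return np.corrcoef(embedding)                        # profiles x profiles

def recommend(profile_id, corr, top_n=10):
    """Profiles most correlated with profile_id, excluding itself."""
    ranked = np.argsort(-corr[profile_id])
    return [int(p) for p in ranked if p != profile_id][:top_n]

corr = profile_correlations(utility, n_components=3)
print(recommend(0, corr, top_n=3))
```

At least three components are kept here because the correlation between two-dimensional vectors is always +1 or -1 and would carry no ranking information.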
530 | 
531 | 
532 | ## Deliverable 📬
533 | 
534 | Goals: 
535 | 
536 | * Have built a utility matrix 
537 | * Have created a correlation matrix 
538 | * Recommend a list of 10 profiles for a random user 
539 | 
540 | 
541 | 
542 | ## Code/resources used:
543 | **Python version:** 3.7
544 | **Libraries/packages used:** pandas, numpy, scikit-learn
545 | 
546 | # [Project 13. Clustering for Uber Pickups](https://github.com/Andreoliveira85/Uber-Pickups)   <a name="P13"></a>
547 | ## Aim of the project:
548 | 
549 | One of the main pain points that Uber's team found is that sometimes drivers are not around when users need them. For example, a user might be in San Francisco's Financial District whereas Uber drivers are looking for customers in Castro.  
550 | 
551 | (check out <a href="https://www.google.com/maps/place/San+Francisco,+CA,+USA/@37.7515389,-122.4567213,13.43z/data=!4m5!3m4!1s0x80859a6d00690021:0x4a501367f076adff!8m2!3d37.7749295!4d-122.4194155" target="_blank">Google Maps</a>)
552 | 
553 | Even though the two neighborhoods are not that far apart, users would still have to wait 10 to 15 minutes before being picked up, which is too long. Uber's research shows that users accept a wait of 5-7 minutes; otherwise they cancel their ride. 
554 | 
555 | Therefore, our project aims to build a recommendation system such that **their app would recommend hot-zones in major cities to be in at any given time of day.**  
556 | 
557 | ### We will focus on:
558 | * Creating an algorithm to find hot zones (DBSCAN vs KMeans) 
559 | * Visualizing results on a nice dashboard 
560 | * Using NYC data.
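
As an illustration of how hot zones can be extracted, here is a dependency-free sketch of the k-means (Lloyd's) algorithm on a handful of made-up pickup coordinates. It is not the project's code: the coordinates are hypothetical, latitude/longitude are treated as plain Euclidean coordinates (a reasonable approximation only at city scale), and the farthest-point initialisation is chosen here just to make the run reproducible.

```python
import math

def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm on 2-D points; returns (centroids, labels)."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    # Deterministic farthest-point initialisation (chosen for reproducibility).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))

    labels = []
    for _ in range(n_iter):
        # Assignment step: each pickup joins its nearest centroid.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels

# Hypothetical (latitude, longitude) pickups around two areas of NYC.
pickups = [
    (40.750, -73.990), (40.752, -73.988), (40.748, -73.992),
    (40.645, -73.780), (40.647, -73.782), (40.643, -73.778),
]
centroids, labels = kmeans(pickups, k=2)
print(centroids, labels)
```

The centroids play the role of hot zones. Unlike k-means, DBSCAN does not need the number of clusters in advance; it is driven by a neighbourhood radius and a minimum number of points, and it can label isolated pickups as noise.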
561 | 
562 | ## Visualizations of the clusters by Kmeans (evolution during hourly periods):
563 | 
564 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/uber_kmeans1.jpg)
565 | 
566 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/uber_kmeans2.jpg)
567 | 
568 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/uber_kmeans3.jpg)
569 | 
570 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/uber_kmeans4.jpg)
571 | 
572 | 
573 | ## Visualizations of the clusters using DBSCAN:
574 | 
575 | ![](https://github.com/Andreoliveira85/My-data-machine-learning-portfolio/blob/main/images_folder/uber_dbscan.jpg)
576 | 
577 | 
578 | ## Code/Resources used:
579 | **Python version:** 3.7
580 | **Libraries/packages used:** pandas, numpy, scikit-learn
581 | 
582 | 
583 | 
584 | 
585 | # [Project 14. NLP-Project-with-Deep-Learning: Construction of a Translator machine](https://github.com/Andreoliveira85/NLP-Project-with-Deep-Learning) <a name="P14"></a>
586 | 
587 | ## Project description:
588 | 
589 | According to the Google paper [*Attention is all you need*](https://arxiv.org/abs/1706.03762), you only need layers of Attention to make a Deep Learning model understand the complexity of a sentence. We will try to implement this type of model for our translator. 
590 | 
594 | 
595 | Our data can be found on this link: https://go.aws/38ECHUB
596 | 
597 | ### Preprocessing 
598 | 
599 | The whole purpose of the preprocessing is to express each (French) input sentence as a sequence of indices.
600 | 
601 | i.e. :
602 | 
603 | * je suis malade---> `[123, 21, 34, 0, 0, 0, 0, 0]`
604 | 
605 | This gives a *shape* -> `(batch_size, max_len_of_a_sentence)`.
606 | 
607 | Each index is a number that we assign to a word token. 
608 | 
609 | The zeros correspond to what are called [*padded_sequences*](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences) which allow all word sequences to have the same length (mandatory for the algorithm). 
610 | 
611 | This time, we won't have to *one-hot encode* the target variable. We simply create an index vector similar to the one for the input sentence. 
612 | 
613 | i.e. : 
614 | 
615 | * I am sick ---> `[43, 2, 42, 0, 0]`
616 | 
617 | WARNING: we will, however, need to add a step to our preprocessing. For each sentence we need to add `<start>` and `<end>` tokens to indicate the beginning and end of the sentence. We can do this via `Spacy`.
618 | 
619 | We will use : 
620 | 
621 | * `Pandas` or `Numpy` for reading the text file.
622 | * `Spacy` for Tokenization 
623 | * `Tensorflow` for [padded_sequence](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences) 
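
The preprocessing steps above can be sketched without any dependency. In this illustration a plain whitespace split stands in for the spaCy tokenizer, and the helpers `build_vocab` and `encode` are hypothetical names; zeros are appended on the right, which corresponds to `padding='post'` in `pad_sequences` (its default is `'pre'`).

```python
def build_vocab(sentences):
    """Map each token to a positive integer; 0 is reserved for padding."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab) + 1)
    return vocab

def encode(sentences, vocab, max_len):
    """Turn sentences into index lists of equal length, zero-padded on the right."""
    encoded = []
    for sentence in sentences:
        ids = [vocab[token] for token in sentence.split()][:max_len]
        encoded.append(ids + [0] * (max_len - len(ids)))
    return encoded

# Sentences already wrapped in the <start> and <end> markers.
french = ["<start> je suis malade <end>", "<start> je suis ici <end>"]
vocab = build_vocab(french)
batch = encode(french, vocab, max_len=8)

print(vocab)
print(batch)  # shape: (batch_size, max_len_of_a_sentence)
```

The same two helpers can be applied to the English side, with its own vocabulary, to produce the target sequences.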
624 | 
625 | ### Modeling 
626 | 
627 | For modeling, we will need to set up layers of attention. We will need to: 
628 | 
629 | * Create an `Encoder` class that inherits from `tf.keras.Model`.
630 | * Create a Bahdanau Attention Layer that will be a class that inherits `tf.keras.layers.Layer`
631 | * Finally create a `Decoder` class that inherits from `tf.keras.Model`.
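
To show what the attention layer computes, here is a NumPy sketch of the forward pass of additive (Bahdanau) attention: each encoder output gets a score `v · tanh(h W1 + s W2)`, the scores are turned into weights with a softmax, and the context vector is the weighted sum of the encoder outputs. It is not a `tf.keras` layer; all sizes and the random weights are hypothetical, and a trained layer would learn `W1`, `W2` and `v`.

```python
import numpy as np

def bahdanau_attention(encoder_outputs, decoder_state, W1, W2, v):
    """Additive attention for one sentence (row-vector convention).

    encoder_outputs: (seq_len, enc_units), decoder_state: (dec_units,)
    W1: (enc_units, attn_units), W2: (dec_units, attn_units), v: (attn_units,)
    """
    # One score per source position.
    scores = np.tanh(encoder_outputs @ W1 + decoder_state @ W2) @ v
    # Numerically stable softmax over the source positions.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: attention-weighted sum of the encoder outputs.
    context = weights @ encoder_outputs
    return context, weights

rng = np.random.default_rng(0)
seq_len, enc_units, dec_units, attn_units = 8, 16, 16, 10  # hypothetical sizes
encoder_outputs = rng.normal(size=(seq_len, enc_units))
decoder_state = rng.normal(size=dec_units)
W1 = rng.normal(size=(enc_units, attn_units))
W2 = rng.normal(size=(dec_units, attn_units))
v = rng.normal(size=attn_units)

context, weights = bahdanau_attention(encoder_outputs, decoder_state, W1, W2, v)
print(context.shape, weights.round(3))
```

In the decoder, the context vector is typically combined with the embedding of the previous target token before the recurrent step.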
632 | 
633 | 
634 | We will need to create our own cost function as well as our own training loop. 
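
One common reason for writing a custom cost function in this setting is to ignore the padded positions. The sketch below assumes exactly that (the text does not state it): it averages the cross-entropy over the non-zero target indices only. The function name and the toy probabilities are hypothetical.

```python
import math

def masked_cross_entropy(target_ids, predicted_probs, pad_id=0):
    """Average negative log-likelihood over the non-padding positions only.

    target_ids: list of target word indices for one sentence.
    predicted_probs: one probability distribution (list) per position.
    """
    losses = [
        -math.log(probs[target])
        for target, probs in zip(target_ids, predicted_probs)
        if target != pad_id  # padded positions do not contribute
    ]
    return sum(losses) / len(losses) if losses else 0.0

# Toy example with a vocabulary of size 3 (index 0 is the padding index).
targets = [1, 2, 0]
probs = [
    [0.1, 0.8, 0.1],    # position 0: correct word has probability 0.8
    [0.1, 0.1, 0.8],    # position 1: correct word has probability 0.8
    [0.9, 0.05, 0.05],  # position 2: padding, ignored
]
print(masked_cross_entropy(targets, probs))
```

Inside a custom training loop, a loss of this kind would be evaluated on each batch before computing gradients and updating the encoder and decoder weights.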
635 | 
636 | 
637 | ### Tips 
638 | 
639 | We will not take the whole dataset at the beginning of our experiments; we take just 5000 or even 3000 sentences. This allows us to iterate faster and avoid bugs that are simply related to a lack of computing power. 
640 | 
641 | Also, we acknowledge the inspiration from the [Neural Machine Translation with Attention](https://www.tensorflow.org/tutorials/text/nmt_with_attention) tutorial from TensorFlow. 
642 | 
643 | ## Code/resources used:
644 | **Python version:** 3.7
645 | **Libraries/packages used:** keras, tensorflow, numpy, pandas, spacy
646 | 
647 | 
648 | # [My first data science project: Anomalies of temperatures in RJ and rainfalls of SP](https://github.com/Andreoliveira85/My-first-data-project) <a name="P15"></a>
649 | 
650 | ## Project description:
651 | 
652 | ### Title of the Project: Anomalies of temperatures in RJ and rainfalls of SP
653 | 
654 | ### Client profile:
655 | Our client is typically a think tank that wants to build from
656 | scratch a decision model to economise the water reservoirs of the
657 | state of Rio de Janeiro in the upcoming years. The aim is to avoid
658 | calamity situations such as the one that happened in SP during the period
659 | of 2014-2017 in the absence of rainfall. The critical temperature anomaly
660 | taken as threshold is the value 0.62. This choice was made a priori as
661 | a legitimate benchmark for a deviation of temperature, since it was the
662 | value registered in 2010, the year that ended the warmest decade
663 | registered on the globe.
664 | See https://www.ncdc.noaa.gov/sotc/global/201013.
665 | 
666 | ### Questions:
667 | 1. How bad were the levels of precipitation in SP (2013-2016)?
668 | 2. How comparable were the values of the temperature anomalies in
669 | SP in that period with historical records?
670 | 3. How do the temperatures in RJ relate to those of SP, and how can we
671 | predict frequent occurrences of temperature anomalies bigger than 0.62?
672 | 
673 | Sources:
674 | * Temperatures (monthly 1961-2017): https://data.giss.nasa.gov/gistemp/
675 | * Pluviosity SP (annual 1961-2017): https://www.kaggle.com/arkanius/rain-intensitty-in-so-paulo/metadata
676 | * CO2 (annual 1961-2017): https://www.kaggle.com/srikantsahu/co2-and-ghg-emission-data?select=emission+data.csv
680 | 
681 | ### Methodology:
682 | 1. Anomalies of temperatures: 3 data sets of different Brazilian cities
683 | were used: SP, RJ and Manaus. The benchmark temperature for
684 | each city was the median temperature registered monthly in each
685 | data set. A temperature anomaly is the absolute value of the
686 | deviations between the registered temperatures and the benchmark
687 | temperature.
688 | 
689 | 2. Import and cleaning of the data. We dealt with different data sets
690 | and some preliminary work was needed to create two final data frames:
691 | one with the annual rainfall precipitation (mm^3) in SP, the annual
692 | averaged temperature anomalies and the levels
693 | of CO2 emissions in Brazil; the second with the temperature
694 | anomalies registered in SP, RJ and Manaus on a monthly basis.
695 | A non-negligible number of entries in the data sets were missing
696 | (as commented in the code), and several adjustments of
697 | the data types and of the corresponding missing values were made during
698 | the cleaning phase.
699 | 
700 | 3. Data visualisation using matplotlib (Python) combined with several
701 | graphics in Tableau. The relevant graphics are shown in the
702 | slides and in the code. Exploring the data visually we draw
703 | several conclusions: the critical values of rainfall in the period after
704 | 2000 were registered in 2014, with a strongly decreasing trend
705 | starting in 2010. At the same time the temperature anomalies in SP
706 | rise with an almost monotonic trend, accompanied by the
707 | temperature anomalies in RJ.
708 | 4. First statistics: the annual temperature anomalies in SP and the CO2 emissions do
709 | not show correlations with the annual rainfalls registered in SP. This
710 | is a point open to discussion, as noted in the code file.
711 | We used Spearman coefficients since our data set of annual records
712 | is not very big and we cannot assume that the
713 | underlying distributions are Normal (we cannot invoke the Central
714 | Limit theorem since the number of observed annual values is
715 | limited). Nevertheless the Spearman coefficients between the
716 | temperature anomalies are high, as commented in the code,
717 | which suggested the use of regression techniques in order to
718 | build predictive models.
719 | 
720 | 5. Predictions using regressions: we use multilinear regression
721 | between the temperature anomalies of SP, RJ and
722 | Manaus. The regression achieved a high score (R^2 above 80%)
723 | and the model predicts that the temperature anomalies of RJ grow
724 | at the same rate as the temperature anomalies of SP.
725 | Please see the slides and the code.
726 | 
727 | 6. Predictions using logistic regression. The logistic model reaches
728 | high precision and accuracy (almost 90%) on
729 | the data set (please see the slides and the code file for the
730 | numbers and for the precise structure, given in the comments of the
731 | implementation of this algorithm). Under it, the probability of getting a
732 | temperature anomaly above 0.62 is modelled by a logistic
733 | relation with the temperature anomalies of SP and Manaus,
734 | where anomalies bigger than 0.62 occur often
735 | (check the confusion matrix in the code file). This means that if in the
736 | near future we expect time series of temperature anomalies
737 | in Rio similar or “very close” to parts of the data set treated here,
738 | then we can expect that under this model those anomalies are
739 | bigger than the critical value 0.62, which was the anomaly registered
740 | in 2010, a remarkably warm year globally that closed the warmest
741 | decade registered on the planet.
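
Two of the building blocks of this methodology, the temperature anomaly and the Spearman coefficient, can be sketched with the standard library only. This is an illustration under stated assumptions: the temperatures are invented, the benchmark is taken as the median of the whole series (the text could also mean a median per calendar month), and the helper names are hypothetical.

```python
from statistics import median

def anomalies(temperatures):
    """Absolute deviation of each reading from the series median (the benchmark)."""
    benchmark = median(temperatures)
    return [abs(t - benchmark) for t in temperatures]

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            result[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman coefficient: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical monthly temperatures for two cities.
sp = [21.0, 22.5, 23.1, 24.0, 22.0]
rj = [24.2, 25.1, 26.0, 26.4, 24.9]

sp_anomalies = anomalies(sp)
above_threshold = [a > 0.62 for a in sp_anomalies]  # critical value from the text
print(sp_anomalies, above_threshold)
print(spearman(anomalies(sp), anomalies(rj)))
```

Because it only uses ranks, the Spearman coefficient does not require the underlying distributions to be Normal, which is the reason given above for preferring it on a small data set.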
742 | 
743 | ### Comments:
744 | This deliverable is just descriptive and does not include the graphics or
745 | quantitative results that were achieved during this exploratory analysis of
746 | the data. Climate (modelled at local or global geographical
747 | scales) is highly nonlinear and mathematically complex, and by
748 | nature cannot be fully captured by linear models. For the future it is
749 | relevant to rethink the way to associate the temperature
750 | anomalies in SP with the CO2 emissions and the annual
751 | rainfalls, in different ways and using different and larger data sets. The first
752 | data frame that we used for a first analysis is too small, and other
753 | important variables such as the deforested area of Amazonia, atmospheric
754 | pressure and exogenous climatic variables such as proxy data given by the
755 | Gulf Stream should also be taken into account.
756 | 
757 | ### Preview of some visualizations [(link for the slides used to present the project)](https://github.com/Andreoliveira85/My-first-data-project/blob/main/andre_presentation_essentials_jedha.pdf)
758 | 
759 | ![](https://github.com/Andreoliveira85/My-first-data-project/blob/main/first_project_image1.jpg)
760 | 
761 | ![](https://github.com/Andreoliveira85/My-first-data-project/blob/main/first_project_image2.jpg)
762 | 
763 | ![](https://github.com/Andreoliveira85/My-first-data-project/blob/main/first_project_image3.jpg)
764 | 
765 | ![](https://github.com/Andreoliveira85/My-first-data-project/blob/main/first_project_image4.jpg)
766 | 
767 | ![](https://github.com/Andreoliveira85/My-first-data-project/blob/main/first_project_image5.jpg)
768 | 
769 | 
770 | 
771 | ## Code/resources used:
772 | **Python version:** 3.7
773 | **Packages:** pandas, numpy, scikit-learn, stats, scipy
774 | **Data vis:** Tableau
775 | 
776 | 
777 |   
778 | 


--------------------------------------------------------------------------------
/images:
--------------------------------------------------------------------------------
1 | images file
2 | 


--------------------------------------------------------------------------------
/images_folder/NYSE_prices.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/NYSE_prices.jpg


--------------------------------------------------------------------------------
/images_folder/README.md:
--------------------------------------------------------------------------------
1 | 
2 | 


--------------------------------------------------------------------------------
/images_folder/adv_area_income_age.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/adv_area_income_age.jpg


--------------------------------------------------------------------------------
/images_folder/bitcoin_prices.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/bitcoin_prices.png


--------------------------------------------------------------------------------
/images_folder/breast_cancer.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/breast_cancer.jpg


--------------------------------------------------------------------------------
/images_folder/classification_report_logistic_model.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/classification_report_logistic_model.jpg


--------------------------------------------------------------------------------
/images_folder/classification_report_rf.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/classification_report_rf.jpg


--------------------------------------------------------------------------------
/images_folder/correlations.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/correlations.jpg


--------------------------------------------------------------------------------
/images_folder/currencies_boxplot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/currencies_boxplot.png


--------------------------------------------------------------------------------
/images_folder/daily_time_site_age.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/daily_time_site_age.jpg


--------------------------------------------------------------------------------
/images_folder/daily_time_site_daily_int_usage.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/daily_time_site_daily_int_usage.jpg


--------------------------------------------------------------------------------
/images_folder/empirical_click_rate.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/empirical_click_rate.jpg


--------------------------------------------------------------------------------
/images_folder/equatorial_coords.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/equatorial_coords.jpg


--------------------------------------------------------------------------------
/images_folder/error_test_set_new_model.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/error_test_set_new_model.jpg


--------------------------------------------------------------------------------
/images_folder/error_train_model_new.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/error_train_model_new.jpg


--------------------------------------------------------------------------------
/images_folder/feature_imp1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/feature_imp1.jpg


--------------------------------------------------------------------------------
/images_folder/feature_imp_2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/feature_imp_2.jpg


--------------------------------------------------------------------------------
/images_folder/feature_importance_galaxies.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/feature_importance_galaxies.jpg


--------------------------------------------------------------------------------
/images_folder/fig1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/fig1.jpg


--------------------------------------------------------------------------------
/images_folder/fig2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/fig2.jpg


--------------------------------------------------------------------------------
/images_folder/fraud_confusion_matrix_naive_abyes.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/fraud_confusion_matrix_naive_abyes.jpg


--------------------------------------------------------------------------------
/images_folder/fraud_feature_imp_naive_bayes.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/fraud_feature_imp_naive_bayes.jpg


--------------------------------------------------------------------------------
/images_folder/fraud_rate.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/fraud_rate.jpg


--------------------------------------------------------------------------------
/images_folder/fraud_rate_naive_bayes.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/fraud_rate_naive_bayes.jpg


--------------------------------------------------------------------------------
/images_folder/hybrid1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/hybrid1.jpg


--------------------------------------------------------------------------------
/images_folder/hybrid10.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/hybrid10.jpg


--------------------------------------------------------------------------------
/images_folder/hybrid11.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/hybrid11.jpg


--------------------------------------------------------------------------------
/images_folder/hybrid12.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/hybrid12.jpg


--------------------------------------------------------------------------------
/images_folder/hybrid13.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/hybrid13.jpg


--------------------------------------------------------------------------------
/images_folder/hybrid14.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/hybrid14.jpg


--------------------------------------------------------------------------------
/images_folder/hybrid15.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/hybrid15.jpg


--------------------------------------------------------------------------------
/images_folder/hybrid2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/hybrid2.jpg


--------------------------------------------------------------------------------
/images_folder/hybrid3.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/hybrid3.jpg


--------------------------------------------------------------------------------
/images_folder/hybrid4.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/hybrid4.jpg


--------------------------------------------------------------------------------
/images_folder/hybrid5.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/hybrid5.jpg


--------------------------------------------------------------------------------
/images_folder/hybrid6.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/hybrid6.jpg


--------------------------------------------------------------------------------
/images_folder/hybrid7.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/hybrid7.jpg


--------------------------------------------------------------------------------
/images_folder/hybrid8.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/hybrid8.jpg


--------------------------------------------------------------------------------
/images_folder/hybrid9.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/hybrid9.jpg


--------------------------------------------------------------------------------
/images_folder/hyp_test1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/hyp_test1.jpg


--------------------------------------------------------------------------------
/images_folder/hyp_testing2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/hyp_testing2.jpg


--------------------------------------------------------------------------------
/images_folder/linear_model2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/linear_model2.jpg


--------------------------------------------------------------------------------
/images_folder/linear_model3.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/linear_model3.jpg


--------------------------------------------------------------------------------
/images_folder/linear_model_coeff.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/linear_model_coeff.jpg


--------------------------------------------------------------------------------
/images_folder/linear_model_evidence_regression.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/linear_model_evidence_regression.jpg


--------------------------------------------------------------------------------
/images_folder/linear_model_gaussian_residuals.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/linear_model_gaussian_residuals.jpg


--------------------------------------------------------------------------------
/images_folder/linear_model_yearly_amount_spent_vs_time_website.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/linear_model_yearly_amount_spent_vs_time_website.jpg


--------------------------------------------------------------------------------
/images_folder/loss_function.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/loss_function.jpg


--------------------------------------------------------------------------------
/images_folder/model_test_dataset.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/model_test_dataset.jpg


--------------------------------------------------------------------------------
/images_folder/model_training_set.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/model_training_set.jpg


--------------------------------------------------------------------------------
/images_folder/pieplot_currencies.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/pieplot_currencies.png


--------------------------------------------------------------------------------
/images_folder/prophet_model.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/prophet_model.jpg


--------------------------------------------------------------------------------
/images_folder/prophet_model2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/prophet_model2.jpg


--------------------------------------------------------------------------------
/images_folder/prophets_results_reality.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/prophets_results_reality.jpg


--------------------------------------------------------------------------------
/images_folder/returns_new_model_train_set.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/returns_new_model_train_set.jpg


--------------------------------------------------------------------------------
/images_folder/returns_test_set_new_model.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/returns_test_set_new_model.jpg


--------------------------------------------------------------------------------
/images_folder/roc_fraud.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/roc_fraud.jpg


--------------------------------------------------------------------------------
/images_folder/sVC_score_grid_search.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/sVC_score_grid_search.jpg


--------------------------------------------------------------------------------
/images_folder/scatter_plot_volume_close_log.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/scatter_plot_volume_close_log.png


--------------------------------------------------------------------------------
/images_folder/sentiment_analysis_tweets2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/sentiment_analysis_tweets2.jpg


--------------------------------------------------------------------------------
/images_folder/training_test_dataset.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/training_test_dataset.jpg


--------------------------------------------------------------------------------
/images_folder/uber_dbscan.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/uber_dbscan.jpg


--------------------------------------------------------------------------------
/images_folder/uber_kmeans1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/uber_kmeans1.jpg


--------------------------------------------------------------------------------
/images_folder/uber_kmeans2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/uber_kmeans2.jpg


--------------------------------------------------------------------------------
/images_folder/uber_kmeans3.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/uber_kmeans3.jpg


--------------------------------------------------------------------------------
/images_folder/uber_kmeans4.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Andreoliveira85/My-data-machine-learning-portfolio/aa9d8e5774f46247c538cdfdb034ef5e95861b1e/images_folder/uber_kmeans4.jpg


--------------------------------------------------------------------------------