├── .gitignore ├── figures ├── CAM.png ├── NBSVM.png ├── SGDR.png ├── mask1.png ├── mask2.png ├── mask3.png ├── unet.png ├── vgg16.png ├── anchors.png ├── carvana.png ├── content.png ├── df_imdb.png ├── step_lr.png ├── Lr_levels.jpeg ├── NLP_model.png ├── artifacts.png ├── bootstrap.png ├── content_2.png ├── cos_lr_fix.png ├── cycle_gan.png ├── cyclic_lr.png ├── dendogram.png ├── exp_lr_fin.png ├── focal_loss.png ├── inception.png ├── lr_cyclic.png ├── lr_finder.png ├── matching.png ├── matplotlib.png ├── pdp_plot.png ├── r2_score.png ├── results_Fl.png ├── sampling.png ├── style_img.png ├── to_zero_lr.png ├── trigrams.png ├── wgan_algo.png ├── bag_of_words.png ├── decay_fix_lr.png ├── image_sample.png ├── lr_annealing.png ├── naive_bayes.png ├── nlp_batches.png ├── one_cycle_lr.png ├── output_noise.png ├── poly_lr_fix.png ├── resullts_sr.png ├── sample_lsun.png ├── ssd_and_yolo.png ├── style_result.png ├── tree_example.png ├── wgan_results.png ├── RNN_approaches.png ├── adaptative_lr.gif ├── attention_s2s.png ├── attention_viz.png ├── carvana_masks.png ├── classification.png ├── cycle_gan_loss.png ├── cycle_results.png ├── deconvolution.png ├── obj_detection.png ├── pdp_plot_2vars.png ├── perceptual_loss.png ├── pixel_shuffle.png ├── ploting_example.png ├── prediction_sr.png ├── results_anchors.png ├── rnn_8character.png ├── rnn_Ncharacter.png ├── rnn_character.png ├── saddle_points.png ├── style_transfer.png ├── t_distribution.png ├── triangular_lr.png ├── detection_results.png ├── examples_imagenet.png ├── imagenet_search1.png ├── imagenet_search2.png ├── imagenet_search3.png ├── imagenet_search4.png ├── imagenet_search5.png ├── obj_localization.png ├── one_hot_encoding.png ├── r2_random_forest.png ├── s2s_architecture.png ├── ssd_architecture.png ├── std_random_forest.png ├── varying_cyclic_lr.png ├── NLP_model_sentiment.png ├── cyclic_learning_rate.png ├── feature_importance.png ├── full_objective_gans.png ├── img_super_resolution.png ├── iterator_streaming.png ├── loss_with_cyclic_lr.png ├── neural_machine_BLUE.png ├── style_reconstruction.png ├── super_res_enhanced.png ├── tree_interpretation.png ├── check_board_artifacts.png ├── colaborative_filtering.png ├── result_style_transfer.png ├── colaborative_filtering_NN.png ├── prediction_extrapolation.png ├── calibrating_validation_set.png ├── colaborative_filtering_predict.png ├── colaborative_filtering_rossman.png ├── model_for_non_structured_data.png └── feature_importance_extrapolation.png ├── README.md ├── intro_to_ML ├── lecture7.md ├── lecture6.md ├── lecture8.md ├── lecture11.md ├── lecture9_10.md ├── lecture5.md ├── lecture3.md ├── lecture1.md ├── lecture4.md └── lecture2.md ├── parctical_DL ├── lecture1.md ├── lecture3.md ├── lecture5.md ├── lecture2.md └── lecture4.md └── cutting_edge_DL ├── lecture8.md └── lecture13.md /.gitignore: -------------------------------------------------------------------------------- 1 | data* 2 | fastai* -------------------------------------------------------------------------------- /figures/CAM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/CAM.png -------------------------------------------------------------------------------- /figures/NBSVM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/NBSVM.png -------------------------------------------------------------------------------- 
/figures/SGDR.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/SGDR.png -------------------------------------------------------------------------------- /figures/mask1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/mask1.png -------------------------------------------------------------------------------- /figures/mask2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/mask2.png -------------------------------------------------------------------------------- /figures/mask3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/mask3.png -------------------------------------------------------------------------------- /figures/unet.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/unet.png -------------------------------------------------------------------------------- /figures/vgg16.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/vgg16.png -------------------------------------------------------------------------------- /figures/anchors.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/anchors.png -------------------------------------------------------------------------------- /figures/carvana.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/carvana.png -------------------------------------------------------------------------------- /figures/content.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/content.png -------------------------------------------------------------------------------- /figures/df_imdb.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/df_imdb.png -------------------------------------------------------------------------------- /figures/step_lr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/step_lr.png -------------------------------------------------------------------------------- /figures/Lr_levels.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/Lr_levels.jpeg -------------------------------------------------------------------------------- /figures/NLP_model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/NLP_model.png -------------------------------------------------------------------------------- /figures/artifacts.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/artifacts.png -------------------------------------------------------------------------------- /figures/bootstrap.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/bootstrap.png -------------------------------------------------------------------------------- /figures/content_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/content_2.png -------------------------------------------------------------------------------- /figures/cos_lr_fix.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/cos_lr_fix.png -------------------------------------------------------------------------------- /figures/cycle_gan.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/cycle_gan.png -------------------------------------------------------------------------------- /figures/cyclic_lr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/cyclic_lr.png -------------------------------------------------------------------------------- /figures/dendogram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/dendogram.png -------------------------------------------------------------------------------- /figures/exp_lr_fin.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/exp_lr_fin.png -------------------------------------------------------------------------------- /figures/focal_loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/focal_loss.png -------------------------------------------------------------------------------- /figures/inception.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/inception.png -------------------------------------------------------------------------------- /figures/lr_cyclic.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/lr_cyclic.png -------------------------------------------------------------------------------- /figures/lr_finder.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/lr_finder.png -------------------------------------------------------------------------------- /figures/matching.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/matching.png -------------------------------------------------------------------------------- /figures/matplotlib.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/matplotlib.png -------------------------------------------------------------------------------- /figures/pdp_plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/pdp_plot.png -------------------------------------------------------------------------------- /figures/r2_score.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/r2_score.png -------------------------------------------------------------------------------- /figures/results_Fl.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/results_Fl.png -------------------------------------------------------------------------------- /figures/sampling.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/sampling.png -------------------------------------------------------------------------------- /figures/style_img.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/style_img.png -------------------------------------------------------------------------------- /figures/to_zero_lr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/to_zero_lr.png -------------------------------------------------------------------------------- /figures/trigrams.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/trigrams.png -------------------------------------------------------------------------------- /figures/wgan_algo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/wgan_algo.png -------------------------------------------------------------------------------- /figures/bag_of_words.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/bag_of_words.png -------------------------------------------------------------------------------- /figures/decay_fix_lr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/decay_fix_lr.png -------------------------------------------------------------------------------- /figures/image_sample.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/image_sample.png -------------------------------------------------------------------------------- /figures/lr_annealing.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/lr_annealing.png -------------------------------------------------------------------------------- /figures/naive_bayes.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/naive_bayes.png -------------------------------------------------------------------------------- /figures/nlp_batches.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/nlp_batches.png -------------------------------------------------------------------------------- /figures/one_cycle_lr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/one_cycle_lr.png -------------------------------------------------------------------------------- /figures/output_noise.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/output_noise.png -------------------------------------------------------------------------------- /figures/poly_lr_fix.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/poly_lr_fix.png -------------------------------------------------------------------------------- /figures/resullts_sr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/resullts_sr.png -------------------------------------------------------------------------------- /figures/sample_lsun.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/sample_lsun.png -------------------------------------------------------------------------------- /figures/ssd_and_yolo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/ssd_and_yolo.png -------------------------------------------------------------------------------- /figures/style_result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/style_result.png -------------------------------------------------------------------------------- /figures/tree_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/tree_example.png -------------------------------------------------------------------------------- /figures/wgan_results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/wgan_results.png -------------------------------------------------------------------------------- /figures/RNN_approaches.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/RNN_approaches.png -------------------------------------------------------------------------------- /figures/adaptative_lr.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/adaptative_lr.gif 
-------------------------------------------------------------------------------- /figures/attention_s2s.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/attention_s2s.png -------------------------------------------------------------------------------- /figures/attention_viz.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/attention_viz.png -------------------------------------------------------------------------------- /figures/carvana_masks.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/carvana_masks.png -------------------------------------------------------------------------------- /figures/classification.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/classification.png -------------------------------------------------------------------------------- /figures/cycle_gan_loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/cycle_gan_loss.png -------------------------------------------------------------------------------- /figures/cycle_results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/cycle_results.png -------------------------------------------------------------------------------- /figures/deconvolution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/deconvolution.png -------------------------------------------------------------------------------- /figures/obj_detection.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/obj_detection.png -------------------------------------------------------------------------------- /figures/pdp_plot_2vars.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/pdp_plot_2vars.png -------------------------------------------------------------------------------- /figures/perceptual_loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/perceptual_loss.png -------------------------------------------------------------------------------- /figures/pixel_shuffle.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/pixel_shuffle.png -------------------------------------------------------------------------------- /figures/ploting_example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/ploting_example.png -------------------------------------------------------------------------------- /figures/prediction_sr.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/prediction_sr.png -------------------------------------------------------------------------------- /figures/results_anchors.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/results_anchors.png -------------------------------------------------------------------------------- /figures/rnn_8character.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/rnn_8character.png -------------------------------------------------------------------------------- /figures/rnn_Ncharacter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/rnn_Ncharacter.png -------------------------------------------------------------------------------- /figures/rnn_character.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/rnn_character.png -------------------------------------------------------------------------------- /figures/saddle_points.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/saddle_points.png -------------------------------------------------------------------------------- /figures/style_transfer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/style_transfer.png -------------------------------------------------------------------------------- /figures/t_distribution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/t_distribution.png -------------------------------------------------------------------------------- /figures/triangular_lr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/triangular_lr.png -------------------------------------------------------------------------------- /figures/detection_results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/detection_results.png -------------------------------------------------------------------------------- /figures/examples_imagenet.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/examples_imagenet.png -------------------------------------------------------------------------------- /figures/imagenet_search1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/imagenet_search1.png -------------------------------------------------------------------------------- /figures/imagenet_search2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/imagenet_search2.png 
-------------------------------------------------------------------------------- /figures/imagenet_search3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/imagenet_search3.png -------------------------------------------------------------------------------- /figures/imagenet_search4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/imagenet_search4.png -------------------------------------------------------------------------------- /figures/imagenet_search5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/imagenet_search5.png -------------------------------------------------------------------------------- /figures/obj_localization.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/obj_localization.png -------------------------------------------------------------------------------- /figures/one_hot_encoding.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/one_hot_encoding.png -------------------------------------------------------------------------------- /figures/r2_random_forest.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/r2_random_forest.png -------------------------------------------------------------------------------- /figures/s2s_architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/s2s_architecture.png -------------------------------------------------------------------------------- /figures/ssd_architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/ssd_architecture.png -------------------------------------------------------------------------------- /figures/std_random_forest.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/std_random_forest.png -------------------------------------------------------------------------------- /figures/varying_cyclic_lr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/varying_cyclic_lr.png -------------------------------------------------------------------------------- /figures/NLP_model_sentiment.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/NLP_model_sentiment.png -------------------------------------------------------------------------------- /figures/cyclic_learning_rate.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/cyclic_learning_rate.png -------------------------------------------------------------------------------- /figures/feature_importance.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/feature_importance.png -------------------------------------------------------------------------------- /figures/full_objective_gans.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/full_objective_gans.png -------------------------------------------------------------------------------- /figures/img_super_resolution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/img_super_resolution.png -------------------------------------------------------------------------------- /figures/iterator_streaming.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/iterator_streaming.png -------------------------------------------------------------------------------- /figures/loss_with_cyclic_lr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/loss_with_cyclic_lr.png -------------------------------------------------------------------------------- /figures/neural_machine_BLUE.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/neural_machine_BLUE.png -------------------------------------------------------------------------------- /figures/style_reconstruction.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/style_reconstruction.png -------------------------------------------------------------------------------- /figures/super_res_enhanced.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/super_res_enhanced.png -------------------------------------------------------------------------------- /figures/tree_interpretation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/tree_interpretation.png -------------------------------------------------------------------------------- /figures/check_board_artifacts.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/check_board_artifacts.png -------------------------------------------------------------------------------- /figures/colaborative_filtering.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/colaborative_filtering.png -------------------------------------------------------------------------------- /figures/result_style_transfer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/result_style_transfer.png -------------------------------------------------------------------------------- /figures/colaborative_filtering_NN.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/colaborative_filtering_NN.png -------------------------------------------------------------------------------- /figures/prediction_extrapolation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/prediction_extrapolation.png -------------------------------------------------------------------------------- /figures/calibrating_validation_set.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/calibrating_validation_set.png -------------------------------------------------------------------------------- /figures/colaborative_filtering_predict.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/colaborative_filtering_predict.png -------------------------------------------------------------------------------- /figures/colaborative_filtering_rossman.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/colaborative_filtering_rossman.png -------------------------------------------------------------------------------- /figures/model_for_non_structured_data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/model_for_non_structured_data.png -------------------------------------------------------------------------------- /figures/feature_importance_extrapolation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yassouali/fast.ai_notes/HEAD/figures/feature_importance_extrapolation.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Notes for [fast.ai](https://www.fast.ai/) courses 2 | 3 | ## Introduction to Machine Learning for Coders: 4 | * Course [Link](https://course.fast.ai/ml) - [Github repo](https://github.com/fastai/fastai/tree/master/courses/ml1) - [Youtube playlist](https://www.youtube.com/playlist?list=PLdU_yAhDwVefmKny848HTdvNSbdN_eS-B) 5 | * Notes: [Lec 1](intro_to_ML/lecture1.md) - [Lec 2](intro_to_ML/lecture2.md) - [Lec 3](intro_to_ML/lecture3.md) - [Lec 4](intro_to_ML/lecture4.md) - [Lec 5](intro_to_ML/lecture5.md) - [Lec 6](intro_to_ML/lecture6.md) - [Lec 7](intro_to_ML/lecture7.md) - [Lec 8](intro_to_ML/lecture8.md) - [Lec 9/10](intro_to_ML/lecture9_10.md) - [Lec 11](intro_to_ML/lecture11.md). 6 | 7 | ## Practical Deep Learning for Coders (2018): 8 | * Course [Link](https://course.fast.ai/ml) - [Github repo](https://github.com/fastai/fastai/tree/master/courses/dl1) - [Youtube playlist]() 9 | * Notes: [Lec 1](parctical_DL/lecture1.md) - [Lec 2](parctical_DL/lecture2.md) - [Lec 3](parctical_DL/lecture3.md) - [Lec 4](parctical_DL/lecture4.md) - [Lec 5](parctical_DL/lecture5.md) - [Lec 6](parctical_DL/lecture6.md) - [Lec 7](parctical_DL/lecture7.md). 
10 | 11 | ## Cutting Edge Deep Learning for Coders (2018): 12 | * Course [Link](http://course.fast.ai/part2.html) - [Github repo](https://github.com/fastai/fastai/tree/master/courses/dl2) - [Youtube playlist]() 13 | * Notes: [Lec 8](cutting_edge_DL/lecture8.md) - [Lec 9](cutting_edge_DL/lecture9.md) - [Lec 10](cutting_edge_DL/lecture10.md) - [Lec 11](cutting_edge_DL/lecture11.md) - [Lec 12](cutting_edge_DL/lecture12.md) - [Lec 13](cutting_edge_DL/lecture13.md) - [Lec 14](cutting_edge_DL/lecture14.md). -------------------------------------------------------------------------------- /intro_to_ML/lecture7.md: -------------------------------------------------------------------------------- 1 | # Lecture 7 - Random Forest from Scratch 2 | 3 | ### t-distribution 4 | 5 | The t-distribution (aka Student's t-distribution) is a probability distribution that is used to estimate population parameters when the sample size is small and/or when the population variance is unknown. 6 | According to the central limit theorem, the sampling distribution of a statistic (like a sample mean) will follow a normal distribution, as long as the sample size is sufficiently large. Therefore, when we know the standard deviation of the population, we can compute a z-score and use the normal distribution to evaluate probabilities with the sample mean. 7 | 8 | But sample sizes are sometimes small, and often we do not know the standard deviation of the population. When either of these problems occurs, statisticians rely on the distribution of the t statistic (also known as the t score), whose values are given by: t = (x̄ - μ) / (s / sqrt(n)), where x̄ is the sample mean, μ the population mean, s the sample standard deviation, and n the sample size. 9 | 10 | The t-distribution approaches the normal distribution as the sample size increases, and the difference is negligible even for moderately large sample sizes (> 30). However, for small samples the difference is important. The t-distribution is used when the population variance is unknown: simply put, estimating the variance from the sample leads to greater uncertainty and a more spread out distribution, as can be seen in the t-distribution's heavier tails. 11 | 12 |

13 | 14 | ### Standard error 15 | The standard error (SE) of a statistic (usually an estimate of a parameter) is the standard deviation of its sampling distribution[1], or an estimate of that standard deviation. If the parameter or the statistic is the mean, it is called the standard error of the mean (SEM). 16 | 17 | 18 | ### Size of the validation set 19 | 20 | How big does our validation set need to be? First we need to answer the question: how precisely do we need to know the accuracy of the algorithm? For example, say we have a dogs vs. cats image classifier, the model we're looking at has about 99.4-99.5% accuracy on the validation set, and the validation set size is 2000. The number of incorrect predictions is then roughly (1 - accuracy) * n. 21 | 22 | So we were getting about 12 wrong, and since half the images are cats, the number of wrong cats is about 6. Then we run a new model and find that the accuracy has gone down to 99.2%. Does 99.4% vs. 99.2% matter? For cats and dogs, probably not; but if it were about finding fraud, the difference between a 0.6% error rate and a 0.8% error rate could be something like 25% of the cost of fraud, so it can be huge. 23 | 24 | -------------------------------------------------------------------------------- /intro_to_ML/lecture6.md: -------------------------------------------------------------------------------- 1 | 2 | - [Lecture 6 - Data Products](#Lecture-6---Data-Products) 3 | - [1. Drivetrain Approach Link](#1-a-nameDrivetrainApproachLinkhttpswwworeillycomideasdrivetrain-approach-data-productsaDrivetrain-Approach-Link) 4 | - [2. Review of Random forest interpretation](#2-a-nameReviewofRandomforestinterpretationaReview-of-Random-forest-interpretation) 5 | - [3. Counter Example: Extrapolation](#3-a-nameCounter-Example-ExtrapolationaCounter-Example-Extrapolation) 6 | 7 | 11 | 12 | 13 | # Lecture 6 - Data Products 14 | 15 | ## 1. Drivetrain Approach [Link](https://www.oreilly.com/ideas/drivetrain-approach-data-products) 16 | 17 | ## 2. Review of Random forest interpretation 18 | 19 | * Confidence based on tree variance: the variance of the predictions of the trees. Normally the prediction is just the average of the trees' predictions; here we also look at how much they vary. We generally take a single row/observation and find out how confident we are about it (i.e. how much variance there is across the trees), or we can do the same for different groups. 20 | * Feature importance: after we are done building a random forest model, we take each column and randomly shuffle it, run a prediction, and check the score. The feature importance for the variable we shuffled is the difference between the original score (e.g. RMSE) and the score obtained after shuffling. 21 | * Partial dependence: for a given column, we replace all the values with a single value, say for YearMade we replace all the values with 1960, and we predict the price; this is done for every row and for every value we're interested in, and we then take the average of all the plots. 22 | * Tree interpreter: like feature importance, but while feature importance is for the complete random forest model, the tree interpreter gives feature importance for a particular observation. Say it's about hospital readmission: if patient A is going to be readmitted to hospital, which features for that particular patient drive the prediction, and how could we change them? It is calculated by starting from the mean prediction and then seeing how each feature changes the prediction for that particular patient. 23 | 24 | ## 3.
Counter Example: Extrapolation 25 | 26 | We're going to see an example where random forests fail miserably when predicting a time series, where we would need to extrapolate to make them work. We start by creating a synthetic dataset: 27 | 28 | ```python 29 | import numpy as np 30 | x = np.linspace(0, 1) # by default, linspace returns 50 samples 31 | y = x + np.random.uniform(-0.2, 0.2, x.shape) # add some per-point noise 32 | ``` 33 | 34 |
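The figure below is just a scatter plot of this synthetic data; assuming matplotlib, something like the following reproduces it:

```python
import matplotlib.pyplot as plt

plt.scatter(x, y)   # x, y from the snippet above
plt.show()
```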

35 | 36 | Now, before we create a training and a validation set and fit a random forest model, given that the scikit-learn random forest only accepts a matrix as input for x, we must transform the x created by `np.linspace` from a rank-one vector (len(x.shape) == 1) into a 2-dimensional matrix/tensor. We can use numpy's reshape function, or plain indexing (a quick shape check is shown after the plotting code below): 37 | 38 | * `x = x[None]` adds a new first dimension, so x becomes a single row of shape (1, 50), 39 | * `x = x[:,None]` (equivalent to `x[...,None]` for a 1-D array) adds a new last dimension, so x becomes a single column of shape (50, 1), which is what scikit-learn expects. 40 | 41 | After this we can split the synthetic dataset into train / val: 42 | 43 | ```python 44 | x_trn, x_val = x[:40], x[40:] 45 | y_trn, y_val = y[:40], y[40:] 46 | m = RandomForestRegressor().fit(x_trn, y_trn) # RandomForestRegressor comes from sklearn.ensemble 47 | ``` 48 | 49 | We then use the fitted model to predict new values that were never seen before and that require predictions greater than any value in the training set. 50 | 51 | ```python 52 | plt.scatter(y_val, m.predict(x_val)) 53 | ``` 54 | 55 |
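As a quick sanity check of the indexing above (a sketch, not from the original notes), the shapes come out as follows; the plot of the model's predictions against the actuals is shown right after:

```python
import numpy as np

x = np.linspace(0, 1)        # rank-one vector
print(x.shape)               # (50,)
print(x[None].shape)         # (1, 50)  -> one row
print(x[:, None].shape)      # (50, 1)  -> one column (what scikit-learn expects)
print(x[..., None].shape)    # (50, 1)  -> same as x[:, None] for a 1-D array
```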

56 | 57 | And we see that the predictions are very different from the actuals, because given these inputs the tree directly predicts the maximum y value it has seen in the training set. One possible solution is to use gradient boosted trees (GBMs). What a GBM does is create a little tree and run everything through that first little tree; then it calculates the residuals, and the next little tree just predicts the residuals. So it is kind of like detrending the data; a GBM still can't extrapolate into the future, but at least it can deal with time-dependent data more conveniently. -------------------------------------------------------------------------------- /parctical_DL/lecture1.md: -------------------------------------------------------------------------------- 1 | 2 | * 1. [Jupyter Notebook shortcuts](#JupyterNotebookshortcuts) 3 | * 2. [Looking at the data](#Lookingatthedata) 4 | * 2.1. [Data transformations](#Datatransformations) 5 | * 2.2. [References:](#References:) 6 | 7 | 11 | 12 | 13 | 14 | # Lecture 1: Image recognition 15 | 16 | ## 1. Jupyter Notebook tips 17 | 18 | Here is a list of shortcuts to use on a daily basis: 19 | 20 | **General shortcuts** 21 | 22 | Note: the Jupyter help menu shows all the shortcuts as capital letters, but we should be using the lowercase letters: 23 | 24 | | Command | Result | 25 | |-----------------|:---------------------------------------------:| 26 | | h | Show help menu | 27 | | a | Insert cell above | 28 | | b | Insert cell below | 29 | | m | Change cell to markdown | 30 | | y | Change cell to code | 31 | 32 | **Code shortcuts** 33 | 34 | When we are editing Python code, there are a number of shortcuts to help speed up typing: 35 | 36 | | Command | Result | 37 | |-----------------|:---------------------------------------------:| 38 | | Tab | Code completion | 39 | | Shift+Tab | Shows method signature and docstring | 40 | | Shift+Tab (x2) | Opens documentation pop-up | 41 | | Shift+Tab (x3) | Opens permanent window with the documentation | 42 | 43 | **Notebook configuration** 44 | 45 | * Reload extensions: `%reload_ext autoreload`. Whenever a module we are using gets updated, the Jupyter/IPython environment will be able to reload it for us, depending on how we configure it. 46 | 47 | * Autoreload configuration: `%autoreload 2`. In combination with the previously mentioned extension, this setting indicates how we want to reload modules. We can configure it in three ways: 48 | 49 | * 0 - Disable automatic reloading; 50 | * 1 - Reload all modules imported with `%aimport` every time before executing the typed Python code; 51 | * 2 - Reload all modules (except those excluded by `%aimport`) every time before executing the typed Python code. 52 | 53 | * Matplotlib inline: `%matplotlib inline`. This option displays the output of plotting commands inline within the Jupyter notebook, directly below the code cell that produced it. The resulting plots will then also be stored in the notebook document. 54 | 55 | **Secure access via port forwarding** 56 | 57 | When running Jupyter Notebooks on our own server, we need a way to access them securely. 58 | 59 | ``` 60 | # ssh -L local_port:remote_host:remote_port remote_machine_ip 61 | $ ssh -L 8888:localhost:8888 my_vm_ip 62 | ``` 63 | 64 | This command maps our local port 8888 to localhost:8888 on the remote machine, allowing us to securely connect to the notebook by pointing the browser directly at the localhost address.
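To avoid retyping the tunnel flags every time, the forwarding can also be put in the SSH client configuration (a sketch assuming OpenSSH; the host alias, user and IP are placeholders):

```
# ~/.ssh/config
Host my-vm
    HostName my_vm_ip
    User my_username
    LocalForward 8888 localhost:8888
```

With this in place, running `ssh my-vm` opens the same tunnel as the command above.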
65 | 66 | To keep the session running and keep using our terminal without the connection being closed, we can use tmux / screen / byobu. 67 | 68 | ## 2. Looking at the data 69 | 70 | As well as looking at the overall metrics, it's also a good idea to look at examples of each of: 71 | 1. A few correct labels at random 72 | 2. A few incorrect labels at random 73 | 3. The most correct labels of each class (i.e. those with highest probability that are correct) 74 | 4. The most incorrect labels of each class (i.e. those with highest probability that are incorrect) 75 | 5. The most uncertain labels (i.e. those with probability closest to 0.5). Note that in the helpers below, `preds`, `probs`, `data` and `PATH` come from the notebook's trained model and dataset. 76 | 77 | ```python 78 | def rand_by_mask(mask): 79 | return np.random.choice(np.where(mask)[0], 4, replace=False) 80 | 81 | def rand_by_correct(is_correct): 82 | return rand_by_mask((preds == data.val_y)==is_correct) 83 | 84 | def load_img_id(ds, idx): return np.array(PIL.Image.open(PATH+ds.fnames[idx])) 85 | 86 | def plot_val_with_title(idxs, title): 87 | imgs = [load_img_id(data.val_ds,x) for x in idxs] 88 | title_probs = [probs[x] for x in idxs] 89 | print(title) 90 | return plots(imgs, rows=1, titles=title_probs, figsize=(16,8)) 91 | 92 | def plots(ims, figsize=(12,6), rows=1, titles=None): 93 | f = plt.figure(figsize=figsize) 94 | for i in range(len(ims)): 95 | sp = f.add_subplot(rows, len(ims)//rows, i+1) 96 | sp.axis('Off') 97 | if titles is not None: sp.set_title(titles[i], fontsize=16) 98 | plt.imshow(ims[i]) 99 | ``` 100 | ``` 101 | -> plot_val_with_title(rand_by_correct(True), "Correctly classified") 102 | -> plot_val_with_title(rand_by_correct(False), "Incorrectly classified") 103 | ``` 104 | 105 | 106 | ### 2.1. Data transformations 107 | 108 | These transformations can vary for each architecture but usually entail one or more of these: 109 | * resizing: each image gets resized to the input size the network expects; 110 | * normalizing: data values are rescaled to values between 0 and 1: y = (x - xmin) / (xmax - xmin); 111 | * standardizing: data values are rescaled to a distribution with a mean of 0 and a standard deviation of 1: y = (x - μ) / σ, where μ is the mean and σ is the standard deviation. 112 | 113 | 114 | ### 2.2. References: 115 | * [FastAi lecture 1 notes](https://www.zerotosingularity.com/blog/fast-ai-part-1-course-1-annotated-notes/) -------------------------------------------------------------------------------- /intro_to_ML/lecture8.md: -------------------------------------------------------------------------------- 1 | # Lecture 8 - Gradient Descent and Logistic Regression 2 | 3 | ## Normalization 4 | 5 | Normalizing data is subtracting out the mean and dividing by the standard deviation. 6 | 7 | Does it matter for random forests whether we normalize? Not really; the key is that when we are deciding where to split, all that matters is the order. If we subtract the mean and divide by the standard deviation, the values are still sorted in the same order. Random forests only care about the sort order of the independent variables, they don't care at all about their size. That's why they're wonderfully immune to outliers: they totally ignore the fact that something is an outlier and only care about which value is higher than another. This is an important concept, and it doesn't just appear in random forests; it occurs in some metrics as well. For example, area under the ROC curve completely ignores scale and only cares about sort order.
Also, dendrograms and Spearman's correlation only care about order, not about scale. So with random forests, one of the many wonderful things is that we can completely ignore a lot of these statistical distribution issues. But we can't for deep learning, because there we are trying to train a parameterized model, so we do need to normalize our data. If we don't, it's going to be much harder to create a network that trains effectively. 8 | 9 | So we grab the mean and the standard deviation of our training data, subtract out the mean, and divide by the standard deviation, which gives us a mean of zero and a standard deviation of one. 10 | 11 | ```python 12 | mean = x.mean() 13 | std = x.std() 14 | 15 | x = (x - mean) / std 16 | ``` 17 | Now for our validation data, we need to use the standard deviation and mean from the training data; we have to normalize it the same way. 18 | 19 | ## Looking at the data 20 | 21 | In any sort of data science work, it's important to look at our data, to make sure we understand the format, how it's stored, what type of values it holds, etc. To make it easier to work with, let's reshape it into 2d images from the flattened 1d format. 22 | 23 | Helper methods: 24 | ```python 25 | def show(img, title=None): 26 | plt.imshow(img, cmap="gray") 27 | if title is not None: plt.title(title) 28 | 29 | def plots(ims, figsize=(12,6), rows=2, titles=None): 30 | f = plt.figure(figsize=figsize) 31 | cols = len(ims)//rows 32 | for i in range(len(ims)): 33 | sp = f.add_subplot(rows, cols, i+1) 34 | sp.axis('Off') 35 | if titles is not None: sp.set_title(titles[i], fontsize=16) 36 | plt.imshow(ims[i], cmap='gray') 37 | ``` 38 | 39 | ### NN as universal approximators [Link](http://neuralnetworksanddeeplearning.com/chap4.html) 40 | 41 | ### Softmax and log softmax 42 | 43 | There are a number of advantages to using log-softmax over softmax, including practical reasons like improved numerical performance and gradient optimization. These advantages can be extremely important for implementation, especially when training a model is computationally challenging and expensive. At the heart of using log-softmax over softmax is the use of log probabilities over probabilities, which has nice information-theoretic interpretations. 44 | 45 | When used for classifiers, the log-softmax has the effect of heavily penalizing the model when it fails to predict the correct class. Whether or not that penalization works well for solving the problem is open to testing, so both log-softmax and softmax are worth trying. 46 | 47 | Gradient methods generally work better optimizing log p(x) than p(x) (note that -log p goes to infinity as p goes to zero, and to zero when p = 1, i.e. when the correct class gets probability one), because the gradient of log p(x) is generally more well-scaled. That is, it has a size that consistently and helpfully reflects the objective function's geometry, making it easier to select an appropriate step size and get to the optimum in fewer steps. 48 | 49 | Compare the gradient optimization process for p(x) = exp(−x^2) and f(x) = log p(x) = −x^2. At any point x, the gradient of f(x) is f′(x) = −2x. If we multiply that by 1/2, we get the exact step size needed to get to the global optimum at the origin, no matter what x is. This means that we don't have to work too hard to get a good step size (or "learning rate"). No matter where our initial point is, we just set our step to half the gradient and we'll be at the origin in one step.
And if we don't know the exact factor that is needed, we can just pick a step size around 1, do a bit of line search, and we'll find a great step size very quickly, one that works well no matter where x is. This property is robust to translation and scaling of f(x). While scaling f(x) will cause the optimal step scaling to differ from 1/2, at least the step scaling will be the same no matter what x is, so we only have to find one parameter to get an efficient gradient-based optimization scheme. 50 | 51 | In contrast, the gradient of p(x) has very poor global properties for optimization. We have p′(x) = f′(x) p(x) = −2x exp(−x^2). This multiplies the perfectly nice, well-behaved gradient −2x by a factor exp(−x^2) which decays (faster than) exponentially as x increases. At x = 5, we already have exp(−x^2) ≈ 1.4 × 10^−11, so a step along the gradient vector is about 10^−11 times too small. To get a reasonable step size toward the optimum, we'd have to scale the gradient by the reciprocal of that, an enormous constant ~10^11. Such a badly-scaled gradient is worse than useless for optimization purposes - we'd be better off just attempting a unit step in the uphill direction than setting our step by scaling against p′(x)! (In many variables p′(x) becomes a bit more useful since we at least get directional information from the gradient, but the scaling issue remains.) 52 | 53 | In general there is no guarantee that log p(x) will have gradient scaling properties as nice as in this toy example, especially when we have more than one variable. However, for pretty much any nontrivial problem, log p(x) is going to be way, way better than p(x). This is because the likelihood is a big product with a bunch of terms, and the log turns that product into a sum. Provided the terms in the likelihood are well-behaved from an optimization standpoint, their log is generally well-behaved, and the sum of well-behaved functions is well-behaved. By well-behaved we mean that f′′(x) doesn't change too much or too rapidly, leading to a nearly quadratic function that is easy to optimize by gradient methods. The derivative of a sum is the sum of the derivatives, no matter what the derivative's order, which helps to ensure that that big pile of sum terms has a very reasonable second derivative! -------------------------------------------------------------------------------- /intro_to_ML/lecture11.md: -------------------------------------------------------------------------------- 1 | 2 | * 1. [Trigram with NB features](#TrigramwithNBfeatures) 3 | * 2. [Adding NB feature](#AddingNBfeature) 4 | * 3. [Embeddings](#Embeddings) 5 | 6 | 10 | 11 | 12 | # Lecture 11 - Embeddings 13 | 14 | Our next model is a version of logistic regression with Naive Bayes features, described [here](https://www.aclweb.org/anthology/P12-2018). For every document we compute binarized features as described in the last lecture, but this time we use bigrams and trigrams too. Each feature is a log-count ratio. A logistic regression model is then trained to predict the sentiment. 15 | 16 | ## 1. Trigram with NB features 17 | 18 |

19 | 20 | We create our term matrix the exact same way as before, using tokenized words, but this time we also add bigrams and trigrams, and we limit the number of features to 800,000, because otherwise we would end up with a much larger vocabulary. 21 | 22 | ```python 23 | veczr = CountVectorizer(ngram_range=(1,3), tokenizer=tokenize, max_features=800000) 24 | trn_term_doc = veczr.fit_transform(trn) 25 | val_term_doc = veczr.transform(val) 26 | ``` 27 | 28 | This time around the size of our matrix is 25000 x 800000, and the vocabulary contains not just single words but also pairs and triplets: `['by vast', 'by vengeance', 'by vengeance .', 'by vera', 'by vera miles']`. 29 | 30 | ## 2. Adding NB feature 31 | 32 | The claims in the original paper: 33 | 34 | * Naive Bayes (NB) and Support Vector Machines (SVM) are widely used as baselines in text-related tasks, but their performance varies significantly across variants, features and datasets. 35 | * Word bigrams are useful for sentiment analysis, but not so much for topical text classification tasks. 36 | * NB does better than SVM for short snippet sentiment tasks, while SVM outperforms NB for longer documents. 37 | * An SVM variant using NB log-count ratios as feature values consistently performs well across tasks and datasets. 38 | 39 | NBSVM is an algorithm, originally designed for binary text/sentiment classification, which combines the Multinomial Naive Bayes (MNB) classifier with the Support Vector Machine (SVM). It does so through the element-wise multiplication of standard feature vectors by the positive-class/negative-class ratios of MNB log-counts. Such vectors are then used as inputs for the SVM classifier. 40 | 41 | In our case we'll use logistic regression instead of an SVM, and this time, instead of using the binarized features as inputs (x @ W + b), we use the ratio matrix (R_pos = p(f|C0)/p(f|C1) and R_neg = 1/R_pos) and run the logistic regression on the new inputs: (x.r) @ W + b. This, together with the bigrams and trigrams, gives us state-of-the-art results. 42 | 43 | ```python 44 | x_nb = x.multiply(r) 45 | m = LogisticRegression(dual=True, C=0.1) 46 | m.fit(x_nb, y) 47 | 48 | val_x_nb = val_x.multiply(r) 49 | preds = m.predict(val_x_nb) 50 | (preds.T==val_y).mean() 51 | ``` 52 | 53 |
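For reference, the ratio `r` used above is the Naive Bayes log-count ratio. A minimal sketch of how it can be computed, following the standard NBSVM formulation (the variable wiring is an assumption: `x` is the binarized term-document matrix and `y` the 0/1 labels):

```python
import numpy as np

def pr(x, y, y_i):
    # smoothed per-feature counts for documents of class y_i
    p = x[y == y_i].sum(0)
    return (p + 1) / ((y == y_i).sum() + 1)

r = np.log(pr(x, y, 1) / pr(x, y, 0))   # one log-count ratio per feature
```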

54 | 55 | After multiplying the term document matrix with r, everywhere a zero appears in the term document matrix, a zero appears in the multiplied version. And every time a one appears in the term document matrix, the equivalent value of r appears instead. So we haven’t really changed much. We’ve just kind of changed the ones into something else, i.e. the r value for that feature. So what we are now going to do is use this as our independent variables in our logistic regression instead. 56 | 57 | So in other words, this (x_nb @ w1) is equal to (x*r) @ w1. So we could just change the weights to be r*w1 and get the same number. So this ought to mean that the change that we made to the independent variable should not have made any difference because we can calculate exactly the same thing without making that change. So there’s the question. Why did it make a difference? It is due to regularization: with regularization we push the weights towards zero, but after multiplying the inputs by the ratios, which can be viewed as priors, the weights only need to become very big/very small when we have great confidence it will improve the results (i.e. when the loss term is large enough to justify it); in general, we don't want to diverge very much from the prior, which is the naive Bayes ratio. 58 | 59 | ## 3. Embeddings 60 | 61 | For categorical variables, we used to transform them into one-hot vectors and then feed them to our model, but those vectors are sparse and orthogonal to each other, so there is no notion of similarity between the entries, and their dimensionality equals the size of the vocabulary, making them quite big. A better approach is to learn a new vector representation for each entry in a much lower-dimensional and semantically richer space; this is done using a simple linear layer, where the weights are the new vectors we'd like to learn, and this matrix is of size (vocab_size x emb_size). We can implement the same model in PyTorch: 62 | 63 | ```python 64 | class DotProdNB(nn.Module): 65 | def __init__(self, nf, ny, w_adj=0.4, r_adj=10): 66 | super().__init__() 67 | self.w_adj, self.r_adj = w_adj, r_adj 68 | self.w = nn.Embedding(nf+1, 1, padding_idx=0) 69 | self.w.weight.data.uniform_(-0.1,0.1) 70 | self.r = nn.Embedding(nf+1, ny) 71 | 72 | def forward(self, feat_idx, feat_cnt, sz): 73 | w = self.w(feat_idx) 74 | r = self.r(feat_idx) 75 | x = ((w+self.w_adj)*r/self.r_adj).sum(1) 76 | return F.softmax(x, dim=-1) 77 | ``` 78 | 79 | In PyTorch, instead of using a very large matrix (25000 x 800000) of binary values and then multiplying it by the weights / ratios, we can store only the non-zero entries of the matrix (a sparse matrix) as indices, and use a lookup table which takes these indices and outputs the weights for w and the ratios for r; this lookup table in PyTorch is `nn.Embedding`. In general, `nn.Embedding` is a lookup table where the key is the word index and the value is the corresponding word vector. However, before using it we should specify the size of the lookup table, and initialize the word vectors ourselves. 
Here is an example: 80 | 81 | ```python 82 | # vocab_size is the number of words in the train, val and test set 83 | # vector_size is the dimension of the word vectors we are using 84 | embed = nn.Embedding(vocab_size, vector_size) 85 | 86 | # initialize the word vectors, pretrained_weights is a 87 | # numpy array of size (vocab_size, vector_size) and 88 | # pretrained_weights[i] retrieves the word vector of 89 | # i-th word in the vocabulary 90 | embed.weight.data.copy_(torch.from_numpy(pretrained_weights)) 91 | 92 | # Then turn the word indices into actual word vectors 93 | vocab = {"some": 0, "words": 1} 94 | word_indexes = [vocab[w] for w in ["some", "words"]] 95 | word_vectors = embed(torch.LongTensor(word_indexes)) 96 | ``` 97 | 98 | Here we're not only multiplying by r but also adding w_adj; this further helps the model by giving the weights a starting point of, say, 0.4. By doing this, even after regularization, every feature gets some form of minimum weight. It is not forced to keep it, though, because the model could still end up choosing a coefficient of -0.4 for a feature, which would say “you know what, even though Naive Bayes says r should be whatever for this feature, I think we should totally ignore it”. 99 | -------------------------------------------------------------------------------- /intro_to_ML/lecture9_10.md: -------------------------------------------------------------------------------- 1 | 2 | * 1. [Iterators & Streaming](#IteratorsStreaming) 3 | * 2. [Regularization](#Regularization) 4 | * 3. [NLP](#NLP) 5 | 6 | 10 | 11 | 12 | # Lecture 9/10 NLP, and Columnar Data 13 | 14 | ## 1. Iterators & Streaming 15 | 16 | It's an object that we call `next` on. We can pull different pieces of the data, ordered or randomized, as mini-batches. 17 | 18 |
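Conceptually (this is not the fastai implementation, just an illustration), an iterator that streams mini-batches can be as simple as:

```python
import numpy as np

def minibatch_iter(X, y, bs=16, shuffle=True):
    idxs = np.random.permutation(len(X)) if shuffle else np.arange(len(X))
    for i in range(0, len(X), bs):
        j = idxs[i:i + bs]
        yield X[j], y[j]              # one mini-batch at a time

it = minibatch_iter(np.arange(200).reshape(100, 2), np.arange(100))
xb, yb = next(it)                     # "an object that we call next on"
```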

19 | 20 | Pytorch is designed to handle this concept too, of predicting a stream of data (or the next item). fastai is also designed around this concept of iterators. DataLoader is built on the same concept, but adds multiprocessing so that batches can be prepared by several workers in parallel. 21 | 22 | 23 | ## 2. Regularization 24 | 25 | Add new terms to the loss function. If we add the L1 and L2 norms of the weights to the loss function, it incentivizes the coefficients to be close to zero. See the extra term below. 26 | 27 | $$ loss = \frac{1}{n} \sum{(wX-y)^2} + \alpha \sum{w^2} $$ 28 | 29 | How to implement? 30 | 31 | * Change the loss function 32 | * Change the training loop to add the derivative adjustment. 33 | 34 | At the beginning of training we might see that the error on the training set is smaller with regularization than without; this might be because the function we're optimizing is easier and smoother, and so takes fewer iterations to reduce the error. But in the end, with regularization, the training error is larger because we are no longer overfitting, and the error on the validation set becomes smaller. 35 | 36 | ## 3. NLP: Text classification 37 | 38 | For text classification (in this example the text is a review and the labels are 0 / 1, for negative and positive reviews), we'll first discard the order of the words, and only use their frequency in the corpus. This is done using a term-document matrix, which counts the number of appearances of each word (feature) in each document, so the size of this matrix is (# documents) x (# unique words); the unique words are obtained by tokenizing all the words in the corpus (*this "movie" isn't good.* becomes *this " movie " is n't good .*). First we'll only use unigrams. 39 | 40 |
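As a toy illustration (two made-up mini-reviews, not the IMDB data) of what a term-document matrix looks like:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this movie is good", "this movie is not good , not at all"]
veczr = CountVectorizer()
term_doc = veczr.fit_transform(docs)
print(veczr.vocabulary_)     # token -> column index
print(term_doc.toarray())    # one row of counts per document
```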

41 | 42 | So we start by creating the training and validation sets using the text files provided in the IMDB sentiment analysis dataset. We have two folders, one for training examples and one for validation/test; in each one we have neg and pos examples, so for the training examples we go through the negatives, add them to the texts and append zeros (idx) as their labels, and do the same for pos but with label 1. 43 | 44 | ```python 45 | def texts_labels_from_folders(path, folders = ['neg','pos']): 46 | texts, labels = [], [] 47 | for idx, label in enumerate(folders): 48 | for fname in glob(os.path.join(path, label, '*.*')): 49 | texts.append(open(fname, 'r').read()) 50 | labels.append(idx) 51 | return texts, np.array(labels) 52 | names = ['neg','pos'] trn,trn_y = texts_labels_from_folders(f'{PATH}train',names) 53 | val,val_y = texts_labels_from_folders(f'{PATH}test',names) 54 | ``` 55 | 56 | After that, we convert the collection of text documents to a matrix of token counts using `sklearn.feature_extraction.text.CountVectorizer`: 57 | 58 | ```python 59 | veczr = CountVectorizer(tokenizer=tokenize) 60 | trn_term_doc = veczr.fit_transform(trn) 61 | val_term_doc = veczr.transform(val) 62 | ``` 63 | 64 | `fit_transform(trn)` finds the vocabulary in the training set. It also transforms the training set into a term-document matrix. Since we have to apply the *same transformation* to the validation set, the second line uses just the method `transform(val)`. `trn_term_doc` and `val_term_doc` are sparse matrices. `trn_term_doc[i]` represents training document i and it contains a count of words for each document for each word in the vocabulary. 65 | 66 | Term Document Matrix: 67 | 68 | * columns - vocab or words 69 | * rows - different documents 70 | * cells - count of occurrences 71 | 72 | We'll obtain a 25000x75132 sparse matrix, i.e. 25000 documents and 75132 tokenized words in the training set; given that this is a sparse matrix and only a small fraction of the elements are non-zero, it is stored as pairs like ((1, 100) -> 2, meaning word 100 appears two times in document 1) to reduce the memory requirements. 73 | 74 | Sentiment approach: Bayes' rule 75 | 76 |
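Written out, the quantity we compare for each document d is the ratio

$$ \frac{P(C=1 \mid d)}{P(C=0 \mid d)} = \frac{P(d \mid C=1)\,P(C=1)}{P(d \mid C=0)\,P(C=0)} $$

and taking logs is what turns this into the `val_term_doc @ r.T + b` computation used below.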

77 | 78 | What we compare is the ratio of the probability that a document is from class 1 to the probability that it is from class 0. We will use conditional probabilities (Bayes' rule) to calculate this ratio as shown below 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | How do we calculate it? 87 | 88 | * We calculate p(C=0) and p(C=1) using the number of positive and negative examples in the dataset, in this case we have 12500/12500 so P = 0.5 for both classes. 89 | * Now we need to calculate P(d|C1) and P(d|C0), given that each document contains a set of features, and with the assumption that the features are independent, we only need to find P(f|C1) and P(f|C0) and multiply them: 90 | * P(f|C1) = (number of positive documents containing the feature + 1) / (number of positive documents + 1), and similarly for P(f|C0) with the negatives; we add one because we want to avoid a 0 probability, because there is always a small possibility that a word will appear in a given document. 91 | * And then we multiply them together for the features present in a given doc: P(dj|C0) = Πi P(fi|C0)^term_doc(dj, fi). 92 | * We multiply a number of probabilities together, so to avoid numerical instability we use the log of the expression: 93 | 94 | 95 | 96 | And the **log-count ratio** r for each word f: 97 | 98 | 99 | 100 | ```python 101 | def pr(y_i): 102 | p = x[y==y_i].sum(0) 103 | return (p+1) / ((y==y_i).sum()+1) 104 | x=trn_term_doc 105 | y=trn_y 106 | 107 | r = np.log(pr(1)/pr(0)) 108 | b = np.log((y==1).mean() / (y==0).mean()) 109 | 110 | pre_preds = val_term_doc @ r.T + b 111 | preds = pre_preds.T>0 112 | (preds==val_y).mean() 113 | ``` 114 | 115 | To get better results, we can binarize the document matrix (instead of keeping the frequency of each word, we only keep 0/1, i.e. whether the word exists in a given document or not) 116 | 117 | ```python 118 | x=trn_term_doc.sign() 119 | r = np.log(pr(1)/pr(0)) 120 | 121 | pre_preds = val_term_doc.sign() @ r.T + b 122 | preds = pre_preds.T>0 123 | (preds==val_y).mean() 124 | ``` 125 | And thus we get better results: the accuracy goes from 81% to 83%. 126 | 127 | If we look at the equation of naive Bayes, we see that it has the same form as logistic regression; the only difference is that in logistic regression we use the data to optimize the parameters, and this can give us better results: 128 | 129 | ```python 130 | m = LogisticRegression(C=0.1, dual=True) 131 | m.fit(trn_term_doc.sign(), y) 132 | preds = m.predict(val_term_doc.sign()) 133 | (preds==val_y).mean() 134 | ``` 135 | And we get better results (88%) -------------------------------------------------------------------------------- /parctical_DL/lecture3.md: -------------------------------------------------------------------------------- 1 | 2 | - [Lecture 3: Overfitting](#Lecture-3-Overfitting) 3 | - [1. Kaggle](#1-a-nameKaggleaKaggle) 4 | - [1.1. How to download data](#11-a-nameHowtodownloaddataaHow-to-download-data) 5 | - [1.2. Create Submission file for Kaggle](#12-a-nameCreateSubmissionfileforKaggleaCreate-Submission-file-for-Kaggle) 6 | - [2. Softmax loss](#2-a-nameSoftmaxlossaSoftmax-loss) 7 | - [3. Multilabel classification](#3-a-nameMultilabelclassificationaMultilabel-classification) 8 | - [4. Difference in log base for cross entropy calcuation](#4-a-nameDifferenceinlogbaseforcrossentropycalcuationaDifference-in-log-base-for-cross-entropy-calcuation) 9 | - [5. Structured and unstructured data](#5-a-nameStructuredandunstructureddataaStructured-and-unstructured-data) 10 | 11 | 15 | 16 | 17 | # Lecture 3: Overfitting 18 | 19 | ## 1. Kaggle 20 | 21 | ### 1.1. 
How to download data 22 | 23 | Kaggle CLI is a good tool to use when we are downloading from Kaggle. After installing Kaggle-CLI, we can download the datasets directly from the terminal. 24 | ``` 25 | kg download -u -p -c 26 | ``` 27 | A quite interesting addon is `CurlWget`, which gives us a curl command we can past to the terminal the use directly. 28 | 29 | ### 1.2. Create Submission file for Kaggle 30 | Let's take for example the dog vs cats kaggle competition, we first run our model on the test set and obtain by default the log probabilities using the fastai library, so we exponentiate them to get back the probabilities: 31 | ```python 32 | log_preds,y = learn.TTA(is_test=True) 33 | probs = np.exp(log_preds) 34 | ``` 35 | 36 | And we create a dataframe to store them, and write them to disk using the correct format: 37 | ```python 38 | df = pd.DataFrame(probs) 39 | df.columns = data.classes 40 | ``` 41 | One additional column we need to add, that is required by kaggle, is the ID of the images / examples in the test set. So we insert a new column at position zero named ‘id’ and remove first 5 and last 4 letters since we just need ids in this case. 42 | 43 | ```python 44 | df.insert(0,'id', [o[5:-4] for o in data.test_ds.fnames]) 45 | ``` 46 | And the results are: 47 | 48 | | | id | cat | dogs | 49 | |--|:---------:|:--------:|:--------:| 50 | |0 | /828 | 0.000005 | 0.999994 | 51 | |1 | /10093 | 0.979626 | 0.013680 | 52 | |2 | /2205 | 0.999987 | 0.000010 | 53 | |3 | /11812 | 0.000032 | 0.999559 | 54 | |4 | /4042 | 0.000090 | 0.999901 | 55 | 56 | After that we can write the dataframe to disk, to inspect it further and upload it to kaggle, we note that with large files compression is important to speedup work 57 | 58 | ```python 59 | SUBM = f'{PATH}sub/' 60 | os.makedirs(SUBM, exist_ok=True) 61 | df.to_csv(f'{SUBM}subm.gz', compression='gzip', index=False) 62 | ``` 63 | 64 | If we're working in a distant server, we can use `FileLink` to back a URL that we can use to download onto the computer. For submissions, or file checking etc. 65 | 66 | ```python 67 | FileLink(f'{SUBM}subm.gz') 68 | ``` 69 | 70 | ## 2. Softmax loss 71 | 72 | In a classification task, the goal is to learn a mapping h:X→Y, for either a: 73 | 74 | * Binary vs multiclass: In binary classification, |Y|=2 (e.g, a positive category, and a negative category). In Multiclass classifcation, |Y|=k for some k∈N. 75 | 76 | * Single-label vs multilabel: This refers to how many possible outcomes are possible for a single example x∈X. This refers to whether the chosen categories are mutually exclusive, or not. For example, if we are trying to predict the color of an object, then we're probably doing single label classification: a red object can not be a black object at the same time. On the other hand, if we're doing object detection in an image, then since one image can contain multiple objects in it, we're doing multi-label classification. 77 | 78 | ## 3. Multilabel classification 79 | 80 | Softmax is used only for multi-class cases, in which the inputs is one of M classes, for multilabel classification, where each output might belong to multiple classes, we can use sigmoid unit in the output, giving us and output of size `batch_size x number_classes`, just like softmax, the only difference is that time the ouputs are not normlized, but each element is between [0,1], so then use a simple binary cross with the labels. but we also can train a multilabel classifier with tanh & hinge, by just treating the targets to be in {-1,1}. 
Or even sigmoid & focal loss with {0, 1} targets for problems with, say, a 1:1000 class imbalance in training, when we want to reduce the effect of the frequent class and increase the loss of the infrequent ones: 81 | 82 | 83 | 84 | Here we add a new term, (1 - pt)^γ: the log loss is calculated using binary cross-entropy, and then the results are multiplied by this term; setting γ > 0 reduces the relative loss for well-classified examples (pt > 0.5), putting more focus on hard, misclassified examples. They also add a weighting factor alpha. 85 | 86 |
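For reference, the focal loss from the paper (Lin et al., "Focal Loss for Dense Object Detection"), in its α-balanced form, is

$$ FL(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \, \log(p_t) $$

which reduces to the ordinary binary cross-entropy when γ = 0 and α_t = 1.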

87 | 88 | ```python 89 | def Focal_Loss(y_true, y_pred, alpha=0.25, gamma=2): 90 | epsilon = 1e-8 91 | y_pred += epsilon 92 | 93 | cross_entropy = -y_true * np.log(y_pred) 94 | weight = np.power(1 - y_pred, gamma) * y_true 95 | flocal_loss = cross_entropy * weight * alpha 96 | 97 | reduce_fl = np.max(flocal_loss, axis=-1) 98 | return reduce_fl 99 | ``` 100 | 101 | ## 4. Difference in log base for cross entropy calcuation 102 | 103 | Log base e and log base 2 are only a constant factor off from each other : 104 | 105 | $$\log_{n}X = \frac{\log_{e} X} {\log_{e} n}$$ 106 | 107 | Therefore using one over the other scales the entropy by a constant factor. When using log base 2, the unit of entropy is bits, where as with natural log, the unit is nats. One isn't better than the other. It's kind of like the difference between using km/hour and m/s. It is possible that log base 2 is faster to compute than the logarithm. However, in practice, computing cross-entropy is pretty much never the most costly part of the algorithm, so it's not something to be overly concerned with. In practice the logarithm used is the natural logarithm (base e). 108 | 109 | ## 5. Structured and unstructured data 110 | 111 | Structured data is data that can be both syntactically and semantically described by a straightforward format description. Data in CSV files, XML files, JSON files, email headers, and to some extent HTML is structured, because once we have the format specifier for a particular file, we can easily identify specific values in the data and what they mean semantically. A whole lot of business-gathered data is in lists, tables, or other structured formats. 112 | 113 | Unstructured data is data that is in a more ambiguous format. It may (or may not) be in a well-defined syntactical format, such as a video or audio image, but its semantic contents are not obvious from its format. Natural-language text, audio and video files, etc are much harder to “mine” than structured data, so analyzing them requires a more statistical approach to figuring what the data means semantically. 114 | 115 | For structured data, to use it with neural networks, we need excesive data preprocessing, such as transforming all the features represented as strings as one hot vectors, and replace all the nan values with the mean and add a new binary column to specify if the nan values for each given feature, we can also add more features based on the dates and collected from 3rd parties. the data are then normelized and passed through the network. 116 | 117 | -------------------------------------------------------------------------------- /intro_to_ML/lecture5.md: -------------------------------------------------------------------------------- 1 | 2 | - [Lecture 5 - Extrapolation and RF from Scratch](#Lecture-5---Extrapolation-and-RF-from-Scratch) 3 | - [1. Model overfitting](#1-a-nameModeloverfittingaModel-overfitting) 4 | - [2. Cross-validation](#2-a-nameCross-validationaCross-validation) 5 | - [3. Tree interpreter Link](#3-a-nameTreeinterpreterLinkhttpblogdatadivenetinterpreting-random-forestsaTree-interpreter-Link) 6 | - [4. Extrapolation](#4-a-nameExtrapolationaExtrapolation) 7 | 8 | 12 | 13 | # Lecture 5 - Extrapolation and RF from Scratch 14 | 15 | ## 1. Model overfitting 16 | 17 | We should always split our dataset into, train / validation / test sets, and for random forest, we can even have 4 ways to check how our models is doing using OOB scores in addition to Train / Val / Test. 
Now what if we have temporal data, and the model we build on the dataset will be used in production on a more recent set of examples, which might have some characteristics not present in our dataset? If we construct the test and validation sets randomly, the performance of our model will certainly not be replicated on the real-world cases. One way to avoid this is to have a validation set containing examples more recent than the train set, and a test set containing examples which are more recent than the ones in the validation set. This way we might be able to approximate how well our model will do on the real data, and we can always use the OOB score to get a measurement on a random sample of the data and see if our model is overfitting in the statistical sense. 18 | 19 | ## 2. Cross-validation 20 | 21 | Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. 22 | 23 | The general procedure is as follows: 24 | 25 | 1. Shuffle the dataset randomly. 26 | 2. Split the dataset into k groups 27 | 3. For each unique group: 28 | 1. Take the group as a hold out or test data set 29 | 2. Take the remaining groups as a training data set 30 | 3. Fit a model on the training set and evaluate it on the test set 31 | 4. Retain the evaluation score and discard the model 32 | 4. Summarize the model evaluation scores (average over all the hold out validation sets). 33 | 34 | The possible problems with cross-validation are: 35 | 36 | * Time consumption: Training k models can take a lot of time and a lot of resources. 37 | * For some kinds of problems, like temporal data, doing a random shuffle gives validation sets that are not representative of the real world situation. 38 | 39 | ## 3. Tree interpreter [Link](http://blog.datadive.net/interpreting-random-forests/) 40 | 41 | When considering a decision tree, it is intuitively clear that for each decision that a tree (or a forest) makes there is a path (or paths) from the root of the tree to the leaf, consisting of a series of decisions, guarded by a particular feature, each of which contribute to the final predictions. 42 | 43 | A decision tree with M leaves divides the feature space into M regions Rm, 1 ≤ m ≤ M. In the classical definition, the prediction function of a tree is then defined as $$ f(x) = \sum_{m=1}^{M} c_m I(x \in R_m) $$ where M is the number of leaves in the tree (i.e. regions in the feature space), Rm is a region in the feature space (corresponding to leaf m), cm is a constant corresponding to region m and finally I is the indicator function (returning 1 if x ∈ Rm, 0 otherwise). The value of cm is determined in the training phase of the tree, which in case of regression trees corresponds to the mean of the response variables of the samples that belong to region Rm (or ratio(s) in case of a classification tree). The definition is concise and captures the meaning of a tree: the decision function returns the value at the correct leaf of the tree. But it ignores the “operational” side of the decision tree, namely the path through the decision nodes and the information that is available there. 44 | 45 | Since each decision is guarded by a feature, and the decision either adds or subtracts from the value given in the parent node, the prediction can be defined as the sum of the feature contributions + the “bias” (i.e. the mean given by the topmost region that covers the entire training set).
46 | 47 | Without writing out the full derivation, the prediction function can be written down as $$ f(x) = c_{full} + \sum_{k=1}^{K} contrib(x, k) $$ where K is the number of features, c_full is the value at the root node and *contrib(x,k)* is the contribution from the k-th feature in the feature vector x. This is superficially similar to linear regression (f(x)=a+bx). For linear regression the coefficients b are fixed, with a single constant for every feature that determines the contribution. For the decision tree, the contribution of each feature is not a single predetermined value, but depends on the rest of the feature vector which determines the decision path that traverses the tree and thus the guards/contributions that are passed along the way. 48 | 49 | To do this kind of interpretation: 50 | 51 | ```python 52 | from treeinterpreter import treeinterpreter as ti 53 | 54 | df_train, df_valid = split_vals(df_raw[df_keep.columns], n_trn) 55 | row = X_valid.values[None,0] 56 | prediction, bias, contributions = ti.predict(m, row) 57 | idxs = np.argsort(contributions[0]) 58 | [o for o in zip(df_keep.columns[idxs], df_valid.iloc[0][idxs], contributions[0][idxs])] 59 | ``` 60 | 61 | And now we can see how much each value of a given variable / feature contributed to the final prediction: ('age', 11, -0.12507089451852943), so if the age of the vehicle is high, the price goes down; and ('YearMade', 1999, 0.071829996446492878) increases the price over the mean, which probably means that a lot of the vehicles were made prior to 1999. 62 | 63 | Another way to show the contributions is to use waterfall charts: 64 | 65 |

66 | 67 | ## 4. Extrapolation 68 | 69 | One problem with random forests is extrapolation, given a new example not seen before it is hard for random forests to find a good prediction, and given some other model, say linear regression, we can extrapolate even if the range is outside the range of the training data, one possible way to solve this problem, is to remove the features that are very dependent on the strong temporal components that are irrelevent when taken out of context, thus they make no valuable contribution when used on new data. 70 | 71 | So in this case, to find these componenets, we can start by finding the difference between our validation set and our training set. If we understand the difference between our validation set and our training set, then we can find the predictors which have a strong temporal component and therefore they may be irrelevant by the time we get to the future time period. So we can create a RF to predict for a given example “is it in the validation set” (is_valid). So we add a new column, zeros for the examples in train set and ones for the examples in the validation set, and we run our model to find the most relevent features. 72 | 73 | ```python 74 | df_ext = df_keep.copy() 75 | df_ext['is_valid'] = 1 76 | df_ext.is_valid[:n_trn] = 0 77 | x, y, nas = proc_df(df_ext, 'is_valid') 78 | 79 | m = RandomForestClassifier(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True) 80 | m.fit(x, y) 81 | fi = rf_feat_importance(m, x) 82 | ``` 83 | 84 | And the results are: 85 | 86 |

87 | 88 | Now we can remove the most relevent features from the dataframe, and retrain our random forest, and we end up with a very good results. 89 | -------------------------------------------------------------------------------- /parctical_DL/lecture5.md: -------------------------------------------------------------------------------- 1 | 2 | * 1. [Collaborative filtering](#Collaborativefiltering) 3 | * 2. [Simple example in excel](#Simpleexampleinexcel) 4 | * 3. [Python version](#Pythonversion) 5 | * 4. [Improving the model](#Improvingthemodel) 6 | * 4.1. [Neural Net Version](#NeuralNetVersion) 7 | 8 | 12 | 13 | 14 | # Lecture 5: Collaborative filtering 15 | 16 | ## 1. Collaborative filtering 17 | 18 | Collaborative filtering, also referred to as social filtering, filters information by using the recommendations of other people. It is based on the idea that people who agreed in their evaluation of certain items in the past are likely to agree again in the future. A person who wants to see a movie for example, might ask for recommendations from friends. The recommendations of some friends who have similar interests are trusted more than recommendations from others. This information is used in the decision on which movie to see. 19 | 20 | Let’s look at collaborative filtering. We are going to look at Movielens dataset. The Movielens dataset is basically is a list of movie ratings by users. 21 | 22 | ``` 23 | !wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip 24 | ``` 25 | 26 | In MovieLens dataset, we have a set of users and movies, in the user ratings, each row is one rating given by a user to a movie specified by its ID, and in movies dataframe, each movie is specified by its title and its genres. 27 | 28 | ```python 29 | ratings = pd.read_csv(path+'ratings.csv') 30 | ratings.head() 31 | ``` 32 | 33 | | | userId | movieId | rating | timestamp | 34 | |--|-------------- |:-------------:| -------:| ------------:| 35 | |0 | 1 | 31 | 2.5 | 1260759144 | 36 | |1 | 1 | 1029 | 3.0 | 1260759179 | 37 | |2 | 1 | 1061 | 3.0 | 1260759182 | 38 | |3 | 1 | 1129 | 2.0 | 1260759185 | 39 | |4 | 1 | 1172 | 4.0 | 1260759205 | 40 | 41 | ```python 42 | movies = pd.read_csv(path+'movies.csv') 43 | movies.head() 44 | ``` 45 | 46 | | | movieId | title | rating | 47 | |--|-------------- |:-------------:| -------:| 48 | |0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 49 | |1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy | 50 | |2 | 3 | Grumpier Old Men (1995) | Adventure\|Children\|Fantasy | 51 | |3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance | 52 | |4 | 5 | Father of the Bride Part II (1995) | Comedy | 53 | 54 | ## 2. Simple example in excel 55 | 56 | We'll create a simple example for collaborative filtering in excel, first we'll begin by using a matrix factorization / decomposition instead of building a neural net, we have a matrix containing all the users (rows) and all the movies (columns), and each element of the matrix is the rating given by a certain user to the corresponding movie, if a given user did not rate the movie the element is blank. 
57 | 58 | And the objective is to learn two embedding matrices, one for the users (each user will characterized by a vector of size 5, that will reflect the user's preferences), and one embedding matrix for the movies, in which each movie will be represented by a vector of size 5 to reflect the genre and type of movie, and the objective is to use these reflect matrices to construct the rating matrix for each user/movies by a simple matrix multiply, and using SGD to adjust the weights of the two embedding matrices to be closer to the ground truth, we'll be able to learn good representation for each user and each movie (we predict 0 for the missing values=). 59 | 60 |

61 | 62 | * Blue cells, the actual rating 63 | * Purple cells, our predictions 64 | * Red cell, our loss function i.e. Root Mean Squared Error (RMSE) 65 | * Green cells, movie embeddings (randomly initialized) 66 | * Orange cells, user embeddings (randomly initialized) 67 | 68 | ## 3. Python version 69 | 70 | It is quite possible that user ID’s are not contiguous which makes it hard to use them as an index of embedding matrix. So we will start by creating indices that starts from zero and are contiguous and replace `ratings.userId` column with the index by using Panda’s apply function with an anonymous function `lambda` and do the same for `ratings.movieId`. 71 | 72 | ```python 73 | u_uniq = ratings.userId.unique() 74 | user2idx = {o:i for i,o in enumerate(u_uniq)} 75 | ratings.userId = ratings.userId.apply(lambda x: user2idx[x]) 76 | 77 | m_uniq = ratings.movieId.unique() 78 | movie2idx = {o:i for i,o in enumerate(m_uniq)} 79 | ratings.movieId = ratings.movieId.apply(lambda x: movie2idx[x]) 80 | 81 | n_users=int(ratings.userId.nunique()) n_movies=int(ratings.movieId.nunique()) 82 | ``` 83 | 84 | And then we can create our model, by first creating two embedding matrices, each one with outputs of 50 (n_factors), so each movie/user will be represented by a 50 dimensional vector, and then we initialize the weights of the embeddings with uniform distribution with zero mean and 0.05 std (close to K.He initialization). And when we get a given movie / user pairs, we extract the corresponding vector representation for each one in the embedding matrix, and then calculate their dot products and sum it to obtain a possible rating given by the user for the movie. 85 | 86 | ```python 87 | class EmbeddingDot(nn.Module): 88 | def __init__(self, n_users, n_movies): 89 | super().__init__() 90 | self.u = nn.Embedding(n_users, n_factors) 91 | self.m = nn.Embedding(n_movies, n_factors) 92 | self.u.weight.data.uniform_(0,0.05) 93 | self.m.weight.data.uniform_(0,0.05) 94 | 95 | def forward(self, cats, conts): 96 | users,movies = cats[:,0],cats[:,1] 97 | u,m = self.u(users),self.m(movies) 98 | return (u*m).sum(1) 99 | ``` 100 | 101 | And then we send our model to the GPU and choose an optimizer to train in. 102 | 103 | ```python 104 | model = EmbeddingDot(n_users, n_movies).cuda() 105 | opt = optim.SGD(model.parameters(), 1e-1, weight_decay=1e-5, momentum=0.9) 106 | ``` 107 | 108 | ## 4. Improving the model 109 | 110 | One possible way to improve the model is to add a bias, given that some movies have higher ratings than others, we can also specify the rage of the outputs / ratings by using a sigmoid to have the output in the range of [0,1], and then and the min rating and multiply by the range of the ratings. 
111 | 112 | ```python 113 | min_rating, max_rating = ratings.rating.min(), ratings.rating.max() 114 | 115 | def get_emb(ni,nf): 116 | e = nn.Embedding(ni, nf) 117 | e.weight.data.uniform_(-0.01,0.01) 118 | return e 119 | 120 | class EmbeddingDotBias(nn.Module): 121 | def __init__(self, n_users, n_movies): 122 | super().__init__() 123 | (self.u, self.m, self.ub, self.mb) = [get_emb(*o) for o in [ 124 | (n_users, n_factors), (n_movies, n_factors), (n_users,1), (n_movies,1) 125 | ]] 126 | 127 | def forward(self, cats, conts): 128 | users,movies = cats[:,0],cats[:,1] 129 | um = (self.u(users)* self.m(movies)).sum(1) 130 | res = um + self.ub(users).squeeze() + self.mb(movies).squeeze() 131 | res = F.sigmoid(res) * (max_rating-min_rating) + min_rating 132 | return res 133 | ``` 134 | 135 | and we can plot the predictions : 136 | 137 | ```python 138 | preds = learn.predict() 139 | y=learn.data.val_y[:,0] 140 | sns.jointplot(preds, y, kind='hex', stat_func=None) 141 | ``` 142 | 143 |

144 | 145 | ### 4.1. Neural Net Version 146 | 147 | One possible way to also tackle the problem of collaborative filtering is to use linear layers to treat both the movie and user representation to predict a given rating, first we create two embedding matrices for the users and movies just like before, the only difference this time, is that instead of multiplying them together, for given user and movies, we extract their corresponding embeddings, and we concatenate them together to form a vector of size `n_factors*2`, and feed this vector to a dropout, and then two linear layers, each one followed by a non linearity and a dropout: 148 | 149 | ```python 150 | class EmbeddingNet(nn.Module): 151 | def __init__(self, n_users, n_movies, nh=10, p1=0.5, p2=0.5): 152 | super().__init__() 153 | (self.u, self.m) = [get_emb(*o) for o in [ 154 | (n_users, n_factors), (n_movies, n_factors)]] 155 | self.lin1 = nn.Linear(n_factors*2, nh) 156 | self.lin2 = nn.Linear(nh, 1) 157 | self.drop1 = nn.Dropout(p1) 158 | self.drop2 = nn.Dropout(p2) 159 | 160 | def forward(self, cats, conts): 161 | users,movies = cats[:,0],cats[:,1] 162 | x = self.drop1(torch.cat([self.u(users),self.m(movies)], dim=1)) 163 | x = self.drop2(F.relu(self.lin1(x))) 164 | return F.sigmoid(self.lin2(x)) * (max_rating-min_rating+1) + min_rating-0.5 165 | ``` 166 | 167 |
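The course trains these models through fastai's learner; purely as an illustrative sketch, a bare PyTorch training loop for the `EmbeddingNet` above might look like this (`train_dl` is an assumed DataLoader yielding (categorical variables, continuous variables, rating) batches):

```python
import torch
import torch.nn.functional as F
from torch import optim

model = EmbeddingNet(n_users, n_movies).cuda()
opt = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

for epoch in range(3):
    for cats, conts, y in train_dl:            # assumed DataLoader
        cats, y = cats.cuda(), y.cuda().float()
        preds = model(cats, conts).squeeze()   # predicted ratings
        loss = F.mse_loss(preds, y)            # MSE against the actual ratings
        opt.zero_grad()
        loss.backward()
        opt.step()
```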

168 | -------------------------------------------------------------------------------- /intro_to_ML/lecture3.md: -------------------------------------------------------------------------------- 1 | # Lecture 3 - Performance, Validation and Model Interpretation 2 | 3 | 4 | - [Lecture 3 - Performance, Validation and Model Interpretation](#Lecture-3---Performance-Validation-and-Model-Interpretation) 5 | - [1. Importance of good validation set](#1-a-nameImportanceofgoodvalidationsetaImportance-of-good-validation-set) 6 | - [2. Interpreting machine learning models](#2-a-nameInterpretingmachinelearningmodelsaInterpreting-machine-learning-models) 7 | - [3. Confidence in the predictions](#3-a-nameConfidenceinthepredictionsaConfidence-in-the-predictions) 8 | - [4. Feature importance](#4-a-nameFeatureimportanceaFeature-importance) 9 | 10 | 14 | 15 | 16 | 17 | ## 1. Importance of good validation set 18 | 19 | If we do not have a good validation set, it is hard, if not impossible, to create a good model. If we are trying to predict next month’s sales and we build models. If we have no way of knowing whether the models we have built are good at predicting sales a month ahead of time, then we have no way of knowing whether it is actually going to be any good when we put the model in production. we need a validation set that we know is reliable at telling us whether the model is likely to work well when we put it in production or use it on the test set. 20 | 21 | Normally we should not use the test set for anything other than using it right at the end of the competition or right at the end of the project to find out how we did. But there is one thing we can use the test set for in addition, that is to **calibrate** the validation set. 22 | 23 |

24 | 25 | In Kaggle for example, we can have four different models and submit each of the four models to Kaggle to find out their scores. X-axis is the Kaggle scores, and y-axis is the score on a particular validation set, above we have to validation sets, one plot for each, If the validation set is good, then the relationship between the leaderboard score (i.e. the test set score) should lie in a straight line (perfectly correlated). Ideally, it will lie on the y = x line, then we can find out which val set is the best. In this case, The val set on the right looks like it is going to predict the Kaggle leaderboard score well. That is really cool because we can go away and try a hundred different types of models, feature engineering, weighting, tweaks, hyper parameters, whatever else, see how they do on the validation set, and not have to submit to Kaggle. So we will get a lot more iterations, a lot more feedback. This is not just true for Kaggle but every machine learning project, but in general, if the validation set is not showing nice fit line, we need to reconstruct it. 26 | 27 | ## 2. Interpreting machine learning models 28 | 29 | For model interpretation, there is no need to use the full dataset because we do not need a massively accurate random forest, we just need one which indicates the nature of relationships involved. So we just need to make sure the sample size is large enough that if we call the same interpretation commands multiple times, we do not get different results back each time. In practice, 50,000 is a good choice. 30 | 31 | ```python 32 | set_rf_samples(50000) 33 | m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True) 34 | m.fit(X_train, y_train) 35 | ``` 36 | 37 | ## 3. Confidence in the predictions 38 | 39 | We already know how to get the prediction. We take the average value in each leaf node in each tree after running a particular row through each tree. Normally, we do not just want a prediction, we also want to know how confident we are of that prediction. We would be less confident of a prediction if we have not seen many examples of rows like this one. In that case, we would not expect any of the trees to have a path through, which is designed to help us predict that row. So conceptually, we would expect then that as we pass this unusual row through different trees, it is going to end up in very different places. In other words, rather than just taking the mean of the predictions of the trees and saying that is our prediction, what if we took the standard deviation of the predictions of the trees. If the standard deviation is high, that means each tree is giving us a very different estimate of this row’s prediction. If this was a really common kind of row, the trees would have learned to make good predictions for it because it has seen a lot of similar rows to split based on them. So the standard deviation of the predictions across the trees gives us at least relative understanding of how confident we are of this prediction. 40 | 41 | ```python 42 | preds = np.stack([t.predict(X_valid) for t in m.estimators_]) 43 | np.mean(preds[:,0]), np.std(preds[:,0]) 44 | ``` 45 | 46 | One problem with the code above, is that we execute run each tree squentially on the validation set, and we end up using only one CPU for all the trees (n_estimators=40), it'll be better to run them in parallel and utilize the multi-core system we have. 
To do this we can utilize the Python class `ProcessPoolExecutor` that uses a pool of processes to execute calls asynchronously. 47 | 48 | ```python 49 | def parallel_trees(m, fn, n_jobs=8): 50 | return list(ProcessPoolExecutor(n_jobs).map(fn, m.estimators_)) 51 | 52 | def get_preds(t): 53 | return t.predict(X_valid) 54 | 55 | preds = np.stack(parallel_trees(m, get_preds)) 56 | np.mean(preds[:,0]), np.std(preds[:,0]) 57 | ``` 58 | 59 | For example let's select one of the variables, `ProductSize`, and count the number of examples in each category for this given variables, and find out the mean of the predictions and the std of the predictions in each one of these categories. 60 | 61 | ```python 62 | raw_valid.ProductSize.value_counts().plot.barh() 63 | flds = ['ProductSize', 'SalePrice', 'pred', 'pred_std'] 64 | summ = x[flds].groupby(flds[0]).mean() 65 | ``` 66 | 67 |

68 | 69 | And it is clear, the less examples we have, the larger the standard deviation is. 70 | 71 | ## 4. Feature importance 72 | 73 | We measure the importance of a feature by calculating the increase in the model’s prediction error after permuting the feature. A feature is “important” if shuffling its values increases the model error, because in this case the model relied on this feature for the predictions. A feature is “unimportant” if shuffling its values leaves the model error unchanged, because in this case the model ignored the feature for the prediction. The permutation feature importance measurement was introduced by Breiman (2001) for random forests. Based on this idea, Fisher, Rudin, and Dominici (2018) proposed a model-agnostic version of the feature importance and called it model reliance. They also introduced more advanced ideas about feature importance, for example a (model-specific) version that takes into account that many prediction models may predict the data well. Their paper is worth reading. 74 | 75 | The permutation feature importance algorithm based on Fisher, Rudin, and Dominici (2018): 76 | 77 | Input: Trained model f, feature matrix X, target vector y, error measure L(y,f). 78 | - Estimate the original model error e_orig = L(y, f(X)) (e.g. mean squared error) 79 | - For each feature j = 1,…,p do: 80 | - Generate feature matrix Xperm by permuting feature xj in the data X. This breaks the association between feature xj and true outcome y. 81 | - Estimate error e_perm = L(Y,f(X_perm)) based on the predictions of the permuted data. 82 | - Calculate permutation feature importance FI_j= e_perm/e_orig. Alternatively, the difference can be used: FI_j = e_perm - e_orig. 83 | - Sort features by descending FI. 84 | 85 | In our case, this is done using the already available output in the trained random forest `m`. 86 | 87 | ```python 88 | def rf_feat_importance(m, df): 89 | return pd.DataFrame({'cols':df.columns, 'imp':m.feature_importances_} 90 | ).sort_values('imp', ascending=False) 91 | 92 | def plot_fi(fi): 93 | return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False) 94 | 95 | plot_fi(fi[:30]) 96 | ``` 97 | 98 | And by displying the bar plot for 30 most important features in our model we get: 99 | 100 |

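The permutation procedure described above can also be sketched directly (a minimal, model-agnostic version; `rf_feat_importance` instead reads sklearn's built-in `feature_importances_`):

```python
import numpy as np

def permutation_importance(m, X, y, metric):
    base = metric(y, m.predict(X))             # baseline error of the fitted model
    imps = {}
    for col in X.columns:
        X_perm = X.copy()
        X_perm[col] = np.random.permutation(X_perm[col].values)  # break the feature/target link
        imps[col] = metric(y, m.predict(X_perm)) - base          # increase in error
    return sorted(imps.items(), key=lambda kv: kv[1], reverse=True)
```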
101 | 102 | We can also delete all the other columns / features, and only maintain the ten most important ones, and retrain our model to see if the predicition score might stay the same. 103 | 104 | ```python 105 | m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True) 106 | m.fit(X_train, y_train) 107 | print_score(m) 108 | ``` 109 | ``` 110 | rmse train rmse val rmse train R2 val oob_score 111 | [0.20685390156773095, 0.24454842802383558, 0.91015213846294174, 0.89319840835270514, 0.8942078920004991] 112 | ``` 113 | 114 | And the score are very similar, they are even slightly better (0.893 > 0.889), maybe beacuse the trees now can find better cuts and are able to avoid usuing redundant variables along the way. 115 | -------------------------------------------------------------------------------- /intro_to_ML/lecture1.md: -------------------------------------------------------------------------------- 1 | 2 | Preliminaries 3 | 4 | * 1. [Some notes about Google Colab](#SomenotesaboutGoogleColab) 5 | * 2. [Jupyter tricks](#Jupytertricks) 6 | * 3. [Python tricks](#Pythontricks) 7 | 8 | Lesson 1 - Introduction to Random Forests 9 | * 4. [Thecurse of dimensionality?](#Thecurseofdimensionality) 10 | * 5. [No free lunch theorem](#Nofreelunchtheorem) 11 | * 6. [Preprocessing](#Preprocessing) 12 | * 7. [Side note; Skitlearn](#SidenoteSkitlearn) 13 | 14 | 18 | 19 | 20 | # Preliminaries 21 | 22 | ## 1. Some notes about Google Colab 23 | 24 | - To install some packages, we can use either `pip` or `apt`: 25 | ```python 26 | !pip install -q matplotlib-venn 27 | from matplotlib_venn import venn2 28 | venn2(subsets = (3, 2, 1)) 29 | ``` 30 | ```shell 31 | !apt update && apt install -y libfluidsynth1 32 | ``` 33 | Installing FastAi in Google Colab (example only) [Link](https://gist.githubusercontent.com/gilrosenthal/58e9b4f9d562d000d07d7cf0e5dbd840/raw/343a29f22692011088fe428b0f800c77ccad3951/Fast.ai%2520install%2520script) 34 | For Machine learn course, we need to install the version 0.7 of fastAi, forum response [Link](https://forums.fast.ai/t/importerror-cannot-import-name-as-tensor/25295/3) 35 | 36 | To delete some packages, we can do the same thing 37 | ```shell 38 | !pip uninstall `fastai` 39 | ``` 40 | 41 | - To add some data to google colab, first we upload it to the Files (we can view the Files angle in the left angle, by clicking a small grey arrow) 42 | After adding the files, we can then import it into the notebook: 43 | ````python 44 | # Loadd the Drive helper and mount 45 | from google.colab import drive 46 | # This will prompt for authorization 47 | drive.mount('/content/name_of_the_folder') 48 | # The files will be present in the "/cotent/name_of_the_folder/My Drive" 49 | §ls "/content/name_of_the_folder/ My Drive" 50 | ```` 51 | 52 | * To download the data using Kaggle API, after installing it and exporting the API Keys to the terminal, we download it directly: 53 | ````shell 54 | !mkdir -p ~/.kaggle 55 | !cp kaggle.json ~/.kaggle/ 56 | kaggle competitions download -c bluebook-for-bulldozers 57 | ```` 58 | 59 | 60 | ## 2. Jupyter tricks 61 | 62 | - In data sience (unlike software engineering), prototyping is very important, and jupyter notebook helps a lot, for example, given a function `display`: 63 | 64 | 1. type `display` in a cell and press shift+enter, it will tell we where it came from `` 65 | 2. type `?display` in a cell and press shift+enter, it will show we the documentation 66 | 3. 
type `??display` in a cell and press shift+enter, it will show we the source code. This is particularly useful for fastai library because most of the functions are easy to read and no longer than 5 lines long. 67 | 68 | - `shift + tab` in Jupyter Notebook will bring up the inspection of the parameters of a function, if we hit `shift + tab` twice it'll tell us even more about the function parameters (part of the docs) 69 | - If we put %time, it will tell us how much times it took to execute the line in question. 70 | - If we run a line of code and it takes quite a long time, we can put %prun in front. `%prun m.fit(x, y)`. This will run a profiler and tells us which lines of code took the most time, and maybe try to change our preprocessing to help speed things up. 71 | 72 | ## 3. Python tricks 73 | 74 | 1- `getattr` looks inside of an object and finds an attribute with that name, for example to modify the date (datetime format in Pandas), we use it to add more colums in the dataframe depending on the name (Year, Week, Month, Is_quarter_end...), these data attribues are present in field_name.dt (Pandas splits out different methods inside attributes that are specific to what they are. So date time objects will be in `dt`). 75 | 76 | 2- PEP 8 - Style Guide for Python Code [PEP 8](https://www.python.org/dev/peps/pep-0008/). 77 | 78 | 3- Reading Data: In case we using large csv files (millions of columns), to be able to load them into the RAM by limiting the amount of space that it takes up when we read in, we create a dictionary for each column name to the data type of that column. 79 | ```python 80 | types = {'id': 'int64', 81 | 'item_nbr': 'int32', 82 | 'store_nbr': 'int8', 83 | 'unit_sales': 'float32', 84 | 'onpromotion': 'object'} 85 | 86 | %%time 87 | df_all = pd.read_csv(f'{PATH}train.csv', parse_dates=['date'], dtype=types, infer_datetime_format=True) 88 | ``` 89 | 90 | But generally it is better to use only a subset of the data to explore it, and not loading it all at one go, one possible way to do so is to use the UNIX command `shuf`, we can get a random sample of data at the command prompt and then we can just read that. This also is a good way to find out what data types to use, we read in a random sample and let Pandas figure it out for us. ```shuf -n 5 -o output.csv train.csv``` 91 | 92 | ___ 93 | 94 | # Lecture 1 - Introduction to Random Forests 95 | 96 | Now for the data, we'll use Kaggle data, and there is a trick to download it using cURL from the terminal, by capturing the GET resquest using the inspect element / Network of Firefox browser: 97 | 98 | * Press `ctrl + shift + i` to open web developer tool. Go to Network tab, click on Download button, and cancel out of the dialog box. It shows network connections that were initiated. We then right-click on it and select Copy as cURL. Paste the command and add `-o bulldozer.zip` at the end (possibly remove `— — 2.0` in the cURL command), or use the Kaggle API 99 | 100 | ## 4. Thecurse of dimensionality? 101 | The more dimension of the data / features we have, it creates a space that is more and more empty, the more dimensions we have, the more all of the points sit on the edge of that space. If we just have a single dimension where things are random, then they are spread out all over. But if it is a square then the probability that they are in the middle means that they cannot be on the edge of either dimension so it is a little less likely that they are not on the edge. 
Each dimension we add, it becomes multiplicatively less likely that the point is not on the edge of at least one dimension, so in high dimensions, everything sits on the edge. What that means in theory is that the distance between points is much less meaningful. But this turns out not to be the case for number of reasons, like they still do have different distances away from each other. Just because they are on the edge, they still vary on how far away they are from each other and their is some similarities between each point even in these high dimensions. 102 | 103 | ## 5. No free lunch theorem 104 | There is a claim that there is no type of model that works well for any kind of dataset. In the mathematical sense, any random dataset by definition is random, so there is not going to be some way of looking at every possible random dataset that is in someway more useful than any other approach. In real world, we look at data which is not random. Mathematically we would say it sits on some lower dimensional manifold. It was created by some kind of causal structure. 105 | 106 | 107 | ## 6. Preprocessing 108 | Generally, the provided dataset contains a mix of continuous and categorical variables. 109 | 110 | * **continuous**: numbers where the meaning is numeric such as price. 111 | * **categorical**: either numbers where the meaning is not continuous like zip code or string such as “large”, “medium”, “small” 112 | 113 | One of the most interesting types of features are dates, we can do a lot of feature engineering based on them: It really depends on what we are doing. If we are predicting soda sales in San Fransisco, we would probably want to know if there was a San Francisco Giants ball game that day. In this case, such feature engineering is crucial for getting better performances given that no machine learning algorithm can tell we whether the Giants were playing that day and that it was important. 114 | 115 | So before feeding the data to say a random forest, we must first process the dates and also convert all the strings, which are inefficient, and doesn't provide the numeric coding required for a random forest, to a categorical representation for a more rich representation. 116 | 117 | For example: 118 | - First, we can change any columns of strings in a panda's dataframe to a column of catagorical values. 119 | ```python 120 | for n,c in df.items(): 121 | if is_string_dtype(c): df[n] = c.astype('category').cat.as_ordered() 122 | ``` 123 | - Pandas then will automatically creates a list of categories (in `df.column.cat.categories`, and their decimal codes in `df.column.cat.codes`) 124 | 125 | - An additional pre-processing step, is processing missig values (NULL), so for each column, if there is NULL values (`pd.isnull(col).sum` > 0), we create a new column with the same name and null added in the end, and the NULL calues have 1, and the other are zeros (`pd.isnull(col)`), now for the original column, we replace the NULL values by the median of the column `df[name] = col.fillna(col.median())`. This is only done for numerical columns, pandas automaticlly handels categorical data (converted strings in our case), but the NULL in this case equals -1, so we add one to all the columns (`if not is_numerical_dtype(col)`) 126 | 127 | **Feather format**: Reading CSV takes about 10 seconds, and processing takes another 10 seconds, so if we do not want to wait again, it is a good idea to save them. 
Here we will save it in a feather format where we save the preprocessed data directly to disk in exactly the same basic format that it is in RAM. This is by far the fastest way to save something, and also to read it back. Feather format is becoming standard in not only Pandas but in Java, Apache Spark, etc. 128 | 129 | ## 7. Side note; Skitlearn 130 | 131 | Everything in scikit-learn has the same form and takes the same steps, to be used for random forest (either a regressor for predicting continuous variables `RandomForestRegressor `, or a classifier for predicting categorifcal variables `RandomForestClassifier `) 132 | 133 | ```python 134 | m = RandomForestRegressor(n_jobs=-1) 135 | m.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice) 136 | ``` 137 | 138 | And steps are: 139 | 1. Create an instance of an object for the machine learning model, 140 | 2. Call `fit` by passing in the independent variables, giving the columns / features, and the variable to predict, and finnally computes some scores to evaluate the performances of the model. 141 | -------------------------------------------------------------------------------- /intro_to_ML/lecture4.md: -------------------------------------------------------------------------------- 1 | 2 | - [Lecture 4 - Feature Importance, Tree Interpreter](#Lecture-4---Feature-Importance-Tree-Interpreter) 3 | - [1. Hyper parameters of interest in Random Forests](#1-a-nameHyperparametersofinterestinRandomForestsaHyper-parameters-of-interest-in-Random-Forests) 4 | - [1.1. `set_rf_samples`](#11-a-namesetrfsamplesasetrfsamples) 5 | - [1.2. `min_samples_leaf`](#12-a-nameminsamplesleafaminsamplesleaf) 6 | - [1.3. `max_features `](#13-a-namemaxfeaturesamaxfeatures) 7 | - [2. Model Interpretation](#2-a-nameModelInterpretationaModel-Interpretation) 8 | - [2.1. One hot encoding](#21-a-nameOnehotencodingaOne-hot-encoding) 9 | - [2.2. Removing redundant variables](#22-a-nameRemovingredundantvariablesaRemoving-redundant-variables) 10 | - [2.3. Partial dependence](#23-a-namePartialdependenceaPartial-dependence) 11 | 12 | 16 | 17 | # Lecture 4 - Feature Importance, Tree Interpreter 18 | 19 | ## 1. Hyper parameters of interest in Random Forests 20 | 21 | ### 1.1. `set_rf_samples` 22 | 23 |

24 | 25 | Determines how many rows are in each tree. So before we start a new tree, we either bootstrap a sample (i.e. sampling with replacement from the whole dataset) or we pull out a subsample of a smaller number of rows and then we build a tree from there. 26 | 27 | First we have our whole dataset, we grab a few rows at random from it, and we turn them into a smaller dataset. From that, we build a tree. 28 | 29 |
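Conceptually (this is not the fastai implementation, just an illustration of the two sampling schemes with made-up sizes):

```python
import numpy as np

n_rows = 400_000      # pretend size of the full training set
sample_size = 20_000

# classic bagging: n_rows indices drawn with replacement
bootstrap_idx = np.random.choice(n_rows, size=n_rows, replace=True)

# what set_rf_samples(20000) approximates: a smaller subsample without replacement
subsample_idx = np.random.choice(n_rows, size=sample_size, replace=False)

# each tree is then grown only on the rows picked for it,
# e.g. X_train.iloc[subsample_idx], y_train[subsample_idx]
```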

30 | 31 | Assuming that the tree remains balanced as we grow it, and assuming we are growing it until every leaf is of size one, the tree will be log2(20K) layers deep. Same thing for the number of leaf nodes, we'll have 20K leaf nodes, one for each examples. We have a linear relationship between the number of leaf nodes and the size of the sample. So when we decrease the sample size, there are less final decisions that can be made. Therefore, the tree is going to be less rich in terms of what it can predict because it is making less different individual decisions and it also is making less binary choices to get to those decisions. 32 | 33 | Setting RF samples lower is going to mean that we overfit less, but it also means that we are going to have a less accurate individual tree model. We are trying to do two things when we build a model with bagging. (1) each individual tree/estimator is as accurate as possible (so each model is a strong predictive model). (2) the correlation between them must be as low as possible so that when we average them out together, we end up with something that generalizes. By decreasing the `set_rf_samples`, we are actually decreasing the pothe of the estimator and increasing the correlation 34 | 35 | ### 1.2. `min_samples_leaf` 36 | 37 | Before, we assumed that `min_samples_leaf=1`, if it is set to 2, the new depth of the tree is log2(20K/2) = log2(20K) - 1. Each time we double the `min_samples_leaf` , we are removing one layer from the tree, and halving the number of leaf nodes (i.e. 10k). The result of increasing `min_samples_leaf` is that now each of our leaf nodes contains more than one example, so we are going to get a more stable average that we are calculating in each tree. We have a little less depth (i.e. we have less decisions to make) and we have a smaller number of leaf nodes. So again, we would expect the result of that node would be that each estimator would be less predictive, but the estimators would be also less correlated. So this might help us avoid overfitting. 38 | 39 | ### 1.3. `max_features ` 40 | 41 | At each split, it will randomly sample columns (as opposed to `set_rf_samples` where we pick a subset of rows for each tree). With `max_features=0.5`, at each split, we’d pick a different subset of the features, here we pick half of them. The reason we do that is because we want the trees to be as rich as possible. Particularly, if we were only doing a small number of trees (e.g. 10 trees) and we picked the same column set all the way through the tree, we are not really getting much variety in what kind of things it can find. So this way, at least in theory, seems to be something which is going to give us a better set of trees by picking a different random subset of features at every decision point. 42 | 43 | ## 2. Model Interpretation 44 | 45 | ### 2.1. One hot encoding 46 | 47 | For categorical variables represented as codes (number in the range of the cardinality of the variables / number of categories), it might take a number of splits for the tree to get the important category, say the category we're interested in is situated in position 4, and we hae 6 categories, so we'll need to split to set the desired one, and this increases and the computation needed and the complexity of the tree. 
One solution is to replace the single column of categories with a number of columns depending on the cardinality of the feature: say for 6 categories we'll have 6 one hot encoded columns to replace it. This might help the tree, but most importantly, we can use it to find important features that were hidden prior to this. 48 | 49 | In code this is done using the function `proc_df`, which automatically replaces all the columns with a cardinality less than `max_n_cat` with one hot columns (a toy sketch of this expansion follows the figure below). 50 | 51 | ```python 52 | df_trn2, y_trn, nas = proc_df(df_raw, 'SalePrice', max_n_cat=7) 53 | X_train, X_valid = split_vals(df_trn2, n_trn) 54 | ``` 55 | 56 | Interestingly, it turns out that before, the feature importance said that enclosure was somewhat important. When we one hot encode it, it actually says `Enclosure_EROPS w AC` is the most important thing. So for at least the purpose of interpreting the model, we should always try one hot encoding quite a few of the variables. We can try making that number as high as we can so that it doesn’t take forever to compute and the feature importance doesn’t include really tiny levels that aren’t interesting. 57 | 58 |

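As a rough illustration of what this expansion looks like, here is a toy sketch (the column values are made up, and this is not the exact `proc_df` implementation, just the same idea via `pd.get_dummies`):

```python
import pandas as pd

# Toy categorical column with a handful of levels (values are illustrative only)
toy = pd.DataFrame({'Enclosure': ['EROPS', 'OROPS', 'EROPS w AC', 'OROPS', 'EROPS']})

# Each level becomes its own binary indicator column, e.g. 'Enclosure_EROPS w AC',
# which lets a tree split directly on a single interesting category in one step
one_hot = pd.get_dummies(toy, columns=['Enclosure'])
print(one_hot.columns.tolist())
# ['Enclosure_EROPS', 'Enclosure_EROPS w AC', 'Enclosure_OROPS']
```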
59 | 60 | ### 2.2. Removing redundant variables 61 | 62 | We’ve already seen how variables which are basically measuring the same thing can confuse our variable importance. They can also make our random forest slightly less good because it requires more computation to do the same thing and there are more columns to check. So we are going to do some more work to try and remove redundant features using a “dendrogram”, which is a kind of hierarchical clustering. 63 | 64 | **Side notes:** 65 | Correlation coefficients are used in statistics to measure how strong a relationship is between two variables. There are several types of correlation coefficient: 66 | 67 | - Population correlation coefficient: $r_{xy} = \frac{\sigma_{xy}}{\sigma_{x} \sigma_{y}}$ 68 | - Rank correlation is the correlation between the ranks of the variables, and not their values. 69 | - Spearman’s correlation coefficient is a statistical measure of the strength of a monotonic relationship between paired data. In a sample it is denoted by $r_s$ and is by design constrained between -1 and 1. Its interpretation is similar to that of Pearson’s, e.g. the closer $r_s$ is to $\pm 1$, the stronger the monotonic relationship. Correlation is an effect size, so we can verbally describe the strength of the correlation using a standard guide for the absolute value of $r_s$. 70 | 71 | After applying the hierarchical clustering to the data frame, and drawing the following dendrogram (a sketch of the code that produces it follows the figure): 72 | 73 |

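A sketch of how such a dendrogram can be produced, using Spearman (rank) correlation and scipy's hierarchical clustering (this reuses `df_keep` from the notebook; the plotting parameters are just reasonable defaults):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import spearmanr
from scipy.spatial.distance import squareform
from scipy.cluster import hierarchy as hc

# Rank (Spearman) correlation between every pair of columns
corr = np.round(spearmanr(df_keep).correlation, 4)
# Turn the similarity matrix into a condensed distance matrix, then cluster it
corr_condensed = squareform(1 - corr)
z = hc.linkage(corr_condensed, method='average')
plt.figure(figsize=(16, 10))
hc.dendrogram(z, labels=df_keep.columns, orientation='left', leaf_font_size=16)
plt.show()
```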
74 | 75 | We see that a number of variables are strongly correlated. For example, `saleYear` and `saleElapsed` are measuring basically the same thing (at least in terms of rank), which is not surprising because `saleElapsed` is the number of days since the first day in the dataset, so obviously these two are nearly entirely correlated. 76 | 77 | One thing we can do to confirm that these variables really are measuring the same thing is to remove one of them and see whether the model's performance stays the same or not. 78 | 79 | ```python 80 | def get_oob(df): 81 | m = RandomForestRegressor(n_estimators=30, min_samples_leaf=5, 82 | max_features=0.6, n_jobs=-1, oob_score=True) 83 | x, _ = split_vals(df, n_trn) 84 | m.fit(x, y_train) 85 | return m.oob_score_ 86 | 87 | for c in ('saleYear', 'saleElapsed', 'fiModelDesc', 'fiBaseModel', 88 | 'Grouser_Tracks', 'Coupler_System'): 89 | print(c, get_oob(df_keep.drop(c, axis=1))) 90 | ``` 91 | 92 | And we see that the results (saleYear 0.889037446375 saleElapsed 0.886210803445 ...) are very similar to the baseline (0.889). 93 | 94 | ### 2.3. Partial dependence 95 | 96 | After plotting the values of `YearMade`, we see that a lot of them were made in year 1000, which means that they were made before a certain year and we didn't track the year, so we can just grab things that are more recent, let's say the ones made after 1930, and use `ggplot`, which was originally an R package (GG stands for the Grammar of Graphics). The grammar of graphics is a very powerful way of thinking about how to produce charts in a very flexible way. We use it to plot a relationship between two features, which would otherwise give a very dense scatter plot. 97 | 98 | ```python 99 | ggplot(x_all, aes('YearMade', 'SalePrice'))+stat_smooth(se=True, method='loess') 100 | ``` 101 | 102 | Using only 500 points from the data frame, we plot `YearMade` against `SalePrice`. `aes` stands for “aesthetic”, we add a standard error (se=True) to show the confidence interval of this smoother, and the dark line is the result of using `loess`, which is a locally weighted regression. 103 | 104 | We see a dip in price, which might be misleading: we could come to the conclusion that the bulldozers made in the period 1990 - 1997 are not good and should not be sold, but there might be other factors in play, like a recession. One alternative is to replace the column `YearMade` with a given year, say starting from 1960 up to 2000, and for each year, we predict the prices using our trained Random Forest for each point in our data frame (in this example 500), so we'll end up with 500 predictions for each year (the blue lines), and by taking their average we can see that indeed the average price goes up when `YearMade` increases; this is done using `pdp` - partial dependence plots. 105 | 106 | ```python 107 | def plot_pdp(feat, clusters=None, feat_name=None): 108 | feat_name = feat_name or feat 109 | p = pdp.pdp_isolate(m, x, feat) 110 | return pdp.pdp_plot(p, feat_name, plot_lines=True, 111 | cluster=clusters is not None, 112 | n_cluster_centers=clusters) 113 | 114 | plot_pdp('YearMade') 115 | 116 |

117 | 118 | This partial dependence plot is something which uses a random forest to get us a clearer interpretation of what’s going on in our data. The steps were: 119 | 120 | * First of all, look at the feature importance to tell us which features we think we care about. 121 | * Then use the partial dependence plot to tell us what’s going on on average. 122 | 123 | We can also use `pdp` to find the price based on two variables (say `YearMade` and `saleElapsed`), or we can cluster the 500 predictions (blue lines) to obtain the most important ones (a sketch of both follows the figure below). 124 | 125 |

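A sketch of both of these, reusing `m`, `x` and the `plot_pdp` helper defined above (the `pdp_interact` call matches the older pdpbox API used alongside `pdp_isolate` here; newer releases changed the signatures):

```python
# Partial dependence on two variables at once
feats = ['saleElapsed', 'YearMade']
p = pdp.pdp_interact(m, x, feats)
pdp.pdp_interact_plot(p, feats)

# Cluster the 500 individual prediction curves (the blue lines) into 5 representative ones
plot_pdp('YearMade', clusters=5)
```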
126 | -------------------------------------------------------------------------------- /parctical_DL/lecture2.md: -------------------------------------------------------------------------------- 1 | 2 | - [Lecture 2: CNNs](#Lecture-2-CNNs) 3 | - [1. Finding an optimal learning rate](#1-a-nameFindinganoptimallearningrateaFinding-an-optimal-learning-rate) 4 | - [2. learning rate annealing](#2-a-namelearningrateannealingalearning-rate-annealing) 5 | - [3. stochastic gradient descent with restarts (SGDR)](#3-a-namestochasticgradientdescentwithrestartsSGDRastochastic-gradient-descent-with-restarts-SGDR) 6 | - [4. Learning rate and fine tuning (differential learning rate annealing)](#4-a-nameLearningrateandfinetuningdifferentiallearningrateannealingaLearning-rate-and-fine-tuning-differential-learning-rate-annealing) 7 | - [5. Cycling Learning Rate](#5-a-nameCyclingLearningRateaCycling-Learning-Rate) 8 | - [5.1. One LR for all parameters](#51-a-nameOneLRforallparametersaOne-LR-for-all-parameters) 9 | - [5.2. Adaptive LR for each parameter](#52-a-nameAdaptiveLRforeachparameteraAdaptive-LR-for-each-parameter) 10 | - [5.3. Cycling Learning Rate](#53-a-nameCyclingLearningRate-1aCycling-Learning-Rate) 11 | - [5.4. Why it works ?](#54-a-nameWhyitworksaWhy-it-works) 12 | - [5.5. Epoch, iterations, cycles and stepsize](#55-a-nameEpochiterationscyclesandstepsizeaEpoch-iterations-cycles-and-stepsize) 13 | - [6. How to terain a state of the art classifier :](#6-a-nameHowtoterainastateoftheartclassifieraHow-to-terain-a-state-of-the-art-classifier) 14 | - [7. References:](#7-a-nameReferencesaReferences) 15 | 16 | 20 | 21 | # Lecture 2: CNNs 22 | 23 | ## 1. Finding an optimal learning rate 24 | 25 | How do we select the “right” learning rate to start training our model? This question is a he heavily debated one in deep learning and fast.ai offers a solution based on a paper from Leslie Smith - Cyclical Learning Rates for training Neural Networks. 26 | 27 |

28 | 29 | The idea of the paper is quite simple: 30 | 31 | * start with a small learning rate and calculate the loss; 32 | * gradually start increasing the learning rate and each time, calculate the loss; 33 | * once the loss starts to shoot up again, it is time to stop; 34 | * we select the highest learning rate we can find where the loss is still clearly improving (steepest decrease of the loss); a rough code sketch of this procedure follows the figure below. 35 | 36 |

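A minimal sketch of this procedure in plain PyTorch, with a toy model and random data standing in for the real learner (fastai's `lr_find` does essentially this, plus smoothing of the recorded losses):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real model and data
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
criterion = nn.MSELoss()
X, y = torch.randn(2048, 10), torch.randn(2048, 1)

lr, lr_max, factor = 1e-7, 10.0, 2.0              # start tiny, double the LR every mini-batch
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
history, best_loss = [], float('inf')

for i in range(0, len(X), 64):
    xb, yb = X[i:i+64], y[i:i+64]
    optimizer.param_groups[0]['lr'] = lr          # learning rate used for this mini-batch
    loss = criterion(model(xb), yb)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    history.append((lr, loss.item()))
    best_loss = min(best_loss, loss.item())
    if loss.item() > 4 * best_loss or lr > lr_max:  # stop once the loss shoots up
        break
    lr *= factor

# Plot the recorded (lr, loss) pairs on a log x-axis and pick an lr where the loss
# is still falling steeply, somewhat before the minimum of the curve.
```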
37 | 38 | In the figure above, we see that when we increase the learning rate beyond a certain threshold, the loss goes up; this is when we stop our training and choose the lr that gave us the steepest decrease of the loss. 39 | 40 | ## 2. learning rate annealing 41 | 42 | In training deep networks, it is usually helpful to anneal the learning rate over time. A good intuition to have in mind is that with a high learning rate, the system contains too much kinetic energy and the parameter vector bounces around chaotically, unable to settle down into deeper, but narrow parts of the loss function. Knowing when to decay the learning rate can be tricky: decay it slowly and we'll be wasting computation, taking too much time to converge with little improvement for a long time; decay it too aggressively and the system will cool too quickly, unable to reach the best position it can. There are three common ways of implementing learning rate decay: 43 | 44 | * **Step decay:** Reduce the learning rate by some factor every few epochs. Typical values might be reducing the learning rate by a half every 5 epochs, or by 0.1 every 20 epochs. These numbers depend heavily on the type of problem and the model. One heuristic we see in practice is to watch the validation error while training with a fixed learning rate, and reduce the learning rate by a constant (e.g. 0.5) whenever the validation error stops improving. 45 | * **Exponential decay** has the mathematical form $\alpha = \alpha_0 e^{-kt}$, where $\alpha_0$ and $k$ are hyperparameters and $t$ is the iteration number (or the epoch). 46 | * **1/t decay** has the mathematical form $\alpha = \alpha_0 / (1 + kt)$, where $\alpha_0$ and $k$ are hyperparameters and $t$ is the iteration number. A short sketch of these three schedules follows the figure below. 47 | 48 |

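The three schedules written out as simple functions (a sketch; the constants are arbitrary):

```python
import numpy as np

def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=5):
    # halve the learning rate every `epochs_per_drop` epochs
    return lr0 * drop ** (epoch // epochs_per_drop)

def exp_decay(lr0, t, k=0.1):
    # alpha = alpha_0 * exp(-k * t)
    return lr0 * np.exp(-k * t)

def one_over_t_decay(lr0, t, k=0.1):
    # alpha = alpha_0 / (1 + k * t)
    return lr0 / (1 + k * t)

epochs = np.arange(50)
schedules = {f.__name__: [f(1e-2, t) for t in epochs]
             for f in (step_decay, exp_decay, one_over_t_decay)}
```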
49 | 50 | In practice, we find that the step decay is slightly preferable because the hyperparameters it involves (the fraction of decay and the step timings in units of epochs) are more interpretable than the hyperparameter k. Lastly, if we can afford the computational budget, err on the side of slower decay and train for a longer time. 51 | 52 | ## 3. stochastic gradient descent with restarts (SGDR) 53 | 54 | With lr annealing, we may find ourselves in a part of the weight space that isn’t very resilient, that is, small changes to the weights may result in big changes to the loss. We want to encourage our model to find parts of the weight space that are both accurate and stable. Therefore, from time to time we increase the learning rate (this is the ‘restarts’ in ‘SGDR’), which will force the model to jump to a different part of the weight space if the current area is “spiky”. 55 | 56 | 57 | Here’s a picture of how that might look if we reset the learning rates 3 times (in this paper they call it a “cyclic LR schedule”): 58 | 59 |

60 | 61 | From the paper [Snapshot Ensembles](https://arxiv.org/abs/1704.00109) 62 | 63 | If we plot the learning rate while using “cyclic LR schedule” we get (the value used in the starting point is the one obtained using learning rate finder: 64 | 65 |

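A sketch of the schedule itself: cosine annealing from the learning rate found with the LR finder down towards zero, restarting at the beginning of each cycle (the iteration counts below are illustrative):

```python
import math

def sgdr_lr(t, lr_max=1e-2, lr_min=0.0, cycle_iters=100, cycle_mult=1):
    """Learning rate at global iteration t: cosine annealing with warm restarts."""
    length = cycle_iters
    while t >= length:               # find the cycle that iteration t falls into
        t -= length
        length *= cycle_mult
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / length))

# epochs=3, cycle_len=1 (with, say, 100 iterations per epoch): three identical cycles
lrs = [sgdr_lr(t, cycle_iters=100, cycle_mult=1) for t in range(300)]
# setting cycle_mult=2 instead gives the 1 + 2 + 4 epoch pattern shown further below
```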
66 | 67 | The number of epochs between resetting the learning rate is set by cycle length, and the number of times this happens is referred to as the number of cycles. so for epochs=3 and cycle_len=1, we'll do three epochs, one with each cycle. 68 | 69 | We can also vary the length of each cycle, where we start with small cycles and the cycle length get multiplied each time. 70 | 71 |

72 | 73 | In the figure above, we do three cycles; the original length of the cycle is one epoch, and given that we multiply the length by two each time, we'll get 1 + 2 + 4 = 7 epochs in the end. 74 | 75 | Intuitively speaking, if the cycle length is too short, it starts going down to find a good spot, then pops out, and goes down trying to find a good spot and pops out, and never actually gets to find a good spot. Earlier on, we want it to do that because it is trying to find a spot that is smoother, but later on, we want it to do more exploring. That is why cycle_mult=2 seems to be a good approach. 76 | 77 | ## 4. Learning rate and fine tuning (differential learning rate annealing) 78 | 79 | When using a pretrained model, we can freeze the weights of all the previous layers but the last one, so we only combine the learned features that the model is capable of outputting and use them to classify our inputs. 80 | 81 | Now if we want to fine tune the earlier layers to be more specific to our dataset, and only detect the features that will actually help us, we can use different learning rates for each layer. 82 | 83 | In other words, for the cats/dogs dataset, the previous layers have *already* been trained to recognize imagenet photos (whereas our final layers were randomly initialized), so we want to be careful not to destroy the carefully tuned weights that are already there. 84 | 85 | Generally speaking, the earlier layers have more general-purpose features. Therefore we would expect them to need less fine-tuning for new datasets. For this reason we will use different learning rates for different layers: the first few layers will be at say 1e-4, the middle layers at 1e-3, and our FC layers we'll leave at 1e-2, which is the same learning rate we got with the learning rate finder: `lr=np.array([1e-4,1e-3,1e-2])` (a sketch of how such per-layer learning rates can be set up in plain PyTorch follows the figure below). 86 | 87 | And we can see that this kind of approach helps us find better local minima with smaller losses. 88 | 89 |

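A sketch of how such per-group learning rates can be set up in plain PyTorch parameter groups (fastai does something equivalent internally when given an array of learning rates; the tiny three-part model is only for illustration):

```python
import torch
import torch.nn as nn

# Toy network split into the three groups: early layers, middle layers, head
early  = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
middle = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
head   = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 2))

# One parameter group per section, each with its own learning rate
optimizer = torch.optim.SGD([
    {'params': early.parameters(),  'lr': 1e-4},
    {'params': middle.parameters(), 'lr': 1e-3},
    {'params': head.parameters(),   'lr': 1e-2},
], momentum=0.9)
```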
90 | 91 | ## 5. Cycling Learning Rate 92 | 93 | ### 5.1. One LR for all parameters 94 | 95 | Typically seen in SGD, a single LR is set at the beginning of the training, and an LR decay strategy is set (step, exponential etc.). This single LR is used to update all parameters. It is gradually decayed with each epoch with the assumption that with time, we reach near to the desired minima, upon which we need to slow down the updates so as not to overshoot it. 96 | 97 |

) 98 | 99 | There are many challenges to this approach: 100 | 101 | * Choosing an initial LR can be difficult to set in advance (as depicted in above figure). 102 | * Setting an LR schedule (LR update mechanism to decay it over time) is also difficult to be set in advance. They do not adapt to dynamics in data. 103 | * The same LR gets applied to all parameters which might be learning at different rates. 104 | * It is very hard to get out of a saddle point. 105 | 106 | ### 5.2. Adaptive LR for each parameter 107 | Improved optimizers like AdaGrad, AdaDelta, RMSprop and Adam alleviate much of the above challenges by adapting learning rates for each parameters being trained. With Adadelta, we do not even need to set a default learning rate, as it has been eliminated from the update rule. 108 |

) 109 | 110 | ### 5.3. Cycling Learning Rate 111 | 112 | CLR was proposed by Leslie Smith in 2015. It is an approach to LR adjustments where the value is cycled between a lower bound and upper bound. By nature, it is seen as a competitor to the adaptive LR approaches and hence used mostly with SGD. But it is possible to use it along with the improved optimizers (mentioned above) with per parameter updates. 113 | 114 | CLR is computationally cheaper than the optimizers mentioned above. As the paper says: 115 | *Adaptive learning rates are fundamentally different from CLR policies, and CLR can be combined with adaptive learning rates, as shown in Section 4.1. In addition, CLR policies are computationally simpler than adaptive learning rates. CLR is likely most similar to the SGDR method that appeared recently.* 116 | 117 | ### 5.4. Why it works ? 118 | As far as intuition goes, conventional wisdom says we have to keep decreasing the LR as training progresses so that we converge with time. 119 | 120 | However, counterintuitively it might be useful to periodically vary the LR between a lower and higher threshold. The reasoning is that the periodic higher learning rates within the training help the model come out of any local minimas or saddle points if it ever enters into one, given that the difficulty in minimizing the loss arises from saddle points rather than poor local minima. If the saddle point happens to be an elaborate plateau, lower learning rates can never generate enough gradient to come out of it (or will take enormous time). That’s where periodic higher learning rates help with more rapid traversal of the surface. 121 | 122 |

) 123 | 124 | A second benefit is that the optimal LR appropriate for the error surface of the model will in all probability lie between the lower and higher bounds as discussed above. Hence we do get to use the best LR when amortized over time. 125 | 126 | ### 5.5. Epoch, iterations, cycles and stepsize 127 | 128 | An epoch is one run of the training algorithm across the entire training set. If we set a batch size of 100, we get 500 batches in 1 epoch or 500 iterations. The iteration count is accumulated over epochs, so that in epoch 2, we get iterations 501 to 1000 for the same batch of 500, and so one. 129 | 130 | With that in mind, a cycle is defined as that many iterations where we want our learning rate to go from a base learning rate to a max learning rate, and back. And a stepsize is half of a cycle. Note that a cycle, in this case, need not fall on the boundary of an epoch, though in practice it does. 131 | 132 |

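A sketch of the triangular policy in the diagram above (the formula follows Smith's paper; base_lr, max_lr and stepsize are the quantities described next):

```python
import math

def triangular_lr(iteration, base_lr=1e-4, max_lr=1e-2, stepsize=1000):
    """Triangular CLR: linearly up from base_lr to max_lr over `stepsize` iterations,
    then linearly back down; one full cycle is 2 * stepsize iterations."""
    cycle = math.floor(1 + iteration / (2 * stepsize))
    x = abs(iteration / stepsize - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

lrs = [triangular_lr(t) for t in range(4000)]   # two full cycles
```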
) 133 | 134 | In the above diagram, we set a base lr and max lr for the algorithm, demarcated by the red lines. The blue line suggests the way learning rate is modified (in a triangular fashion), with the x-axis being the iterations. A complete up and down of the blue line is one cycle. And stepsize is half of that. 135 | 136 | * Deriving the optimal base lr and max lr: An optimal lower and upper bound of the learning rate can be found by letting the model run for a few epochs, letting the learning rate increase linearly and monitoring the accuracy. We run a complete step by setting stepsize equal to num_iterations (This will make the LR increase linearly and stop as num_iterations is reached). We also set base lr to a minimum value and max lr to a maximum value that we deem fit. 137 | * Deriving the optimal cycle length (or stepsize): The paper suggests, after experimentation, that the stepsize be set to 2-10 times the number of iterations in an epoch. In the previous example, since we had 500 iterations per epoch, setting stepsize from 1000 to 5000 would do. The paper found not much difference in setting stepsize to 2 times num of iterations in an epoch than 8 times so. 138 | 139 | ## 6. How to terain a state of the art classifier : 140 | 141 | 1. Use data augmentation and pretrained models 142 | 2. Use Cyclical Learning Rates to find highest learning rate where loss is still clearly improving 143 | 3. Train last layer from precomputed activations for 1-2 epochs 144 | 4. Train last layer with data augmentation for 2-3 epochs with stochastic gradient descent with restarts 145 | 5. Unfreeze all layers 146 | 6. Set earlier layers to 3x-10x lower learning rate than next higher layer 147 | 7. Use Cyclical Learning Rates again 148 | 8. Train full network with stochastic gradient descent with restarts with varying length until over-fitting 149 | 150 | ## 7. References: 151 | * [FastAi lecture 2 notes](https://medium.com/@hiromi_suenaga/deep-learning-2-part-1-lesson-2-eeae2edd2be4) 152 | * [The Cyclical Learning Rate technique](http://teleported.in/posts/cyclic-learning-rate/) -------------------------------------------------------------------------------- /cutting_edge_DL/lecture8.md: -------------------------------------------------------------------------------- 1 | 2 | * 1. [Tools](#Tools) 3 | * 1.1. [Pathlib](#Pathlib) 4 | * 1.2. [Tricks to using IDEs](#TrickstousingIDEs) 5 | * 1.3. [Matplotlib](#Matplotlib) 6 | * 1.4. [Python debugger](#Pythondebugger) 7 | * 2. [Object Detection](#ObjectDetection) 8 | * 2.1. [Load the data / annotations](#Loadthedataannotations) 9 | * 2.2. [Polting](#Polting) 10 | * 3. [Image classification](#Imageclassification) 11 | * 3.1. [The data](#Thedata) 12 | * 3.2. [The model](#Themodel) 13 | * 4. [Object localization](#Objectlocalization) 14 | * 4.1. [Boxes only](#Boxesonly) 15 | 16 | 20 | 21 | # Lecture 8: Object Detection 22 | 23 | ## 1. Tools 24 | 25 | ### 1.1. Pathlib 26 | 27 | - [Cheat Sheet](https://pbpython.com/pathlib-intro.html) 28 | 29 | Let's take a look at pathlib which was introduced in Python 3.4 and simplifies our work with files tremendously. 
With pathlib, we can directly call read_text from a pathlib variable, and the same process works to write to a file: 30 | 31 | ```python 32 | from pathlib import Path 33 | import json 34 | # Reading 35 | path = Path('file.json') 36 | contents = path.read_text() 37 | json_content = json.loads(contents)  # loads() parses a string, load() expects a file object 38 | # Writing 39 | json_content['name'] = 'Rad' 40 | path.write_text(json.dumps(json_content))  # dumps() returns the JSON as a string 41 | ``` 42 | 43 | We can also easily rename the file, and check if it exists or not: 44 | 45 | ```python 46 | path = Path('file.json') 47 | new_path = path.with_name('myfile.json') # PosixPath('myfile.json') 48 | path.rename(new_path) 49 | path.with_suffix('.txt') # To change the extension only 50 | path.exists() # False 51 | new_path.exists() # True 52 | ``` 53 | 54 | Our file was renamed in a few easy steps, in a nice object oriented fashion. The paths returned by pathlib are of type PosixPath; PosixPath represents our path instance, which enables us to do all sorts of operations, like creating a new directory, checking for existence, checking for file type, getting the size, checking for user, group, permissions, etc. Basically everything we would previously do with the os.path module, like fetching a certain type of file with `glob`: 55 | 56 | ```python 57 | from pathlib import Path 58 | tmp = Path('json') 59 | files = list(tmp.glob('*.json')) 60 | # We can also iterate through sub-directories using iterdir 61 | tmp = Path('tmp') 62 | files_in_dir = list(tmp.iterdir()) 63 | ``` 64 | 65 | We can also traverse folders (append to the original path) using a simple slash / operator and even create new paths. 66 | 67 | ```python 68 | tmp = Path('tmp') 69 | new_path = tmp / 'pictures' / 'camera' # PosixPath('tmp/pictures/camera') 70 | new_path.mkdir(parents=True) 71 | new_path.exists() # True 72 | ``` 73 | 74 | ### 1.2. Tricks to using IDEs 75 | 76 | Some VSCode tricks and shortcuts: 77 | 78 | * Command palette (`Ctrl-shift-p`) 79 | * Go to symbol (`Ctrl-t`) 80 | * Find references (`Shift-F12`) 81 | * Go to definition (`F12`) 82 | * Go back (`alt-left`) 83 | * Hide sidebar (`Ctrl-b`) 84 | * Zen mode (`Ctrl-k,z`) 85 | 86 | ### 1.3. Matplotlib 87 | 88 | One important big-picture matplotlib concept is its object hierarchy. `plt.plot([1, 2, 3])` hides the fact that a plot is really a hierarchy of nested Python objects. A “hierarchy” here means that there is a tree-like structure of matplotlib objects underlying each plot. 89 | 90 | A Figure object is the outermost container for a matplotlib graphic, which can contain multiple Axes objects. One source of confusion is the name: an Axes actually translates into what we think of as an individual plot or graph (rather than the plural of “axis,” as we might expect). 91 | 92 | We can think of the Figure object as a box-like container holding one or more Axes (actual plots). Below the Axes in the hierarchy are smaller objects such as tick marks, individual lines, legends, and text boxes. Almost every “element” of a chart is its own manipulable Python object, all the way down to the ticks and labels (a short code sketch follows the figure below): 93 | 94 |

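A short sketch of that hierarchy (plain matplotlib, nothing fastai-specific):

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))   # one Figure holding two Axes
axes[0].plot([1, 2, 3])
axes[1].scatter([1, 2, 3], [3, 1, 2])

print(type(fig))                      # <class 'matplotlib.figure.Figure'>, the outermost container
print(fig.axes)                       # the two Axes objects owned by the Figure
print(axes[0].xaxis)                  # an Axis object, one level further down
print(axes[0].get_xticklabels()[:3])  # even individual tick labels are objects
plt.show()
```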
95 | 96 | [More info](https://forums.fast.ai/t/deeplearning-lec8-notes/13684) 97 | 98 | ### 1.4. Python debugger 99 | 100 | We can use the python debugger pdb to step through code, to do this we have to choices: 101 | 102 | * pdb.set_trace() to set a breakpoint 103 | * %debug magic to trace an error (after the exception happened) 104 | 105 | Useful Commands: 106 | 107 | * h (help) 108 | * s (step into) 109 | * n (next line / step over, we can also hit enter) 110 | * c (continue to the next breakpoint) 111 | * u (up the call stack) 112 | * d (down the call stack) 113 | * p (print), force print when there is a single letter variable that’s also a command. 114 | * l (list), show the line above and below it 115 | * q (quit), very important 116 | 117 | Documentation: 118 | 119 | ``` 120 | Documented commands (type help ): 121 | ======================================== 122 | EOF c d h list q rv undisplay 123 | a cl debug help ll quit s unt 124 | alias clear disable ignore longlist r source until 125 | args commands display interact n restart step up 126 | b condition down j next return tbreak w 127 | break cont enable jump p retval u whatis 128 | bt continue exit l pp run unalias where 129 | 130 | Miscellaneous help topics: 131 | ========================== 132 | exec pdb 133 | ``` 134 | 135 | 136 | ## 2. Object Detection 137 | 138 | The goal of object detection is to classify multiple things, and find the bounding boxes around the objects which we're classifying, the object has to be entirely within the bounding box, but not bigger then it needs to be. 139 | 140 | We'll tackle this problem in three stages: 141 | 142 | 1. Find the largest item in the image, classic image classification. 143 | 2. Propose a bounding box for the main object in the image. 144 | 3. Do both of things with an end to end network. 145 | 146 |

147 | 148 | ### 2.1. Load the data / annotations 149 | 150 | We'll use Pascal Voc 2007 ([Download link](http://pjreddie.com/media/files/VOCtrainval_06-Nov-2007.tar) with json annotation, a more simpler version than the original xml ones), which is a dataset that provides the annotations (bounding boxes) of the objects present in the image, there is another version (2012), containing more images, but it this lecture we're only going to use the smaller one, we could combine them, but we need to be careful to not have any data leakage between the two validation sets if we do this. 151 | 152 | ```python 153 | PATH = Path('data/pascal') 154 | list(PATH.iterdir()) 155 | # [PosixPath('data/pascal/VOCdevkit'), ... 156 | 157 | trn_j = json.load((PATH/'pascal_train2007.json').open()) 158 | trn_j.keys() # ['images', 'type', 'annotations', 'categories'] 159 | ``` 160 | 161 | The images field contains all the infomation about the images in the dataset, their name, the height and width and their IDs. for the annotations, here we have the bounding box annotations, the category ID of the object enclosed in the box, the segmentation and the box (xmin, ymin, xmax, ymax). 162 | 163 | IMAGES,ANNOTATIONS,CATEGORIES = ['images', 'annotations', 'categories'] 164 | 165 | ```python 166 | trn_j[IMAGES][0] # {'file_name': '000012.jpg', 'height': 333, 'id': 12, 'width': 500} 167 | trn_j[CATEGORIES][0] # {'id': 1, 'name': 'aeroplane', 'supercategory': 'none'} 168 | trn_j[ANNOTATIONS][0] 169 | ``` 170 | 171 | ``` 172 | {'area': 34104, 173 | 'bbox': [155, 96, 196, 174], 174 | 'category_id': 7, 175 | 'id': 1, 176 | 'ignore': 0, 177 | 'image_id': 12, 178 | 'iscrowd': 0, 179 | 'segmentation': [[155, 96, 155, 270, 351, 270, 351, 96]]} 180 | ``` 181 | 182 | It’s helpful to use constants instead of strings, since we get tab-completion and don’t mistype. so then we can create dicts to convert from IDs to class names, the same thing for image IDs and their names, and we also create a list containing all of the image IDs in the training set. 183 | 184 | ```python 185 | FILE_NAME, ID, IMG_ID, CAT_ID, BBOX = 'file_name', 'id', 'image_id', 'category_id', 'bbox' 186 | 187 | cats = dict((o[ID], o['name']) for o in trn_j[CATEGORIES]) 188 | trn_fns = dict((o[ID], o[FILE_NAME]) for o in trn_j[IMAGES]) 189 | trn_ids = [o[ID] for o in trn_j[IMAGES]] 190 | ``` 191 | 192 | Now, after loading the annotations, we'll take a look into the folder containing the images, in PASCAL VOC, the images are in the folder `VOCdevkit/VOC2007/JPEGImages` 193 | 194 | ```python 195 | JPEGS = 'VOCdevkit/VOC2007/JPEGImages' 196 | IMG_PATH = PATH/JPEGS 197 | ``` 198 | 199 | To simplify the process of using the annotations, we create a dictionary, where the keys are image IDs and the the values are a list of all the bounding boxes in the image, in such cases, a `defaultdict` is useful any time we want to have a default dictionary entry for new keys, so in case we want to access a new key, if it is not in the dict it'll be created, depending on the type we desire, in this case we'd like lists to be created. 
200 | 201 | ```python 202 | trn_anno = collections.defaultdict(lambda:[]) 203 | for o in trn_j[ANNOTATIONS]: 204 | if not o['ignore']: 205 | bb = o[BBOX] 206 | bb = np.array([bb[1], bb[0], bb[3]+bb[1]-1, bb[2]+bb[0]-1]) 207 | trn_anno[o[IMG_ID]].append((bb,o[CAT_ID])) 208 | ``` 209 | 210 | Here we switched between x and y, in the original annotation, we have (xmin, ymin, xmax, ymax), which is the standard in computer vision, but given that in matrix / tensors, we start by the row (y) and then we index the columns (x), we changed them, and we create a new function to convert between the two. 211 | 212 | ```python 213 | def bb_hw(a): return np.array([a[1],a[0],a[3]-a[1],a[2]-a[0]]) 214 | ``` 215 | 216 | ### 2.2. Polting 217 | 218 | Matplotlib’s `plt.subplots` is a useful wrapper for creating plots, regardless of whether we have more than one subplot. Here is a simple function used to create an axis with a subplot, that we'll pass to other function to add / draw some new elements in the plot. 219 | 220 | ```python 221 | def show_img(im, figsize=None, ax=None): 222 | if not ax: fig,ax = plt.subplots(figsize=figsize) 223 | ax.imshow(im) 224 | ax.get_xaxis().set_visible(False) 225 | ax.get_yaxis().set_visible(False) 226 | return ax 227 | ``` 228 | 229 | A simple but rarely used trick to making text visible regardless of background is to use white text with black outline, or vice versa. Here’s how to do it in matplotlib. 230 | 231 | ```python 232 | def draw_outline(o, lw): 233 | o.set_path_effects([patheffects.Stroke( 234 | linewidth=lw, foreground='black'), patheffects.Normal()]) 235 | ``` 236 | 237 | And here are two little function to draw the bounding boxes, passing the x, y and H / W, that are unpacked, and the text (box labels) 238 | 239 | ```python 240 | def draw_rect(ax, b): 241 | patch = ax.add_patch(patches.Rectangle(b[:2], *b[-2:], fill=False, edgecolor='white', lw=2)) 242 | draw_outline(patch, 4) 243 | 244 | def draw_text(ax, xy, txt, sz=14): 245 | text = ax.text(*xy, txt, 246 | verticalalignment='top', color='white', fontsize=sz, weight='bold') 247 | draw_outline(text, 1) 248 | ``` 249 | 250 | And now we can reuse the created function above to show multiple objects on a single image by simply plotting the image and iterating over the bounding boxes one by one. 251 | 252 | ```python 253 | def draw_im(im, ann): 254 | ax = show_img(im, figsize=(16,8)) 255 | for b,c in ann: 256 | b = bb_hw(b) 257 | draw_rect(ax, b) 258 | draw_text(ax, b[:2], cats[c], sz=16) 259 | ``` 260 | 261 | And a more general function, that takes the ID, loads the image and the annotation, and then displays them. 262 | 263 | ```python 264 | def draw_idx(i): 265 | im_a = trn_anno[i] 266 | im = open_image(IMG_PATH/trn_fns[i]) 267 | print(im.shape) 268 | draw_im(im, im_a) 269 | ``` 270 | 271 |

272 | 273 | ## 3. Image classification 274 | 275 | ### 3.1. The data 276 | 277 | The first and easiest step is to start with a simple image classification task, by predicting the classes of the largest object in the image, so we need to start by creating the labels of the images in the dataset using the areas of each bounding box, for this we can use a simple lambda function, and call it in the appropriate function that returns the largest box, and then call the function in a dictionary comprehension to associate with each image / image ID the largest box, and each box is a tuple with its four coordinates and the associated class (or its category id to be precise). 278 | 279 | ```python 280 | sorted(b, key=lambda x: np.product(x[0][-2:]-x[0][:2]), reverse=True) 281 | 282 | def get_lrg(b): 283 | if not b: raise Exception() 284 | b = sorted(b, key=lambda x: np.product(x[0][-2:]-x[0][:2]), reverse=True) 285 | return b[0] 286 | 287 | trn_lrg_anno = {a: get_lrg(b) for a,b in trn_anno.items()} 288 | ``` 289 | 290 | To avoid the need to apply these steps each time we want to use the dataset, we store the image indexed by `img_id` and the corresponding largest box in a Pandas data frame, we also specify the columns to force the correct ordering of them. 291 | 292 | ```python 293 | (PATH/'tmp').mkdir(exist_ok=True) 294 | CSV = PATH/'tmp/lrg.csv' 295 | df = pd.DataFrame({'fn': [trn_fns[o] for o in trn_ids], 296 | 'cat': [cats[trn_lrg_anno[o][1]] for o in trn_ids]}, columns=['fn','cat']) 297 | df.to_csv(CSV, index=False) 298 | ``` 299 | 300 | ### 3.2. The model 301 | 302 | First we'll resize all the image to a fixed size, and without any cropping, only rescaling the image to the right size given that the largest object may not be centred in the image unlike the instances in imagenet per example, and we then choose the model (a simple resnet 34) and create a data loader of the dataset, a data loader is an iterator that will provide a mini-batch of data at each training iteration. But first we need to ensure that we start at the beginning of the dataset. Pythons’ iter() method will create an iterator object and start at the beginning of the dataset. And afterwards our iterator will have `__next__` that can be used to pull a mini-batch, the element returned by the dataloader are in the GPU, and are in the form of torch tensors, and are nomalized between 0 and 1. 303 | 304 | ```python 305 | f_model = resnet34 306 | sz=224 307 | bs=64 308 | 309 | tfms = tfms_from_model(f_model, sz, aug_tfms=transforms_side_on, crop_type=CropType.NO) 310 | md = ImageClassifierData.from_csv(PATH, JPEGS, CSV, tfms=tfms) 311 | md.trn_dl # 312 | ``` 313 | 314 | And like in earlier lessons, we start by doing a learning rate finder `lrf = learn.lr_find(1e-5,100)`, and we find that a learning rate of `lr = 2e-2` is appropriate, and then we fit our model for one cycle, and then we freeze all layers except the last two layers and use differential learning rates for each one (last two + the head), find new learning rate and retrain the model for one cycle, and then unfreeze all the layers and fine tuning a little bit, but we see that the accuracy is stuck at 0.83 given that sometime the biggest box is not the obvious one to be detected. 
315 | 316 | ```python 317 | lr = 2e-2 318 | learn.fit(lr, 1, cycle_len=1) 319 | learn.freeze_to(-2) 320 | 321 | lrs = np.array([lr/1000,lr/100,lr]) 322 | learn.freeze_to(-2) 323 | learn.fit(lrs/5, 1, cycle_len=1) 324 | 325 | learn.unfreeze() 326 | learn.fit(lrs/5, 1, cycle_len=2) 327 | ``` 328 | 329 | The results are quite good: 330 | 331 |

332 | 333 | ## 4. Object localization 334 | 335 | Now instead of only predecting the label of the largest object, we're also going to predict the 4 coordinates of the box of the object in question. To predict the bounding box, we'll need a new output branch of the model, this branch will be a regression branch outputing real values corresponding to image coordinates, and the loss function will be a MSE or a L1 loss to not overblow the loss when the error is large, and for the classification we have the same classification output with a softmax output and a cross entropy loss. 336 | 337 | **Boxes only**: First we only going to have a regression output, given that the original resnet output a classification with a vector of 1000 for the probabilities of all the classes in the imagenet dataset, or a 21 classes in our case of PASCAL VOC, we're going to have a model with only four output, so we're going to have a conv layer going from 7x7x512 to 1x1x4 instead of an average pool folowed by a linear layer, and we can also flatten the 7x7x512 == 25088 and then add a linear layer of 25088 -> 4. 338 | 339 | For the dataset, we creat a new datafama, and this time with two fields are the image filenames and their bounding boxes: 340 | 341 | ```python 342 | bb = np.array([trn_lrg_anno[o][0] for o in trn_ids]) 343 | bbs = [' '.join(str(p) for p in o) for o in bb] 344 | 345 | df = pd.DataFrame({'fn': [trn_fns[o] for o in trn_ids], 'bbox': bbs}, columns=['fn','bbox']) 346 | df.to_csv(BB_CSV, index=False) 347 | ``` 348 | 349 | Adding a custom head to the resnet 34, and then per the usual, find the LR rate, and train the head, gradually unfreeze the layers and train them with differential learning rates: 350 | 351 | ```python 352 | head_reg4 = nn.Sequential(Flatten(), nn.Linear(25088,4)) 353 | learn = ConvLearner.pretrained(f_model, md, custom_head=head_reg4) 354 | learn.opt_fn = optim.Adam 355 | learn.crit = nn.L1Loss() 356 | 357 | lr = 2e-3 358 | learn.fit(lr, 2, cycle_len=1, cycle_mult=2) 359 | lrs = np.array([lr/100,lr/10,lr]) 360 | learn.freeze_to(-2) 361 | learn.fit(lrs, 2, cycle_len=1, cycle_mult=2) 362 | learn.fit(lrs, 1, cycle_len=2) 363 | learn.freeze_to(-3) 364 | ``` 365 | 366 | And then results are somehow correct, when there is only one object the network predicts a correct bounding box, but when there is multipple object the model does not know which one to choose so it the chooses the middle, or when the object is small and not the obvious: 367 | 368 |

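For the third stage (classifying and locating the largest object with a single network), the two outputs can simply be concatenated in one custom head and trained with the sum of the two losses. A sketch of the idea in plain PyTorch follows; the layer sizes and the sigmoid scaling to the 224-pixel image are illustrative choices, not the exact head used in the lesson:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_classes = 21   # number of categories used above for PASCAL VOC

# Custom head on top of the flattened 7x7x512 resnet34 feature map:
# 4 bounding box coordinates + one score per class, in a single output vector
head_reg_clas = nn.Sequential(
    nn.Flatten(),
    nn.Linear(7 * 7 * 512, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 4 + n_classes))

def detn_loss(pred, bbox_target, class_target):
    # first 4 outputs -> bounding box (L1 loss), remaining outputs -> class scores (cross entropy)
    bb_pred, class_pred = pred[:, :4], pred[:, 4:]
    bb_pred = torch.sigmoid(bb_pred) * 224          # squash box outputs into image coordinates
    return F.l1_loss(bb_pred, bbox_target) + F.cross_entropy(class_pred, class_target)

# usage: pred = head_reg_clas(features); loss = detn_loss(pred, bbox_batch, class_batch)
```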
-------------------------------------------------------------------------------- /intro_to_ML/lecture2.md: -------------------------------------------------------------------------------- 1 | 2 | - [Lecture 2 - Random Forest Deep Dive](#Lecture-2---Random-Forest-Deep-Dive) 3 | - [1. Model Evaluation](#1-a-nameModelEvaluationaModel-Evaluation) 4 | - [1.1. R2 Score](#11-a-nameR2ScoreaR2-Score) 5 | - [1.2. Validation set for the Kaggle data](#12-a-nameValidationsetfortheKaggledataaValidation-set-for-the-Kaggle-data) 6 | - [2. Random Forest](#2-a-nameRandomForestaRandom-Forest) 7 | - [2.1. Building a single tree](#21-a-nameBuildingasingletreeaBuilding-a-single-tree) 8 | - [2.2. Side Note: Using Entropy to pick the nodes](#22-a-nameSideNoteUsingEntropytopickthenodesaSide-Note-Using-Entropy-to-pick-the-nodes) 9 | - [2.3. Bagging](#23-a-nameBaggingaBagging) 10 | - [2.4. Out-of-bag (OOB) score](#24-a-nameOut-of-bagOOBscoreaOut-of-bag-OOB-score) 11 | - [2.5. Subsampling](#25-a-nameSubsamplingaSubsampling) 12 | - [2.6. Overfitting](#26-a-nameOverfittingaOverfitting) 13 | 14 | 18 | 19 | # Lecture 2 - Random Forest Deep Dive 20 | 21 | 22 | ## 1. Model Evaluation 23 | 24 | Let's first statr by introduction the measure we'll use to evaluate our model: 25 | 26 | ### 1.1. R2 Score 27 | In an R2 Score, we compare the models error to the error made by the most non-stupid model we can use (always predicting the average), so the range of R2 is [-inf, 1], one is the perfect score, and when the R² is negative, it means that the model is worse than predicting the mean. 28 | 29 | So in a nutshell, R² is the ratio between how good the model is vs. the naïve mean model. 30 | 31 | [comment]: <> ($$R ^ { 2 } \equiv 1 - \frac { S S _ { \mathrm { res } } } { S S _ { \mathrm { tot } } }$$) 32 | 33 | 34 | 35 | Where: 36 | 37 | * The total sum of squares (proportional to the variance of the data): 38 | * The sum of squares of residuals, also called the residual sum of squares: 39 | 40 |

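Written out, with $y_i$ the true values, $\bar{y}$ their mean and $\hat{y}_i$ the model's predictions:

$$SS_{\mathrm{tot}} = \sum_i (y_i - \bar{y})^2, \qquad SS_{\mathrm{res}} = \sum_i (y_i - \hat{y}_i)^2, \qquad R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}$$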
41 | 42 | R² is not necessarily what we are actually trying to optimize, but it is a number we can use for every model and we can start to get a feel of what .8 looks like or what .9 looks like. Something we may find interesting is to create synthetic 2D datasets with different amounts of random noise, and see what they look like on a scatterplot and their R² to get a feel of how close they are to the actual value. 43 | 44 | ### 1.2. Validation set for the Kaggle data 45 | 46 | If the dataset has a time piece in it (as is in Blue Book competition), we would likely to predict future prices/values/etc. What Kaggle did was to give us data representing a particular date range in the training set, and then the test set presented a future set of dates that wasn’t represented in the training set. So we need to create a validation set (for hyper-parameter tuning without over fitted to avoid to the training data) that has the same properties: 47 | 48 | ```python 49 | def split_vals(a,n): return a[:n].copy(), a[n:].copy() 50 | 51 | n_valid = 12000 # same as Kaggle's test set size 52 | n_trn = len(df)-n_valid 53 | raw_train, raw_valid = split_vals(df_raw, n_trn) 54 | X_train, X_valid = split_vals(df, n_trn) 55 | y_train, y_valid = split_vals(y, n_trn) 56 | ``` 57 | 58 | But there is always the possibility of eventually overfitting on the validation set and when we try it on the test set or submit it to Kaggle, it turns out not to be very good. This happens in Kaggle competitions all the time and they actually have a fourth dataset which is called the private leader board set. Every time we submit to Kaggle, we actually only get feedback on how well it does on the public leader board set and we do not know which rows they are. At the end of the competition, we get judged on a different dataset entirely called the private leader board set. 59 | 60 | ## 2. Random Forest 61 | 62 | Now let's jump into the prediction fuction we'll use, or the model, which is a tree based ensamble called Random Forest 63 | 64 | ### 2.1. Building a single tree 65 | 66 | To unserstand Random Forest, let's begin by taking a look into trees, for that we'll use `n_estimators=1` to create a forest with just one tree and `max_depth=3` to make it a small tree 67 | 68 | ```python 69 | m = RandomForestRegressor(n_estimators=1, max_depth=3, 70 | bootstrap=False, n_jobs=-1) 71 | m.fit(X_train, y_train) 72 | print_score(m) 73 | ``` 74 | 75 |

76 | 77 | The first logical question is how to find the best split of our data into two sub-groups: 78 | 79 | * We need to pick a variable and the value to split on such that the two groups are as different to each other as possible and have similar elements internally. 80 | * To find the best variable, and the best value to split on for that variable, we can go through all the variables and all the possible values of the selected variable, do a split and calculate the weighted average of the MSEs of the two resulting groups = MSE1 * samples1/samples + MSE2 * samples2/samples, and then choose the variable/value with the smallest error (note: the MSE of the root node corresponds to the naive model, i.e. predicting the average). 81 | 82 | Right now, our decision tree has R² of 0.4. Let’s make it better by removing max_depth=3. By doing so, the training R² becomes 1 (as expected since each leaf node contains exactly one element) and validation R² is 0.73, which is better than the shallow tree but not as good as we would like. 83 | 84 | 85 | ### 2.2. Side Note: Using Entropy to pick the nodes 86 | Instead of using MSE to differentiate between different splits, and find the best ones, we can also use the entropy of the leaves (how homogeneous they are, or how much the entropy was reduced): 87 | 88 | 1- Given a chosen attribute A, with K distinct values, it divides the training set E into subsets E1 , ... , EK. 89 | 90 | 2- The Expected Entropy (EH) remaining after trying attribute A (with branches i=1,2,..., K) is: 91 | 92 | $$EH(A) = \sum_{i=1}^{K} \frac{|E_i|}{|E|} H(E_i)$$ 93 | 94 | 3- Information gain (I) or reduction in entropy for this attribute is (entropy before the split - entropy after it): 95 | 96 | $$I(A) = H(E) - EH(A)$$ 97 | 98 | 4- Choose the attribute with the largest I 99 | 100 | ### 2.3. Bagging 101 | 102 | *Let’s make our decision tree better.* To make these trees better, we will create a forest. To create a forest, we will use a statistical technique called **bagging**. 103 | 104 | **Bagging** is an interesting idea: what if we created five different models, each of which was only somewhat predictive, but whose predictions were not correlated with each other? That would mean that the five models would have profoundly different insights into the relationships in the data. If we took the average of those five models, we are effectively bringing in the insights from each of them. So this idea of averaging models is a technique for **Ensembling**. 105 | 106 | What if we created a whole lot of big, deep, massively overfit trees, but for each one, let’s say, we only pick a random 1/10 of the data and we do that a hundred times (a different random sample every time). They are overfitting terribly but since they are all using different random samples, they all overfit in different ways on different things. In other words, they all have errors but the errors are random / not correlated. The average of random errors is zero. If we take the average of these trees, each of which has been trained on a different random subset, the error will average out to zero and what is left is the true relationships, and that’s the random forest. 107 | 108 | The key insight here is to construct multiple models which are better than nothing and where the errors are, as much as possible, not correlated with each other. 
109 | 110 | And it turns out, the more uncorrelated the trees, the better the results, now even with random samples, when we built a large number of trees, the trees will be somehow correlated, so we can always try to add more randomization to the process of creating the trees, say like choosing the variables to split on radomly and not evaluating all the variables to find the best splits. Like `ExtraTreeClassifier ` in skitlearn 111 | 112 | **Code example:** 113 | ```python 114 | preds = np.stack([t.predict(X_valid) for t in m.estimators_]) 115 | preds[:,0], np.mean(preds[:,0]), y_valid[0] 116 | 117 | (array([ 9.21034, 8.9872 , 8.9872 , 8.9872 , 8.9872 , 9.21034, 8.92266, 9.21034, 9.21034, 8.9872 ]), 118 | 9.0700003890739005, 119 | 9.1049798563183568) 120 | ``` 121 | 122 | Each tree is stored in an attribute called `estimators_` . For each tree, we will call predict with our validation set. `np.stack` concatenates them together on a new axis, so the resulting `preds` has the shape of (10, 12000) (10 trees, 12000 examples in the validation set). The mean of 10 predictions for the first data point preds[:,0] is 9.07, and the actual value is 9.10. As we can see, none of the individual prediction is close to 9.10, but the mean ends up pretty good. 123 | 124 |

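The curve in the plot above can be computed by truncating the ensemble to its first few trees and scoring the averaged predictions, reusing `preds` and `y_valid` from the code example (a sketch):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics

# R^2 of the ensemble restricted to its first i+1 trees
r2_scores = [metrics.r2_score(y_valid, np.mean(preds[:i+1], axis=0))
             for i in range(len(preds))]
plt.plot(r2_scores)
plt.xlabel('number of trees averaged')
plt.ylabel('validation $R^2$')
```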
125 | 126 | Here is a plot of R² values given first *X* trees. As we add more trees, R² improves. But it seems as though it has flattened out. 127 | 128 | ### 2.4. Out-of-bag (OOB) score 129 | 130 | Sometimes the dataset will be small and we will not want to pull out a validation set because doing so means we now do not have enough data to build a good model. However, random forests have a very clever trick called out-of-bag (OOB) error which can handle this. 131 | 132 | ```python 133 | m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True) 134 | ``` 135 | 136 | What we could do is to recognize that in our first tree, some of the rows did not get used for training. so we can pass those unused rows through the first tree and treat it as a validation set for this first tree. For the second tree, we could pass through the rows that were not used as the second val set, and so on. Effectively, we would have a different validation set for each tree. To calculate our prediction, we would average all the trees where that row is not used for training. If we have hundreds of trees, it is very likely that all of the rows are going to appear many times in these out-of-bag samples. we can then calculate RMSE, R², etc on these out-of-bag predictions. 137 | 138 | ### 2.5. Subsampling 139 | 140 | Earlier, we took 30,000 rows and created all the models which used a different subset of that 30,000 rows to speed things up. Why not take a totally different subset of 30,000 each time? In other words, let’s leave the entire 389,125 records as is, and if we want to make things faster, pick a different subset of 30,000 each time. So rather than bootstrapping the entire set of rows, just randomly sample a subset of the data. 141 | 142 | ```python 143 | def set_rf_samples(n): 144 | """ Changes Scikit learn's random forests to give each tree a random sample of n random rows. 145 | """ 146 | forest._generate_sample_indices = (lambda rs, n_samples: 147 | forest.check_random_state(rs).randint(0, n_samples, n)) 148 | ``` 149 | 150 | This will take the same amount of time to run as before, but every tree has an access to the entire dataset. After using 40 estimators, we get the R² score of 0.876. 151 | 152 | **TIP:** Most people run all of their models on all of the data all of the time using their best possible parameters which is just pointless. If we are trying to find out which feature is important and how they are related to each other, having that 4th decimal place of accuracy is not going to change any of the insights at all. So we can first do most of our teston a large enough sample size that is representative of accuracy but with a resonable size (within a reasonable distance of the best accuracy we can get) and taking a small number of seconds to train so that we can interactively do our analysis. 153 | 154 | **A quick summary** [Link](https://stackoverflow.com/questions/18541923/what-is-out-of-bag-error-in-random-forests) 155 | 156 | Suppose our training data set is represented by T and suppose data set has M features (or attributes or variables). 157 | 158 | T = {(X1,y1), (X2,y2), ... (Xn, yn)} 159 | 160 | and 161 | 162 | Xi is input vector {xi1, xi2, ... xiM} 163 | yi is the label (or output or class). 164 | 165 | Random Forests algorithm is a classifier based on primarily two methods: 166 | - Bagging 167 | - Random subspace method. 
168 | 169 | Suppose we decide to have S number of trees in our forest then we first create S datasets of "same size as original" created from random resampling of data in T with-replacement (n times for each dataset). This will result in `{T1, T2, ... TS}` datasets. Each of these is called a bootstrap dataset. Due to "with-replacement" every dataset Ti can have duplicate data records and Ti can be missing several data records from original datasets. This is called *Bootstrapping*. And Bagging is the process of taking bootstraps & then aggregating the models learned on each *bootstrap*. 170 | 171 | Now, RF creates `S` trees and uses m (sqrt(M) or ln(M+1)) random subfeatures out of M possible features to create any tree. This is called *Random subspace method*. 172 | 173 | So for each Ti bootstrap dataset we create a tree `Ki`. If we want to classify some input data `D = {x1, x2, ..., xM}`, we pass it through each tree and produce S outputs (one for each tree) which can be denoted by `Y = {y1, y2, ..., ys}`. Final prediction is a majority vote on this set (or the avearge). 174 | 175 | *Out-of-bag error:* After creating the classifiers (`S` trees). To evaluate our model, for each `(Xi,yi)` in the original training set T, we select all `Tk` which does not include the said exampke `(Xi,yi)`. This subset is a set of boostrap datasets which does not contain a particular record from the original dataset. This set is called out-of-bag examples. There are n such subsets (one for each data record in original dataset T). OOB classifier is the aggregation of votes only over `Tk` such that it does not contain `(Xi,yi)`. 176 | 177 | Out-of-bag estimate for the generalization error is the error rate of the out-of-bag classifier on the training set (compare it with known yi's). 178 | 179 | ### 2.6. Overfitting 180 | 181 | One way to reduce over-fitting is to grow our trees less deeply. We do this by specifying (with `min_samples_leaf`) that we require some minimum number of rows in every leaf node. This has two benefits: 182 | 183 | - There are less decision rules for each leaf node; simpler models should generalize better 184 | - The predictions are made by averaging more rows in the leaf node, resulting in less volatility 185 | 186 | For each tree, rather than just taking one point, we are taking the average of at least three points that we would expect the each tree to generalize better. But each tree is going to be slightly less powerful on its own. The numbers that work well are 1, 3, 5, 10, 25, but it is relative to the overall dataset size. 187 | 188 | ```python 189 | m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, n_jobs=-1, oob_score=True) 190 | m.fit(X_train, y_train) 191 | ``` 192 | 193 | Another way is to pick a random number of columns using `max_feature`, The idea is that the less correlated the trees are with each other, the better. Imagine if we had one column that was so much better than all of the other columns of being predictive that every single tree we built always started with that column. But there might be some interaction of variables where that interaction is more important than the individual column. So if every tree always splits on the same thing the first time, we will not get much variation in those trees. For example, `max_feature = 0.5`, we only use half of the variables to find the best splits. 
194 | 195 | ```python 196 | m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True) 197 | m.fit(X_train, y_train) 198 | ``` 199 | 200 | -------------------------------------------------------------------------------- /parctical_DL/lecture4.md: -------------------------------------------------------------------------------- 1 | 2 | * 1. [Structured and Time Series Data](#StructuredandTimeSeriesData) 3 | * 1.1. [Joining the data](#Joiningthedata) 4 | * 1.2. [Preprocessing](#Preprocessing) 5 | * 1.3. [The model, Embeddings and training:](#ThemodelEmbeddingsandtraining:) 6 | * 2. [NLP](#NLP) 7 | * 2.1. [spaCy](#spaCy) 8 | * 2.2. [Data](#Data) 9 | * 2.3. [Preprocessing](#Preprocessing-1) 10 | * 2.4. [The model](#Themodel) 11 | * 2.5. [Sentiment analysis](#Sentimentanalysis) 12 | 13 | 17 | 18 | 19 | # Lecture 4: Embeddings 20 | 21 | ## 1. Structured and Time Series Data 22 | 23 | There are two types of columns: 24 | 25 | * Categorical, It has a number of “levels” e.g. StoreType, Assortment. 26 | * Continuous, It has a number where differences or ratios of that numbers have some kind of meanings e.g. `CompetitionDistance`. 27 | 28 | Numbers like `Year` , `Month`, although we could treat them as continuous, we do not have to. If we decide to make `Year` a categorical variable, we are telling our neural net that for every different “level” of `Year` (2000, 2001, 2002), we can treat it totally differently; where-else if we say it is continuous, it has to come up with some kind of smooth function to fit them. So often things that actually are continuous but do not have many distinct levels (e.g. Year, DayOfWeek), it often works better to treat them as categorical. 29 | 30 | Choosing categorical vs. continuous variable is a modeling decision we get to make. In summary, if it is categorical in the data, it has to be categorical. If it is continuous in the data, we get to pick whether to make it continuous or categorical in the model. Generally, floating point numbers are hard to make categorical as there are many levels (Cardinality). 31 | 32 | ### 1.1. Joining the data 33 | In the competition, the winners used additional data like state names, google trends and weather, so first we need to joint this dataframes with different sizes, this is done using the `merge` method. The `suffixes` argument describes the naming convention for duplicate fields. We've elected to leave the duplicate field names on the left untouched, and append a "\_y" to those on the right. 34 | 35 | `join_df` is a function for joining tables on specific fields. By default, we'll be doing a left outer join of `right` on the `left` argument using the given fields for each table. 36 | 37 | ```python 38 | def join_df(left, right, left_on, right_on=None, suffix='_y'): 39 | if right_on is None: right_on = left_on 40 | return left.merge(right, how='left', left_on=left_on, right_on=right_on, 41 | suffixes=("", suffix)) 42 | 43 | weather = join_df(weather, state_names, "file", "StateName") 44 | ``` 45 | 46 | Here we join state_names to the right of weather, the name of the field is "StateName". 47 | 48 | ### 1.2. Preprocessing 49 | 50 | * Turn all the continuous ones into 32bit floating point for pytorch. 51 | * Pull out the dependent variable it into a separate variable, and deletes it from the original data frame. (the value to be predicted is deleted from `df`, and now we have a single column dataframe with the target value). 
52 | * Neural nets like the input data to all be somewhere around zero with a standard deviation of somewhere around 1. So we take our data, subtract the mean, and divide by the standard deviation to make that happen. It returns a special object which keeps track of what mean and standard deviation it used for that normalization, so we can do the same to the test set later (mapper). 53 | * Handling missing values: for a categorical variable, the missing value becomes its own category (ID 0) and the other categories become 1, 2, 3, and so on. For a continuous variable, it replaces the missing value with the median and creates a new boolean column that says whether the value was missing or not. 54 | 55 | Now we have a data frame which does not contain the dependent variable and where everything is a number. Now we're ready for deep learning. 56 | 57 | ### 1.3. The model, Embeddings and training: 58 | 59 | As per usual, we will start by creating a model data object which has a validation set, training set, and optional test set built into it. From that, we will get a learner, we will then optionally call lr_find, then call learn.fit and so forth. 60 | 61 | The continuous variables are fed directly into the network; for the categorical variables, instead of representing them as one-hot vectors, we'll embed them into a lower-dimensional space, using a different embedding matrix for each column depending on the cardinality of that column. For example, here are the categorical columns we have and their cardinalities: 62 | 63 | ```python 64 | [('Store', 1116), ('DayOfWeek', 8), ('Year', 4), ('Month', 13), ('Day', 32), ('StateHoliday', 3), ('CompetitionMonthsOpen', 26), ('Promo2Weeks', 27), ('StoreType', 5), ('Assortment', 4), ('PromoInterval', 4), ('CompetitionOpenSinceYear', 24), ('Promo2SinceYear', 9), ('State', 13), ('Week', 53), ('Events', 22), ('Promo_fw', 7), ('Promo_bw', 7), ('StateHoliday_fw', 4), ('StateHoliday_bw', 4), ('SchoolHoliday_fw', 9), ('SchoolHoliday_bw', 9)] 65 | ``` 66 | 67 | Now, depending on the cardinality of each categorical column, we choose an embedding size, with a maximum of 50: 68 | 69 | ```python 70 | emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz] 71 | 72 | [(1116, 50), (8, 4), (4, 2), (13, 7), (32, 16), (3, 2), (26, 13), (27, 14), (5, 3), (4, 2), (4, 2), (24, 12), (9, 5), (13, 7), (53, 27), (22, 11), (7, 4), (7, 4), (4, 2), (4, 2), (9, 5), (9, 5)] 73 | ``` 74 | 75 | So now we have a specific embedding matrix for each column, and we pass the values of each column through their corresponding matrix; our model thus has one embedding per categorical column, each with its specific size ((1116, 50) ... (9, 5)). For each example, all the categorical variables are transformed into vectors (of sizes 50 ..... 5), and the total size of all of them concatenated is 185. We then append the values of the continuous variables on top of that, so the input of our model is a vector of size 201, and then we add linear layers (201x1000 and then 1000x500), batch norm and dropout; a minimal sketch of this forward pass is given after the model printout below. 76 | 77 | ```python 78 | MixedInputModel( 79 | (embs): ModuleList( 80 | (0): Embedding(1116, 50) 81 | (1): Embedding(8, 4) 82 | (2): Embedding(4, 2) 83 | (3): Embedding(13, 7) 84 | (4): Embedding(32, 16) 85 | (5): Embedding(3, 2) 86 | .....
87 | (17): Embedding(7, 4) 88 | (18): Embedding(4, 2) 89 | (19): Embedding(4, 2) 90 | (20): Embedding(9, 5) 91 | (21): Embedding(9, 5) 92 | ) 93 | (lins): ModuleList( 94 | (0): Linear(in_features=201, out_features=1000, bias=True) 95 | (1): Linear(in_features=1000, out_features=500, bias=True) 96 | ) 97 | (bns): ModuleList( 98 | (0): BatchNorm1d(1000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) 99 | (1): BatchNorm1d(500, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) 100 | ) 101 | (outp): Linear(in_features=500, out_features=1, bias=True) 102 | (emb_drop): Dropout(p=0.04) 103 | (drops): ModuleList( 104 | (0): Dropout(p=0.001) 105 | (1): Dropout(p=0.01) 106 | ) 107 | (bn): BatchNorm1d(18, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)) 108 | ``` 109 | 110 |
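To make the data flow concrete, here is a minimal sketch of the forward pass described above (this is illustrative, not the fastai implementation): each categorical column goes through its own embedding, the results are concatenated with the continuous columns, and the concatenation feeds the linear layers. The embedding sizes shown and the `n_cont` value are assumptions for the example.

```python
import torch
import torch.nn as nn

# Minimal sketch of a mixed-input forward pass (illustrative sizes, not the fastai code).
emb_szs = [(1116, 50), (8, 4), (4, 2)]   # a subset of the (cardinality, embedding size) pairs above
n_cont = 16                              # assumed number of continuous columns

embs = nn.ModuleList([nn.Embedding(c, s) for c, s in emb_szs])
lin1 = nn.Linear(sum(s for _, s in emb_szs) + n_cont, 1000)

def forward(x_cat, x_cont):
    # x_cat: (batch, n_categorical) of category ids, x_cont: (batch, n_cont) of floats
    x = torch.cat([emb(x_cat[:, i]) for i, emb in enumerate(embs)], dim=1)
    x = torch.cat([x, x_cont], dim=1)    # embeddings + continuous features
    return lin1(x)                       # then ReLU, batch norm, dropout, more linear layers...

out = forward(torch.randint(0, 4, (32, 3)), torch.randn(32, n_cont))   # (32, 1000)
```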
*(figure: the network used for the structured / tabular data model)*
111 | 112 | ## 2. NLP 113 | 114 | ### 2.1. spaCy 115 | spaCy is a relatively new package for “Industrial strength NLP in Python”. It is designed with the applied data scientist in mind, meaning it does not weigh the user down with decisions over what esoteric algorithms to use for common tasks, and it's fast. Incredibly fast (it's implemented in Cython). If we are familiar with the Python data science stack, spaCy is the numpy for NLP – it's reasonably low-level, but very intuitive and performant. 116 | 117 | spaCy provides a one-stop shop for tasks commonly used in any NLP project, including: 118 | 119 | * Tokenisation. 120 | * Lemmatisation. 121 | * Part-of-speech tagging. 122 | * Entity recognition. 123 | * Dependency parsing. 124 | * Sentence recognition. 125 | * Word-to-vector transformations. 126 | * Many convenience methods for cleaning and normalizing text. 127 | 128 | Examples: 129 | * **Lemmatisation** is the process of reducing a word to its base form (its lemma) 130 | * **Tokenising** text is the process of splitting a piece of text into words, symbols, punctuation, spaces and other elements 131 | * **Part-of-speech tagging** is the process of assigning grammatical properties (e.g. noun, verb, adverb, adjective etc.) to words 132 | * **Entity recognition** is the process of classifying named entities found in a text into pre-defined categories, such as persons, places, organizations, dates, etc. 133 | 134 | ```python 135 | import spacy 136 | nlp = spacy.load("en") 137 | doc = nlp("The big grey dog ate all of the chocolate, but fortunately he wasn't sick!") 138 | 139 | # Tokenization 140 | [token.orth_ for token in doc] 141 | ''' OUT -> ['The', 'big', 'grey', 'dog', 'ate', 'all', 'of', 'the', 'chocolate', ',', 'but', 'fortunately', 'he', 'was', "n't", ' ', 'sick', '!']''' 142 | 143 | # Lemmatization 144 | doc = nlp("practice practiced practicing") 145 | [word.lemma_ for word in doc] 146 | ''' OUT -> ['practice', 'practice', 'practice']''' 147 | 148 | # POS Tagging 149 | doc = nlp("Conor's dog's toy was hidden under the man's sofa in the woman's house") 150 | pos_tags = [(i, i.tag_) for i in doc] 151 | ''' OUT -> [(Conor, 'NNP'), ('s, 'POS'), (dog, 'NN'), ('s, 'POS'), (toy, 'NN'), (was, 'VBD'), (hidden, 'VBN'), (under, 'IN'), (the, 'DT'), (man, 'NN'), ('s, 'POS'), (sofa, 'NN'), (in, 'IN'), (the, 'DT'), (woman, 'NN'), ('s, 'POS'), (house, 'NN')]''' 152 | 153 | # Entity recognition 154 | doc = nlp("Barack Obama is an American politician") 155 | [(i, i.label_, i.label) for i in doc.ents] 156 | ''' OUT -> [(Barack Obama, 'PERSON', 346), (American, 'NORP', 347)]''' 157 | ``` 158 | 159 | ### 2.2. Data 160 | 161 | The [large movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains a collection of 50,000 reviews from IMDB. The dataset contains an even number of positive and negative reviews. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset. The dataset is divided into training and test sets, with 25,000 labelled reviews in each. 162 | 163 | The **sentiment classification task** consists of predicting the polarity (positive or negative) of a given text. However, before we try to classify the *sentiment*, we will simply try to create a *language model*; that is, a model that can predict the next word in a sentence.
This is because our model first needs to understand the structure of English before we can expect it to recognize positive vs negative sentiment. 164 | 165 | First we'll use a similar approach to the one used in image classification: pretrain a model to do one thing (predict the next word), and fine-tune it to do something else (classify sentiment). 166 | 167 | Unfortunately, given that there aren't any good pretrained language models available to download, we need to create our own. [Dataset](http://files.fast.ai/data/aclImdb.tgz) download link. 168 | 169 | ### 2.3. Preprocessing 170 | Before we can analyze the text, we must first tokenize it. This refers to the process of splitting a sentence into an array of words (or more generally, into an array of tokens). 171 | 172 | We use PyTorch's `torchtext` library to preprocess our data. First, we create a torchtext field, which describes how to preprocess a piece of text; in this case, we tell torchtext to make everything lowercase and tokenize it with `spacy` (or NLTK). 173 | 174 | ```python 175 | TEXT = data.Field(lower=True, tokenize="spacy") 176 | ``` 177 | 178 | We can then save it using `dill`, given that Python's standard `pickle` library can't handle this correctly: 179 | 180 | ```python 181 | dill.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb')) 182 | ``` 183 | 184 | **About** [Dill](https://pypi.org/project/dill/): dill extends Python's pickle module for serializing and de-serializing Python objects to the majority of the built-in Python types. Serialization is the process of converting an object to a byte stream, and the inverse is converting a byte stream back into a Python object hierarchy. 185 | 186 | We can also access the constructed vocabulary, and *numericalize* the words: 187 | 188 | ```python 189 | # 'itos': 'int-to-string' 190 | TEXT.vocab.itos[:12] 191 | >>> ['', '', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is', 'it', 'in'] 192 | 193 | TEXT.numericalize([md.trn_ds[0].text[:12]]) 194 | >>> Variable containing: 12 35 227 480 13 76 17 2 7319 769 3 2 [torch.cuda.LongTensor of size 12, 1 (GPU 0)] 195 | ``` 196 | 197 | Now we need to arrange our data into batches to feed to the model. We choose a batch size of 64 and a sequence length of about 70 (for backpropagation through time), so each batch will contain ~70 words per sequence (the exact length varies, because some randomness is added to the process); in vision we can shuffle the images, but in NLP the order is important. 198 | 199 | So we start by taking our whole dataset, tokenizing it (all the 50,000 texts, both POS and NEG), then take all the words (~20M) and group them into 64 sections, and each time we take a window of ~70 words for our training; the labels are the same batch but shifted one word to the right, given that the objective is to predict the upcoming word (see the figure and the sketch below). 200 | 201 |
*(figure: how the text is split into batches for the language model)*
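As a rough illustration of the batching scheme in the figure above (a sketch, not torchtext's actual code): the numericalized token stream is laid out as 64 parallel columns, and training windows of ~70 steps are read down those columns, with the target being the same window shifted by one token.

```python
import numpy as np

tokens = np.arange(1_000_000)               # stand-in for the ~20M numericalized tokens
bs, bptt = 64, 70

n = len(tokens) // bs                       # length of each of the 64 sections
stream = tokens[:n * bs].reshape(bs, n).T   # shape (n, 64): column k is the k-th section

i = 0                                       # first training window
x = stream[i : i + bptt]                    # inputs,  shape (70, 64)
y = stream[i + 1 : i + 1 + bptt]            # targets: the same tokens shifted one step ahead
```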
202 | 203 | 204 | If we look at the size of the batches, and display the words by sampling one example from the data loader, we get: 205 | ```python 206 | next(iter(md.trn_dl))[0].size() 207 | # Batch sizes 208 | torch.Size([71, 64]) 209 | 210 | next(iter(md.trn_dl)) 211 | # Input and target, the target is flattened for ease of computation 212 | (Variable containing: 213 | 12 567 3 ... 2118 4 2399 214 | 35 7 33 ... 6 148 55 215 | 227 103 533 ... 4892 31 10 216 | ... ⋱ ... 217 | 19 8879 33 ... 41 24 733 218 | 552 8250 57 ... 219 57 1777 219 | 5 19 2 ... 3099 8 48 220 | [torch.cuda.LongTensor of size 77,64 (GPU 0)], Variable containing: 221 | 35 222 | 7 223 | 33 224 | ⋮ 225 | 22 226 | 3885 227 | 21587 228 | [torch.cuda.LongTensor of size 4928 (GPU 0)]) 229 | ``` 230 | 231 | ### 2.4. The model 232 | We have a number of parameters to set. 233 | ```python 234 | em_sz = 200 # size of each embedding vector 235 | nh = 500 # number of hidden activations per layer 236 | nl = 3 # number of layers 237 | ``` 238 | 239 | Researchers have found that large amounts of momentum (which we'll learn about later) don't work well with these kinds of RNN models, so we create a version of the Adam optimizer with less momentum than its default of 0.9. 240 | 241 | ```python 242 | opt_fn = partial(optim.Adam, betas=(0.7, 0.99)) 243 | ``` 244 | 245 | fastai uses a variant of the state of the art [AWD LSTM Language Model](https://arxiv.org/abs/1708.02182) developed by Stephen Merity. A key feature of this model is that it provides excellent regularization through dropout. 246 | 247 | The model first contains an embedding matrix that takes as input one-hot encodings (the size of the vocabulary) and outputs a 200-dimensional embedding for each word; we then pass the 200-vector through three layers of LSTM (200x500 -> 500x500 -> 500x200), ending up with a new 200-dimensional encoding, which will hopefully correspond to the next word, so we use a linear decoder to get the probability distribution over all the words in the vocabulary, and in between we use LockedDropout() as proposed in [AWD LSTM](https://arxiv.org/abs/1708.02182). 248 | 249 | ```python 250 | class LockedDropout(nn.Module): 251 | def __init__(self): 252 | super().__init__() 253 | def forward(self, x, dropout=0.5): 254 | if not self.training or not dropout: 255 | return x 256 | m = x.data.new(1, x.size(1), x.size(2)).bernoulli_(1 - dropout) 257 | mask = Variable(m, requires_grad=False) / (1 - dropout) 258 | mask = mask.expand_as(x) 259 | return mask * x 260 | ``` 261 | 262 |
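The key point of `LockedDropout` (also called variational dropout) is that it samples a single Bernoulli mask of shape `(1, batch, features)` and broadcasts it over the sequence dimension, so a given unit is dropped at every time step of a sequence or not at all, unlike `nn.Dropout`, which samples a fresh mask for every element. A small usage sketch under that reading (the tensor sizes are the ones used in the language model above):

```python
# Usage sketch: the same dropout mask is reused across all 70 time steps.
ld = LockedDropout()
ld.train()                              # dropout is only applied in training mode

x = V(torch.randn(70, 64, 200))         # (seq_len, batch, features)
out = ld(x, dropout=0.5)                # zeroed feature positions are identical at every time step
```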
*(figure: the language model architecture)*
263 | 264 | ```python 265 | SequentialRNN( 266 | (0): RNN_Encoder( 267 | (encoder): Embedding(13458, 200, padding_idx=1) 268 | (encoder_with_dropout): EmbeddingDropout( 269 | (embed): Embedding(13458, 200, padding_idx=1) 270 | ) 271 | (rnns): ModuleList( 272 | (0): WeightDrop( 273 | (module): LSTM(200, 500, dropout=0.05) 274 | ) 275 | (1): WeightDrop( 276 | (module): LSTM(500, 500, dropout=0.05) 277 | ) 278 | (2): WeightDrop( 279 | (module): LSTM(500, 200, dropout=0.05) 280 | ) 281 | ) 282 | (dropouti): LockedDropout() 283 | (dropouths): ModuleList( 284 | (0): LockedDropout() 285 | (1): LockedDropout() 286 | (2): LockedDropout() 287 | ) 288 | ) 289 | (1): LinearDecoder( 290 | (decoder): Linear(in_features=200, out_features=13458, bias=False) 291 | (dropout): LockedDropout() 292 | )) 293 | ``` 294 | 295 | ### 2.5. Sentiment analysis 296 | 297 | Now we can use the pre-trained language model above and fine-tune it for sentiment analysis, but first we need to modify the head of the model: instead of using a linear decoder for each output in the sequence, this time the RNN is not many-to-many but many-to-one, so we only care about the end of the sequence. We add a linear classifier on top that gives us a probability P, with P > 0.5 meaning the input is POS and NEG otherwise. This last linear layer takes a 600-dimensional input, because the classifier concatenates the last output of the final LSTM layer with a max-pool and an average-pool over the whole sequence (3 x 200 = 600; see the sketch below). 298 | 299 |
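A minimal sketch of that concat pooling, assuming the final LSTM layer's outputs are 200-dimensional as above:

```python
import torch

outputs = torch.randn(70, 64, 200)      # (seq_len, batch, hidden) from the last LSTM layer

last = outputs[-1]                      # final time step,        (64, 200)
max_pool = outputs.max(dim=0)[0]        # max over the sequence,  (64, 200)
avg_pool = outputs.mean(dim=0)          # mean over the sequence, (64, 200)

features = torch.cat([last, max_pool, avg_pool], dim=1)   # (64, 600), fed to Linear(600, 1)
```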
*(figure: the sentiment classifier built on top of the pretrained language model)*
300 | 301 | And the model is: 302 | 303 | ```python 304 | SequentialRNN( 305 | (0): MultiBatchRNN( 306 | (encoder): Embedding(13458, 200, padding_idx=1) 307 | (encoder_with_dropout): EmbeddingDropout( 308 | (embed): Embedding(13458, 200, padding_idx=1) 309 | ) 310 | (rnns): ModuleList( 311 | (0): WeightDrop( 312 | (module): LSTM(200, 500, dropout=0.3) 313 | ) 314 | (1): WeightDrop( 315 | (module): LSTM(500, 500, dropout=0.3) 316 | ) 317 | (2): WeightDrop( 318 | (module): LSTM(500, 200, dropout=0.3) 319 | )) 320 | (dropouti): LockedDropout() 321 | (dropouths): ModuleList( 322 | (0): LockedDropout() 323 | (1): LockedDropout() 324 | (2): LockedDropout() 325 | )) 326 | (1): PoolingLinearClassifier( 327 | (layers): ModuleList( 328 | (0): LinearBlock( 329 | (lin): Linear(in_features=600, out_features=1, bias=True) 330 | (drop): Dropout(p=0.1) 331 | (bn): BatchNorm1d(600, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) 332 | )))) 333 | ``` 334 | 335 | And because we're fine-tuning a pretrained model, we'll use differential learning rates, and also increase the max gradient for clipping, to allow SGDR to work better. So first we fine-tune the last layer, and then unfreeze all the layers and use differential learning rates for fine-tuning. 336 | 337 | ```python 338 | m3.clip=25. 339 | lrs=np.array([1e-4,1e-4,1e-4,1e-3,1e-2]) 340 | 341 | m3.freeze_to(-1) 342 | m3.fit(lrs/2, 1, metrics=[accuracy]) 343 | m3.unfreeze() 344 | m3.fit(lrs, 1, metrics=[accuracy], cycle_len=1) 345 | ``` 346 | 347 | Using this approach, we end up with state-of-the-art results in sentiment classification. 348 | 349 | **References**: 350 | * [FastAi lecture 3 notes](https://medium.com/@hiromi_suenaga/deep-learning-2-part-1-lesson-4-2048a26d58aa) 351 | * [spaCy](https://towardsdatascience.com/a-short-introduction-to-nlp-in-python-with-spacy-d0aa819af3ad) 352 | -------------------------------------------------------------------------------- /cutting_edge_DL/lecture13.md: -------------------------------------------------------------------------------- 1 | 2 | # Lecture 13: Style Transfer 3 | 4 | 5 | - [Lecture 13: Style Transfer](#Lecture-13-Style-Transfer) 6 | - [1. Learning rate tricks / schedulers](#1-a-nameLeaningratetricksschedulersaLeaning-rate-tricks--schedulers) 7 | - [2. Inception](#2-a-nameInceptionaInception) 8 | - [3. Style transfer](#3-a-nameStyletransferaStyle-transfer) 9 | - [3.1. Content loss](#31-a-nameContentlossaContent-loss) 10 | - [3.2. Forward hooks](#32-a-nameForwardhooksaForward-hooks) 11 | - [3.3. Style loss](#33-a-nameStylelossaStyle-loss) 12 | - [3.4. Style transfer](#34-a-nameStyletransfer-1aStyle-transfer) 13 | 14 | 17 | 18 | 19 | 20 | ## 1. Learning rate tricks / schedulers 21 | 22 | There are a lot of possible learning rate schedules to use / test when experimenting; these adjustable learning rates are implemented in the fastai API using training phases. Let's go through some of them: 23 | 24 | 1- Step learning rate: used a lot when training on ImageNet, we start with a high learning rate and, after a given number of epochs, change it to a lower learning rate: 25 | 26 |
*(figure: step learning rate schedule)*
27 | 28 | So here we train for 1 epoch with a 1e-2 learning rate and then with 1e-3 for 2 epochs. 29 | ```python 30 | phases = [TrainingPhase(epochs=1, opt_fn=optim.SGD, lr = 1e-2), 31 | TrainingPhase(epochs=2, opt_fn=optim.SGD, lr = 1e-3)] 32 | ``` 33 | 34 | 2- Decaying LR in between two fixed LRs: we start with a fixed LR for a given number of iterations for exploration, then use a decaying learning rate to zoom in on a local optimum, and then fine-tune with a small fixed learning rate: 35 | 36 | `lr_i = start_lr + (end_lr - start_lr) * i/n` 37 | 38 |
*(figure: learning rate decay between two fixed learning rates)*
39 | 40 | Instead of using a linearly decaying learning rate, we can use cosine annealing: 41 | 42 | `lr_i = end_lr + (start_lr - end_lr)/2 * (1 + np.cos(i * np.pi / n))` 43 | 44 |
*(figure: cosine-annealed learning rate between two fixed learning rates)*
45 | 46 | In the same way, we can use exponential decay: 47 | 48 | `lr_i = start_lr * (end_lr/start_lr)**(i/n)` 49 | 50 |
*(figure: exponential learning rate decay)*
51 | 52 | or a polynomial decay, which gives quite good results and is not very popular in the literature: 53 | 54 | `lr_i = end_lr + (start_lr - end_lr) * (1 - i/n) ** p` 55 | 56 |
*(figure: polynomial learning rate decay)*
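The decay formulas listed above can be written directly as small scheduling functions of the iteration `i` out of `n` (a sketch using the same names as in the text, not the fastai implementation):

```python
import numpy as np

# The decay formulas quoted above, as functions of the iteration i in [0, n].
def linear_decay(i, n, start_lr, end_lr):
    return start_lr + (end_lr - start_lr) * i / n

def cosine_decay(i, n, start_lr, end_lr):
    return end_lr + (start_lr - end_lr) / 2 * (1 + np.cos(i * np.pi / n))

def exponential_decay(i, n, start_lr, end_lr):
    return start_lr * (end_lr / start_lr) ** (i / n)

def polynomial_decay(i, n, start_lr, end_lr, p=0.9):
    return end_lr + (start_lr - end_lr) * (1 - i / n) ** p
```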
57 | 58 | Another possibility, instead of decaying only down to the lower learning rate, is to decay all the way to zero and then step back up to the lower bound of the learning rate: 59 | 60 |
*(figure: learning rate decayed to zero, then stepped back up)*
61 | 62 | 3- Stochastic gradient descent with restarts: using this type of schedule we can construct a cyclic learning rate; we start with a small learning rate for a small number of iterations, then step up to the max learning rate, then use cosine annealing to step down to the lower learning rate, and repeat the same thing, this time with a larger period (a small sketch of such a schedule is given after the figure below). 63 | 64 |
*(figure: cyclic learning rate with restarts, SGDR)*
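A sketch of the cyclic schedule just described: cosine annealing with warm restarts, where each cycle is some multiple longer than the previous one. The short initial warm-up mentioned in the text is omitted, and the parameter names are illustrative, not the fastai API.

```python
import numpy as np

# Cosine annealing with restarts: anneal from max_lr down to min_lr over one cycle,
# then restart with a cycle that is `cycle_mult` times longer than the previous one.
def sgdr_schedule(n_cycles, cycle_len, cycle_mult, max_lr, min_lr):
    lrs = []
    for c in range(n_cycles):
        n = cycle_len * cycle_mult ** c
        for i in range(n):
            lrs.append(min_lr + (max_lr - min_lr) / 2 * (1 + np.cos(i * np.pi / n)))
    return lrs

lrs = sgdr_schedule(n_cycles=3, cycle_len=100, cycle_mult=2, max_lr=1e-2, min_lr=1e-4)
```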
65 | 66 | 4- One cycle: we can also use a one-cycle learning rate, with an upward ramp for about 30% of the iterations, then a downward ramp back to the base learning rate, continuing for some iterations towards even lower values, while the momentum follows the inverse schedule (see the sketch after the figure below): 67 | 68 |
*(figure: one-cycle learning rate and momentum schedules)*
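And a sketch of the one-cycle policy just described, returning the learning rate and momentum for iteration `i` of `n`; the specific bounds here are illustrative assumptions.

```python
# One-cycle sketch: LR ramps up for ~30% of the iterations, then back down past the
# base LR towards a much smaller value; momentum moves in the opposite direction.
def one_cycle(i, n, base_lr=1e-3, max_lr=1e-2, final_lr=1e-5,
              base_mom=0.95, min_mom=0.85, up_frac=0.3):
    n_up = int(n * up_frac)
    if i < n_up:                                   # upward phase
        t = i / n_up
        return base_lr + (max_lr - base_lr) * t, base_mom - (base_mom - min_mom) * t
    t = (i - n_up) / (n - n_up)                    # downward phase
    return max_lr + (final_lr - max_lr) * t, min_mom + (base_mom - min_mom) * t
```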
69 | 70 | ## 2. Inception 71 | 72 | The Inception network is quite similar to ResNet: we also pass the features coming into a given layer directly to the next layer, alongside a version of the input transformed through some convolutions. The difference is in how the two are combined: in ResNet we add the convolved version to the input, while in Inception we concatenate the convolved version to it (see the sketch below). 73 | 74 |
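A toy sketch of the two combination rules, following the description above (and including the 1x1 compression and the factorised 7x7 convolution discussed next); the channel counts are illustrative and this is not the actual Inception architecture.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# ResNet-style: add the transformed input back to the input (same channel count).
res_branch = nn.Conv2d(64, 64, kernel_size=3, padding=1)
res_out = x + res_branch(x)

# Inception-style: concatenate several branches along the channel dimension,
# e.g. a 1x1 "compression" branch and a factorised 7x7 (1x7 followed by 7x1).
branch_1x1 = nn.Conv2d(64, 32, kernel_size=1)
branch_7x7 = nn.Sequential(nn.Conv2d(64, 32, kernel_size=1),
                           nn.Conv2d(32, 32, kernel_size=(1, 7), padding=(0, 3)),
                           nn.Conv2d(32, 32, kernel_size=(7, 1), padding=(3, 0)))
inc_out = torch.cat([x, branch_1x1(x), branch_7x7(x)], dim=1)   # 64 + 32 + 32 channels
```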
*(figure: the Inception module)*
75 | 76 | We see here that we have two additional branches. The first one uses only 1x1 convolutions, which compress the input along its channel dimension. For the second additional branch, we first use a 1x1 conv to reduce the upcoming computations, and the interesting part is the usage of two asymmetric (factorised) filters: to capture a wide amount of spatial information we need to use bigger filters like 5x5 or 7x7, but these types of filters are computationally heavy; for a 7x7 filter we have 49 x number of input pixels x number of input channels operations, and in the later layers we have a large number of channels, so we would end up with a very large number of computations. This is why we rarely use kernels larger than 3x3, and if so, only at the beginning, like in ResNet. One solution is to approximate the 7x7 kernel using a 7x1 and a 1x7 filter; the representation learned by these two filters will certainly not capture everything a full 7x7 filter could, but it comes very close, helps our model avoid overfitting, and reduces the number of computations. 77 | 78 | Another important trick to use during training is progressive resizing of the images, as detailed in [Progressive Growing of GANs](https://research.nvidia.com/publication/2017-10_Progressive-Growing-of). 79 | 80 | ## 3. Style transfer 81 | 82 | The goal is to take two pictures and generate a new picture with the style of one and the content of the other. This is done through a loss function that is lower when the generated image's style and content are the ones we want. Once we have our loss function, we can start from a noise image, use the style and content images to calculate the loss, and backpropagate the error; this time, instead of changing the weights of the network, we change the values of the pixels to get the correct image. 83 | 84 |
*(figure: style transfer setup, combining the content of one image with the style of another)*
85 | 86 | The loss will contain two terms, the content loss and the style loss, with a hyperparameter as a weighting factor between content and style reconstruction: 87 | 88 | *Style transfer loss = lambda * style loss + content loss* 89 | 90 | For the content loss, one option is to compare the pixels of the noise image to those of the content picture (MSE loss) and update its pixels until the noise image becomes very similar / identical to the content image. But that's not our objective: we don't want to generate an image that is identical to the content image, but one that is similar in its content in a general manner. For this we'll use a CNN, a VGG network to be more specific, and rather than comparing the pixels of the two images, we compare their intermediate activations in the VGG network (say conv5). Given that these layers contain higher-level semantic entities that are translation invariant, we end up comparing high-level representations instead of the exact values of the pixels; this is called a **perceptual loss**. 91 | 92 | ### 3.1. Content loss 93 | 94 | We'll use imagenet samples, but given how big imagenet is we'll only take a small sample from it (it can be found [here](http://files.fast.ai/data/)). We create the path to the imagenet sample, load the VGG net and freeze it, given that we'll only use its representations to calculate the loss and not for predictions: 95 | 96 | ```python 97 | from fastai.conv_learner import * 98 | from pathlib import Path 99 | from scipy import ndimage 100 | 101 | PATH = Path('data/imagenet') 102 | PATH_TRN = PATH/'train' 103 | 104 | m_vgg = to_gpu(vgg16(True)).eval() 105 | set_trainable(m_vgg, False) 106 | ``` 107 | 108 | We then take a sample image, create some custom transforms with a resize to 288x288, and create a noise image of the same shape as the image and blur it (this gives better results; otherwise it is very hard for the optimizer to get the correct gradients for all the pixels): 109 | 110 | ```python 111 | img_fn = PATH_TRN/'n01558993'/'n01558993_9684.JPEG' 112 | img = open_image(img_fn) 113 | 114 | sz=288 115 | trn_tfms,val_tfms = tfms_from_model(vgg16, sz) 116 | img_tfm = val_tfms(img) 117 | img_tfm.shape # (3, 288, 288) 118 | 119 | opt_img = np.random.uniform(0, 1, size=img.shape).astype(np.float32) 120 | ``` 121 | 122 |
*(figure: the content image and the noise image used as a starting point)*
123 | 124 | Now we first need to pass our content image through the VGG network to get the target features, and then, at each iteration, pass the generated image (which starts out as simple noise) through the VGG, calculate the loss, backpropagate it and update the pixels. First we divide the noise image values by two so that it has roughly the same std as the content image, then add a batch axis to the image using numpy's `np.newaxis` (which is what indexing with `None` does), and specify that we want the gradients to be calculated with respect to the image: 125 | 126 | ```python 127 | opt_img = val_tfms(opt_img)/2 128 | opt_img_v = V(opt_img[None], requires_grad=True) 129 | opt_img_v.shape # torch.Size([1, 3, 288, 288]) 130 | ``` 131 | 132 | Now we need to extract only the layers we want from the VGG; we choose the 37th layer as the layer to extract the features from, then pass the content image through it (after adding the batch dimension) and get the target features: 133 | 134 | ```python 135 | m_vgg = nn.Sequential(*children(m_vgg)[:37]) 136 | targ_t = m_vgg(VV(img_tfm[None])) 137 | targ_v = V(targ_t) 138 | targ_t.shape # torch.Size([1, 512, 18, 18]) 139 | ``` 140 | 141 | We then create our loss function, which is a simple MSE loss between the VGG features of the content image and those of the generated image. We multiply it by 1000 so that the scale of the gradients is large enough to change the values of the pixels; otherwise the loss is very small and the gradients vanish (this also happens often when training with half precision). 142 | 143 | ```python 144 | def actn_loss(x): 145 | return F.mse_loss(m_vgg(x), targ_v)*1000 146 | ``` 147 | 148 | In this case, we use L-BFGS, a second-order optimization method that uses both the first derivatives (the gradient) and the second derivatives (the Hessian matrix). For efficiency, BFGS doesn't compute the Hessian directly for every possible direction, but approximates it based on differences of gradients over several iterations. The BFGS Hessian approximation can either be based on the full history of gradients, in which case it is referred to as BFGS, or it can be based only on the most recent m gradients, in which case it is known as limited-memory BFGS, abbreviated as L-BFGS. The advantage of L-BFGS is that it only requires retaining the most recent m gradients, where m is usually around 10 to 20.
149 | 150 | To use L-BFGS in PyTorch, we first create the optimizer and the iteration counter. Then, instead of writing the body of the training loop inline, we create a function that contains one step of the training loop (zero out the gradients, pass the image through the VGG to get the loss, compute the gradients and print some information), and then we loop and simply call `optimizer.step` with the step function we created (with the loss function passed as an argument via `partial`): 151 | 152 | ```python 153 | max_iter = 1000 154 | show_iter = 100 155 | optimizer = optim.LBFGS([opt_img_v], lr=0.5) 156 | 157 | def step(loss_fn): 158 | global n_iter 159 | optimizer.zero_grad() 160 | loss = loss_fn(opt_img_v) 161 | loss.backward() 162 | n_iter+=1 163 | if n_iter%show_iter==0: print(f'Iteration: {n_iter}, loss: {loss.data[0]}') 164 | return loss 165 | 166 | n_iter=0 167 | while n_iter <= max_iter: optimizer.step(partial(step,actn_loss)) 168 | ``` 169 | 170 | We can show the generated image after denormalizing it, and we can clearly see that the content we wish to obtain is apparent: 171 | 172 | ```python 173 | x = val_tfms.denorm(np.rollaxis(to_np(opt_img_v.data),1,4))[0] 174 | plt.figure(figsize=(7,7)) 175 | plt.imshow(x) 176 | ``` 177 | 178 |
*(figure: image optimized to match the content features)*
179 | 180 | ### 3.2. Forward hooks 181 | 182 | Until now, we only tried one layer from which to extract the activations; if we wanted to test different layers, we would need to create a different network containing the VGG up to each given layer, which is not very efficient. As an alternative we can use `register_forward_hook`, which, given a module, will automatically store the activations whenever the forward function of that module is called. To implement it, we create a class SaveFeatures that registers a hook; this hook is a function that will be called each time `m.forward` is executed, it receives as arguments the module in question, its inputs and its outputs, and we store the output in a features attribute to use for the loss calculations. 183 | 184 | ```python 185 | class SaveFeatures(): 186 | features=None 187 | def __init__(self, m): 188 | self.hook = m.register_forward_hook(self.hook_fn) 189 | def hook_fn(self, module, input, output): 190 | self.features = output 191 | def close(self): 192 | self.hook.remove() 193 | ``` 194 | 195 | All we need now is to add some hooks at the correct locations in the VGG network. First we get the indices of the layers we want to extract the features from, which will be the last layer before each max pooling (containing the most information of each block), and we'll then test the features of layer 32 and see if it gives better results: 196 | 197 | ```python 198 | m_vgg = to_gpu(vgg16(True)).eval() 199 | set_trainable(m_vgg, False) 200 | 201 | block_ends = [i-1 for i,o in enumerate(children(m_vgg)) 202 | if isinstance(o,nn.MaxPool2d)] 203 | block_ends # [5, 12, 22, 32, 42] 204 | 205 | sf = SaveFeatures(children(m_vgg)[block_ends[3]]) 206 | ``` 207 | 208 | We then create our noise image, create our optimizer that takes the noise image to be updated, get the target features, create the loss function, and call the optimizer step inside the loop by passing it the stepping function. 209 | 210 | ```python 211 | def get_opt(): 212 | opt_img = np.random.uniform(0, 1, size=img.shape).astype(np.float32) 213 | opt_img = scipy.ndimage.filters.median_filter(opt_img, [8,8,1]) 214 | opt_img_v = V(val_tfms(opt_img/2)[None], requires_grad=True) 215 | return opt_img_v, optim.LBFGS([opt_img_v]) 216 | 217 | opt_img_v, optimizer = get_opt() 218 | m_vgg(VV(img_tfm[None])) 219 | targ_v = V(sf.features.clone()) 220 | targ_v.shape 221 | 222 | def actn_loss2(x): 223 | m_vgg(x) 224 | out = V(sf.features) 225 | return F.mse_loss(out, targ_v)*1000 226 | 227 | n_iter=0 228 | while n_iter <= max_iter: optimizer.step(partial(step,actn_loss2)) 229 | ``` 230 | 231 | And we see that we obtain better results: 232 | 233 |
*(figure: result when using features from a later block)*
234 | 235 | ### 3.3. Style loss 236 | 237 | The objective of the style loss is to capture the recurring styles/textures in an image without tying them to its content. One way to obtain this is to compute the correlations between the different feature maps of the same layer. So to calculate the style loss, we choose a number of layers, and for each one we calculate the dot products between its feature maps (n² dot products) and normalize them; this is called the Gram matrix. If we take each feature map of a given layer l, of size (H x W), and flatten it into one vector, then to calculate the Gram matrix we need to multiply each vector with all the other vectors, so n² dot products per layer, and the entry of the Gram matrix of layer l for two feature maps (i and j) is simply the dot product between the two vectors: 238 | 239 | $$G _ { i j } ^ { l } = \sum _ { k } F _ { i k } ^ { l } F _ { j k } ^ { l }$$ 240 | 241 | For the style loss, we're going to use the last layer before each max pool of the VGG network instead of only one layer like we did for the content loss. First we load our style image (taken from Wikipedia), resize it to match the size of the content image, and apply the transformation we created earlier: 242 | 243 | ```python 244 | def scale_match(src, targ): 245 | h,w,_ = src.shape 246 | sh,sw,_ = targ.shape 247 | rat = max(h/sh,w/sw) 248 | res = cv2.resize(targ, (int(sw*rat), int(sh*rat))) 249 | return res[:h,:w] 250 | 251 | style = scale_match(img, style_img) 252 | style_tfm = val_tfms(style) 253 | ``` 254 | 255 |
*(figure: the style image)*
256 | 257 | And now we create a forward hook for each of those layers (the last layer before each max pool), then pass the style image (after adding one axis for the batch dimension) through the network and look at the saved features: 258 | 259 | ```python 260 | sfs = [SaveFeatures(children(m_vgg)[idx]) for idx in block_ends] 261 | 262 | m_vgg(VV(style_tfm[None])) 263 | targ_styles = [V(o.features.clone()) for o in sfs] 264 | [o.shape for o in targ_styles] 265 | 266 | # [torch.Size([1, 64, 288, 288]), 267 | # torch.Size([1, 128, 144, 144]), 268 | # torch.Size([1, 256, 72, 72]), 269 | # torch.Size([1, 512, 36, 36]), 270 | # torch.Size([1, 512, 18, 18])] 271 | ``` 272 | 273 | Now we're going to create our losses. First, the Gram matrix: we take our input, which is the set of feature maps of a given layer, flatten it into C vectors (here batch = 1) of size (HxW), and calculate the Gram matrix as the matrix product between these vectors and their transpose. We normalize it by the number of elements and, given how small the results are, scale them up so we get usable gradients: 274 | 275 | ```python 276 | def gram(input): 277 | b,c,h,w = input.size() 278 | x = input.view(b*c, -1) 279 | return torch.mm(x, x.t())/input.numel()*1e6 280 | ``` 281 | 282 | We pass the noise image through the network to get the features of all the layers we've selected, and the style loss is then the sum of the MSE losses between the Gram matrices of the noise image and those of the target style image, over all the layers we've chosen: 283 | 284 | ```python 285 | def gram_mse_loss(input, target): 286 | return F.mse_loss(gram(input), gram(target)) 287 | 288 | def style_loss(x): 289 | m_vgg(opt_img_v) 290 | outs = [V(o.features) for o in sfs] 291 | losses = [gram_mse_loss(o, s) for o,s in zip(outs, targ_styles)] 292 | return sum(losses) 293 | ``` 294 | 295 | And now we create our optimizer, call its step function, and show the results: 296 | 297 | ```python 298 | n_iter=0 299 | while n_iter <= max_iter: 300 | optimizer.step(partial(step,style_loss)) 301 | 302 | x = val_tfms.denorm(np.rollaxis(to_np(opt_img_v.data),1,4))[0] 303 | plt.figure(figsize=(7,7)) 304 | plt.imshow(x) 305 | ``` 306 | 307 |
*(figure: image optimized to match only the style)*
308 | 309 | We can see that we've got good results, but we can do better if we weight the losses of the later layers higher than the earlier ones, because, as we can see below (from the original paper), the later layers give better style reconstructions (a small sketch of such a weighting is given after the figure): 310 | 311 |
*(figure: style reconstructions from different layers, from the original paper)*
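A sketch of that weighting, reusing the names defined above; the weight values themselves are illustrative, not taken from the lesson.

```python
# Weight the style loss towards the later blocks, as suggested above.
style_wgts = [0.1, 0.2, 0.4, 0.8, 1.6]   # illustrative weights, one per hooked layer

def weighted_style_loss(x):
    m_vgg(opt_img_v)
    outs = [V(o.features) for o in sfs]
    losses = [w * gram_mse_loss(o, s)
              for w, o, s in zip(style_wgts, outs, targ_styles)]
    return sum(losses)
```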
312 | 313 | ### 3.4. Style transfer 314 | 315 | And now all we need to do is, in the loss calculation function, add the two losses, the style loss and the content loss, with a weighting factor (here we don't use one); we then get the optimizer and call its step function. Note that `targ_vs` holds the content targets, i.e. the features saved by the same hooks when the content image is passed through the network: 316 | 317 | ```python 318 | opt_img_v, optimizer = get_opt() 319 | sfs = [SaveFeatures(children(m_vgg)[idx]) for idx in block_ends] 320 | m_vgg(VV(img_tfm[None])); targ_vs = [V(o.features.clone()) for o in sfs]  # content targets 321 | def comb_loss(x): 322 | m_vgg(opt_img_v) 323 | outs = [V(o.features) for o in sfs] 324 | losses = [gram_mse_loss(o, s) for o,s in zip(outs, targ_styles)] 325 | cnt_loss = F.mse_loss(outs[3], targ_vs[3])*1000000 326 | style_loss = sum(losses) 327 | return cnt_loss + style_loss 328 | 329 | n_iter=0 330 | while n_iter <= max_iter: optimizer.step(partial(step,comb_loss)) 331 | ``` 332 | 333 | And we get good results: 334 | 335 | ```python 336 | x = val_tfms.denorm(np.rollaxis(to_np(opt_img_v.data),1,4))[0] 337 | plt.figure(figsize=(9,9)) 338 | plt.imshow(x, interpolation='lanczos') 339 | plt.axis('off') 340 | ``` 341 | 342 |
*(figure: the final style transfer result)*
343 | --------------------------------------------------------------------------------