├── 2020 ├── ML_workshop_example_run.ipynb ├── ML_workshop_presentation.pptx ├── Q_AND_A.md ├── data │ └── matrix_dm-fitness_features_modnames.csv ├── img │ ├── 0_ml_workflow.png │ ├── 2_data_collection.png │ ├── 2_data_processing_binarize_onehot.png │ ├── 2_data_processing_scaling.png │ ├── 2_data_table_problems.png │ ├── 2_data_train_test_split.png │ ├── 3_feat_engr_intro.png │ ├── 3_feat_engr_select.png │ ├── 3_feat_engr_select_fit_transform.png │ ├── 3_feat_engr_sources.png │ ├── 4_model_select_cv.png │ ├── 4_model_select_hyperparameter.png │ ├── 4_model_select_intro.png │ ├── 4_model_select_sklearn.png │ ├── 5_model_appl_evaluate.png │ ├── 5_model_appl_interpret.png │ ├── 5_model_appl_new_cases.png │ ├── img_clone_repository.png │ └── img_what_ml_is.png ├── models │ └── model_train_select20_rf_rnd_search.joblib └── readme.md ├── 2022 ├── ML_workshop-part_a-preparation.ipynb └── __INSTRUCTOR_ML_workshop-part_b-in_class.ipynb ├── .ipynb_checkpoints └── ML_workshop_example_run-checkpoint.ipynb ├── LICENSE ├── ML_workshop-guide.pptx ├── ML_workshop-part_a-preparation.ipynb ├── ML_workshop-part_b-in_class.ipynb ├── ML_workshop-part_c_xai.ipynb ├── ML_workshop-xai.pptx ├── __INSTRUCTOR_ML_workshop-part_b-in_class.ipynb ├── enzyme_gene.csv ├── feature_list.csv ├── feature_list.xlsx ├── img ├── img_clone_repository.png └── img_what_ml_is.png ├── readme.md └── workshop_outline.xlsx /2020/ML_workshop_presentation.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/ML_workshop_presentation.pptx -------------------------------------------------------------------------------- /2020/Q_AND_A.md: -------------------------------------------------------------------------------- 1 | # Synopsis 2 | 3 | Below are the questions raised during the Plant Biology 2020 workshop session on Wednesday 7/29/20. 
4 | 5 | ## Literature 6 | 7 | ### On using ML to investigate gene expression patterns: 8 | 9 | Depending on the specific questions one wants to ask, there are quite a few papers out there using ML to investigate gene expression. Here are some examples: 10 | 11 | * [Statistical and Machine Learning Approaches to Predict Gene Regulatory Networks From Transcriptome Datasets](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6281826/) 12 | * [The identification of cis-regulatory elements: A review from a machine learning perspective](https://pubmed.ncbi.nlm.nih.gov/26499213/) 13 | * [The DNA binding landscape of the maize AUXIN RESPONSE FACTOR family](https://pubmed.ncbi.nlm.nih.gov/30375394/) 14 | * [Cis-Regulatory Code for Predicting Plant Cell-Type Transcriptional Response to High Salinity](https://pubmed.ncbi.nlm.nih.gov/31551359/) 15 | 16 | ### On looking for traits in photos that are too subtle for the human eye (e.g. leaf shape patterns): 17 | 18 | A couple of interesting studies: 19 | 20 | * [Automated classification of tropical shrub species: a hybrid of leaf shape and machine learning approach](https://pubmed.ncbi.nlm.nih.gov/28924506/) 21 | * [Computer vision cracks the leaf code](https://pubmed.ncbi.nlm.nih.gov/26951664/) 22 | 23 | ### On identifying metabolites with metabolomics data: 24 | 25 | At the Plant Biology 2020 meeting, [Gaurav Moghe from Cornell gave a talk on this](https://www.eventscribe.com/2020/ASPB/fsPopup.asp?Mode=presInfo&PresentationID=742044&query=gaurav%20moghe). He will be a much better person to answer this question! 26 | 27 | ## What can ML do? 28 | 29 | ### Can ML be used to identify patterns in images? 30 | 31 | Serena Ghantous Lotreck : Yes! The types of tasks we’ll discuss in this workshop relate to numerical or categorical data, but there are many approaches that can be applied to image analysis. 32 | 33 | samantha.neeno : Hi Ira---yes, it is possible. 
That can be done with Deep Learning models (for example), but then we get into data quality/labeling discussions. 34 | 35 | ### Can I use ML to predict enzyme function (substrate specificity and product prediction), especially in large plant gene families? 36 | 37 | We are not experts in this area. Nonetheless, if there are high-quality labels of which enzymes catalyze which substrates AND structural features (primary, secondary, and tertiary) are available, then one can attempt to build such a model. 38 | 39 | ### Can ML be used to score plants as alive/dead? Does this provide more accurate results compared to manual scoring? 40 | 41 | If you are able to provide a dataset where you have plant images scored for “live” or “dead”, you’d definitely be able to train a model to distinguish them. I’m unaware of efforts to apply this to plant life/death scoring, but, for example, AI systems trained to identify cancer in pathology slides are better able to detect cancer than human pathologists. 42 | 43 | ## Supervised vs. Unsupervised Learning 44 | 45 | ### Why isn’t clustering a supervised algorithm? Isn’t the grouping (clustering) itself a label? 46 | 47 | You can associate labels with groups, but the algorithm isn’t learning how to label the groups; it merely uses patterns in the data to group instances. 48 | 49 | ### In my lab, I've used data from experiments with multiple predictors to try to find a regression model that best explains the data. Is this machine learning? Or would machine learning be applying this model to predict the outcome of future experiments based on known predictors? 50 | 51 | Linear regression is absolutely a type of machine learning! But it’s using the regression that you’ve found to make predictions for new instances that would constitute machine learning - finding the best regression is what we call fitting the model. 52 | 53 | ### What is the best way to identify patterns of gene expression that predict the treatment that was imposed? 
For example: do plants that were exposed to weeds exhibit different patterns of expression than those that weren’t? 54 | 55 | Given the amount of transcriptome data available, this is feasible. You need carefully curated treatment labels for the experiments. Once you have that, you can use all genes where expression levels are available as features to make predictions. For the example you mentioned, perhaps you can take an unsupervised approach, clustering the gene expression patterns based on experiments - then you can see if the weed-related experiments stand out on their own or whether they are related to some other treatments. 56 | 57 | ## I’m just getting started 58 | 59 | ### I’m just getting started with ML. What are your recommendations to get started? 60 | 61 | 0. Find a colleague who already knows machine learning - in statistics, math, and engineering, there are quite a few people with expertise. 62 | 1. Find a question relevant to your research that can be solved by building a statistical model. 63 | 2. Start with online resources like this workshop to get a basic understanding of the terms and processes. 64 | 3. Check out introductory books like [this one](https://www.cs.waikato.ac.nz/ml/weka/book.html) that do not require programming experience or, if you have some Python background, check [this one](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/) out. 65 | 4. Follow the books and, in addition to practicing on the examples provided by the book, use your own data to see if you can use the approaches in the books to solve your own problems. 66 | 5. If at all possible, take one of the many [online courses](https://www.freecodecamp.org/news/every-single-machine-learning-course-on-the-internet-ranked-by-your-reviews-3c4a7b8026c0/). 67 | 68 | ### I am working on completing my PhD in wet lab research. What can I do to also study machine learning in an efficient but effective way? 
69 | 70 | The most efficient way to learn is to work with real data as you follow along with books or tutorials. If you have a question and a dataset you’d like to use, we’d recommend flexing your new skills this way. 71 | 72 | Also see the answer to the previous question. 73 | 74 | ### How much programming do I need to use ML? 75 | 76 | You don’t need much at all if you use [Weka](https://www.cs.waikato.ac.nz/ml/weka/index.html), a powerful machine learning software package that only requires you to put data in. 77 | 78 | If you have experience working with a command-line interface, you can use [our machine learning pipeline](https://github.com/ShiuLab/ML-Pipeline), which is an end-to-end pipeline that runs on the command line. 79 | 80 | Additionally, if you have basic skills in Python, scikit-learn provides many tools to do machine learning in a fairly streamlined way. 81 | 82 | ### What options are available on the command line for your ML pipeline tool? 83 | 84 | I'd suggest that you check it out in our GitHub repository: 85 | 86 | https://github.com/ShiuLab/ML-Pipeline 87 | 88 | It is rather extensive and is our code base for multiple ML-related publications in the last five years. 89 | 90 | ## Data selection 91 | 92 | ### There are a lot of biological data online, but the culture and treatment conditions are different: Can I still use these data to train one ML model? 93 | 94 | The short answer is yes! Longer version: it is rarely the case that the environment and developmental stages match among the datasets one wants to use. The mismatch is expected to reduce the performance of your models, but you should always try. Invariably, they will provide SOME information, and ML algorithms are very good at picking relevant information out. If you are worried that you got junky info out of the model, you just need to apply the model to the test set to see if good predictions can be made. If so, it is useful. 95 | 96 | ### Do all features have to be numerical? 
97 | 98 | You can also have categorical features! There are a few differences in how we preprocess categorical data, which we’ll mention further on, but you are perfectly able to incorporate them! 99 | 100 | ## Data pre-processing and feature selection 101 | 102 | ### How do I split my data between training and test? 103 | 104 | There are two main ways: one is to randomly split the data between training and test, and the other is called stratified sampling, which tries to make sure the test and training sets are representative of the true data. 105 | 106 | ### You’ve said that we must use the same training and test sets in order to reproduce our results. If using a different split gives different results, doesn’t that call into question the quality of the analysis? 107 | 108 | I take "gives different results" to mean that the model performance changes depending on the test set. I don't think this calls the quality of the analysis into question; it is an inevitable consequence of sampling. In the best-case scenario, one would want to use a test set that is representative of the unknowns out there. This is not trivial since we only have a limited number of test cases. So what you will see is that, given that the training samples are fixed, different test sets will give you somewhat different model performance values. But if you look at enough test sets, then you will find that the average of the performance is a good estimate of the true performance. 109 | 110 | ### How can I impute missing values? 111 | 112 | One practice is to simply drop any columns or rows with missing values. If there is little missing data, this practice is fine. If not, imputation is necessary. 113 | 114 | For numerical features, it’s common to use the median of the training instances to fill in missing values for features, and for categorical features it’s common to use the most common feature value. There are other options as well! 
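The median and most-common-value strategies described above can be sketched with scikit-learn's `SimpleImputer`. This is only a minimal illustration: the toy tables and the column names `expression` and `tissue` are made up for this example and are not part of the workshop data. The imputers are fit on the training data only and then reused on the test data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one numerical and one categorical feature,
# each with missing values (np.nan).
train = pd.DataFrame({
    "expression": [1.0, np.nan, 3.0, 10.0],
    "tissue": ["leaf", "root", np.nan, "leaf"],
})
test = pd.DataFrame({
    "expression": [np.nan, 2.0],
    "tissue": [np.nan, "root"],
})

# Median for numerical features, most frequent value for categorical ones.
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

# Learn the fill values from the training set only...
train_num = num_imputer.fit_transform(train[["expression"]])
train_cat = cat_imputer.fit_transform(train[["tissue"]])

# ...then apply the same fill values to the test set.
test_num = num_imputer.transform(test[["expression"]])
test_cat = cat_imputer.transform(test[["tissue"]])

print(test_num.ravel())
print(test_cat.ravel())
```

Fitting on the training set alone keeps information from the test set out of the preprocessing step, which matches the advice to use the median of the training instances.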
115 | 116 | Julissa Ek Ramos : @MIn-Yao…I am new in ML, but when I am doing some predictions, I use bootstrapping to find the best fitted missing data. 117 | 118 | ### Does imputation skew the distribution of my original data? 119 | 120 | If there are a lot of missing values and, for example, the median is used, then the distribution will look tighter than it really is. 121 | 122 | ### Can one-hot encoding be done automatically? 123 | 124 | In Scikit-Learn, there is the OneHotEncoder class that you can call to do it. 125 | 126 | ### At what point is one-hot encoding my categorical features too much? For example, a categorical feature with 10 possible values would result in 10 very sparse features. 127 | 128 | On one hand, such practice increases the number of features, makes the model more complicated, and can lead to overfitting. On the other hand, one cannot rule out the possibility that they can be useful. Particularly considering that some ML algorithms are really good at using features with little, but useful, information, we tend to keep them unless they are too sparse (e.g. only a few percent of the values are 1 and the rest are zero, or vice versa). In those cases, we'd drop them. 129 | 130 | ## Feature selection 131 | 132 | ### Will having redundant or similar data affect the final model quality? 133 | 134 | Yes! Having several highly correlated features can actually negatively impact model performance, and makes model interpretation more difficult. That’s what makes feature selection/engineering so important! 135 | 136 | ### Is there a threshold for removing correlated features? 137 | 138 | It's hard to define a threshold for removing correlated features. One approach is to "merge" correlated features using principal component analysis (see [this post](https://medium.com/towards-artificial-intelligence/training-a-machine-learning-model-on-a-dataset-with-highly-correlated-features-debddf5b2e34)). 
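As a rough sketch of the "merging" idea, the example below builds three synthetic features, two of which are strongly correlated, and replaces them with two principal components using scikit-learn. The data and the names `feature_a`, `feature_b`, and `feature_c` are invented for illustration; in a real project the scaler and PCA would be fit on the training set only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for this sketch): feature_b is almost a
# copy of feature_a, while feature_c is independent of both.
rng = np.random.default_rng(0)
feature_a = rng.normal(size=200)
feature_b = 0.9 * feature_a + 0.1 * rng.normal(size=200)
feature_c = rng.normal(size=200)
X = np.column_stack([feature_a, feature_b, feature_c])

# Scale first so that no feature dominates only because of its units.
X_scaled = StandardScaler().fit_transform(X)

# Keep two components: the correlated pair collapses into roughly one.
pca = PCA(n_components=2)
X_pcs = pca.fit_transform(X_scaled)

print(X_pcs.shape)
print(pca.explained_variance_ratio_)
```

The resulting components are uncorrelated with each other and can be used as model features, at the cost of being harder to interpret than the original measurements.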
139 | 140 | ### In ecology, we know that several environmental variables (features) are highly correlated: for example rainfall and temperature. How can we choose the most useful of these correlated features, and how can machine learning help us? 141 | 142 | Song Li : You do PCA to reduce the correlated features to smaller number of features. We typically just use the PC1, PC2 etc as your features 143 | 144 | Serena: One thought is that you could train models with just one and combinations of the correlated group and evaluate performance to see which are the most useful. Random Forest also outputs overall feature importances, so you could take a look at those. 145 | 146 | Serena: using the PC’s as features does make model interpretation more difficult 147 | 148 | ### During feature selection, is it best practice to include or exclude information (e.g. experimental blocking, biological replicate) that I know influences the outcome/measured variable? 149 | 150 | Song Li : For Brianne, you do a regression to remove those confounding factors from your data before machine learning 151 | 152 | ### Is it good to use the same ML algorithm (for example, Random Forest) for both feature selection and regression? 153 | 154 | In this case we think that the random forest may have superior performance because we also used it to do feature selection - that is, the features selected were optimal for random forest but not for the others. However, there isn’t a rule about whether you should or shouldn’t do that! You should test many algorithms and choose the highest-performing one. 155 | 156 | ## Performance evaluation 157 | 158 | ### How can I determine how different algorithms perform on my data? 159 | 160 | In our example, r2 is our performance measure, so the value of r2 from different algorithms allows us to evaluate which algorithms perform best. 161 | 162 | ### In terms of applications of ML in biology, is the general problem model underfitting or model overfitting? 
163 | 164 | We typically do 5- or 10-fold cross-validation. In my own experience, overfitting is more of the problem, as we have high-dimensional data that’s extremely information rich. 165 | 166 | ### Is there an accepted value for performance measures (e.g. r2) in cross-validation? 167 | 168 | This depends on: 169 | 170 | 1. If there is a model established previously, whether the performance measure is better than the previous one. 171 | 2. If there is no previous model, whether it is significantly better than random expectation. 172 | 173 | ### What does the r2 value during cross-validation tell us? What is the impact of sample size on the conclusions we can draw from looking at r2? 174 | 175 | It tells us what the average r2 is after splitting the data 10 times and training 10 different models using the training data. 176 | 177 | Sample size will impact the interpretation of r2 significantly. In our case there are over 6,000 data points, so even with an r2 of 0.1, the p-value will be extremely small, so we did not worry about it. But if the sample size is small, then one should also consider what the p-value is. 178 | 179 | ## The cutting edge 180 | 181 | ### Are you guys applying methods of transfer learning? 182 | 183 | We are currently working on transfer learning methods. A manuscript following the general idea of transfer learning, using Arabidopsis metabolic gene annotation to help predict tomato ones, has recently been accepted in [In Silico Plants](https://doi-org.proxy2.cl.msu.edu/10.1093/insilicoplants/diaa005). 
184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | -------------------------------------------------------------------------------- /2020/img/0_ml_workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/0_ml_workflow.png -------------------------------------------------------------------------------- /2020/img/2_data_collection.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/2_data_collection.png -------------------------------------------------------------------------------- /2020/img/2_data_processing_binarize_onehot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/2_data_processing_binarize_onehot.png -------------------------------------------------------------------------------- /2020/img/2_data_processing_scaling.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/2_data_processing_scaling.png -------------------------------------------------------------------------------- /2020/img/2_data_table_problems.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/2_data_table_problems.png -------------------------------------------------------------------------------- /2020/img/2_data_train_test_split.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/2_data_train_test_split.png -------------------------------------------------------------------------------- /2020/img/3_feat_engr_intro.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/3_feat_engr_intro.png -------------------------------------------------------------------------------- /2020/img/3_feat_engr_select.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/3_feat_engr_select.png -------------------------------------------------------------------------------- /2020/img/3_feat_engr_select_fit_transform.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/3_feat_engr_select_fit_transform.png -------------------------------------------------------------------------------- /2020/img/3_feat_engr_sources.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/3_feat_engr_sources.png -------------------------------------------------------------------------------- /2020/img/4_model_select_cv.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/4_model_select_cv.png -------------------------------------------------------------------------------- /2020/img/4_model_select_hyperparameter.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/4_model_select_hyperparameter.png -------------------------------------------------------------------------------- /2020/img/4_model_select_intro.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/4_model_select_intro.png -------------------------------------------------------------------------------- /2020/img/4_model_select_sklearn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/4_model_select_sklearn.png -------------------------------------------------------------------------------- /2020/img/5_model_appl_evaluate.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/5_model_appl_evaluate.png -------------------------------------------------------------------------------- /2020/img/5_model_appl_interpret.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/5_model_appl_interpret.png -------------------------------------------------------------------------------- /2020/img/5_model_appl_new_cases.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/5_model_appl_new_cases.png -------------------------------------------------------------------------------- /2020/img/img_clone_repository.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/img_clone_repository.png -------------------------------------------------------------------------------- /2020/img/img_what_ml_is.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/img/img_what_ml_is.png -------------------------------------------------------------------------------- /2020/models/model_train_select20_rf_rnd_search.joblib: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/2020/models/model_train_select20_rf_rnd_search.joblib -------------------------------------------------------------------------------- /2020/readme.md: -------------------------------------------------------------------------------- 1 | # How machine learning can be used to solve plant biology problems 2 | 3 | ![alt text](./img/img_what_ml_is.png) 4 | 5 | ## Workshop information 6 | 7 | ### What this is about 8 | 9 | More and more data are available in plant science that have fueled ground breaking discoveries. Beyond the original intents of the experiments, these data can be used to discover even more. This is where machine learning comes in. We can use computers to learn from data and generate models that can predict a biological phenomenon of interest - e.g. will this gene be lethal when it is knocked-out, or which genetic variants can meaningfully predict a phenotype of interests. Specifically, this workshop will touch on the following topics: What is machine learning and why is it useful? How does machine learning work? What are some example machine learning applications in plant science? How can we feed data into machine learning tools to make discoveries? What are the best practices when doing machine learning? 
Where to go to learn more? The workshop will include presentations, discussions, and a short hands-on section using online machine learning resources. 10 | 11 | ### When and where 12 | 13 | The last time we offered the workshop: 14 | * 3:30-4:30pm, 7/29/2020 15 | * PlantBiology 2020 Worldwide Summit - Online 16 | * [Event link](https://www.eventscribe.com/2020/ASPB/fsPopup.asp?Mode=presInfo&PresentationID=742105) 17 | 18 | ### Who we are 19 | 20 | * Shin-Han Shiu: Departments of Plant Biology and Comp. Math., Sci., & Engr., Michigan State University 21 | * Serena Lotreck: Department of Plant Biology and Comp. Math., Sci., & Engr., Michigan State University 22 | * Our lab [website](https://shiulab.github.io/) 23 | 24 | ### What kinds of materials we are sharing 25 | 26 | There are two main documents: 27 | 1. [Workshop presentation slides](https://github.com/ShiuLab/ML_workshop/blob/master/ML_workshop_presentation.pptx) 28 | 2. [Workshop jupyter notebook](https://github.com/ShiuLab/ML_workshop/blob/master/ML_workshop_example_run.ipynb) 29 | 30 | ## Instructions for running the notebook 31 | 32 | * Developed by Christina Azodi for the [2019 workshop](https://github.com/azodichr/ML-Pipeline/tree/master/Workshop). 33 | * Modified for the 2020 workshop by Shinhan Shiu. 34 | 35 | ### What's needed 36 | 37 | The workshop example is provided as a [Jupyter notebook](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html). It is a document generated by the [Jupyter Lab or Jupyter Notebook applications](https://jupyter.org/install.html). A notebook can contain both computer code in popular languages such as Python and R, and text in the form of paragraphs, equations, figures, links, etc. 38 | 39 | To follow what we have shown in the workshop, you need the following: 40 | * Git: for you to "clone" this repository to your computer to play with. 41 | * Jupyter Lab: the application to view, edit, and execute code in the notebook. 
42 | * Scikit-Learn and others: the software packages Jupyter Lab relies on to run the code. 43 | 44 | ### Install Git and clone ml_workshop 45 | 46 | [GitHub](https://github.com/) is a code hosting platform for version control (i.e., keeping track of updates to code) and collaboration (i.e., many people can work on the same code). We have put the workshop materials in a GitHub repository called [ML_workshop]() 47 | 48 | 1. Create a [GitHub Account](https://github.com/join) 49 | 2. Download and install [GitHub Desktop](https://desktop.github.com/) 50 | * Or you can use [Git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) if you are familiar with version control and the command-line interface. Note that the following info is for using GitHub Desktop. 51 | 3. Clone the ML_workshop repository by following [these instructions](https://docs.github.com/en/desktop/contributing-and-collaborating-using-github-desktop/cloning-and-forking-repositories-from-github-desktop) and the following screenshot. 52 | * __Note:__ You can specify where the repository goes on your computer. We suggest leaving it as the default and remembering where it is - we will need it later. 53 | 54 | ![alt text](./img/img_clone_repository.png) 55 | 56 | ### Install Anaconda 57 | 58 | [Anaconda](https://www.anaconda.com/) is a free and open-source distribution of the programming languages Python and R and is a widely used platform for computational and data science applications. 59 | 60 | 1. Download the Python 3.X version of [Anaconda](https://www.anaconda.com/products/individual#Downloads). Currently, it is Python 3.7. 61 | 2. Install Anaconda using the [instructions](https://docs.anaconda.com/anaconda/install/). 62 | 3. Open your terminal in [Mac](https://support.apple.com/guide/terminal/open-or-quit-terminal-apd5265185d-f365-44cb-8b09-71a064a42125/mac) or [PC](https://www.wikihow.com/Open-Terminal-in-Windows) 63 | * __Note__: For PC, you need to open the terminal by "Running as Administrator". 
If you are not familiar with this, see [this post](https://www.itechtics.com/run-programs-administrator/) for more info. 64 | 65 | 4. Issue the following command to make sure the Anaconda installation is complete: 66 | ``` 67 | conda list 68 | ``` 69 | The above command allows you to see what software packages have been installed. 70 | 71 | ### Install software packages 72 | 73 | [Conda](https://docs.conda.io/en/latest/) is a package/environment management system. It deals with installing software packages on your computer. It also creates and manages virtual environments, where each environment holds a specific set of software for a general category of tasks. 74 | 75 | 1. Create an ml environment and activate it: 76 | ``` 77 | conda create -n ml 78 | conda activate ml 79 | ``` 80 | 81 | 2. Install software packages and their dependencies: 82 | ``` 83 | conda install jupyterlab matplotlib nb_conda pandas scikit-learn 84 | ``` 85 | 86 | ### Navigate to ML_workshop tutorial 87 | 88 | 1. Run Jupyter Lab 89 | 90 | * If you use Mac: 91 | 92 | ``` 93 | jupyter lab 94 | ``` 95 | * If you use PC, and your GitHub folder is in __C:/__ drive, then do: 96 | ``` 97 | jupyter lab --notebook-dir=C:/ 98 | ``` 99 | * If you use PC and your GitHub folder is in __D:/__ drive, do: 100 | ``` 101 | jupyter lab --notebook-dir=D:/ 102 | ``` 103 | 104 | 2. In the Jupyter Lab window that opens, on the left panel, navigate to __ML_workshop__, the directory where the cloned GitHub repository is stored. 105 | 106 | 3. Open ML_workshop_example_run.ipynb 107 | 108 | 4. Run each code element by pressing ```SHIFT + ENTER```. 
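Before opening the notebook, it may help to confirm that the packages installed above can actually be imported from the `ml` environment. The snippet below is a small sketch of such a check; the helper name `check_packages` is made up for this example. Note that scikit-learn is imported under the module name `sklearn`.

```python
import importlib


def check_packages(names):
    """Try to import each module; map its name to a version string or None."""
    status = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            status[name] = getattr(module, "__version__", "unknown version")
        except ImportError:
            # The module is not available in the active environment.
            status[name] = None
    return status


# Import names of the packages requested in the conda install step.
required = ["jupyterlab", "matplotlib", "pandas", "sklearn"]
for name, version in check_packages(required).items():
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

If any package is reported as not installed, make sure the `ml` environment is active and rerun the `conda install` step.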
109 | -------------------------------------------------------------------------------- /2022/ML_workshop-part_a-preparation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "---\n", 8 | "# __Machine Learning Part-A: Pre-Workshop Preparation__" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "## Learning objectives\n", 16 | "\n", 17 | "At the end of the exercise, you should be able to:\n", 18 | "- Distinguish unsupervised and supervised learning.\n", 19 | "- Distinguish regression and classification tasks.\n", 20 | "- Explain the major steps in a machine learning project.\n", 21 | "- Check that the necessary software packages are installed." 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "----\n", 29 | "\n", 30 | "# 1. Introduction to Machine Learning" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "✅ **DO THIS:** Please watch the following two videos, which introduce many key concepts in machine learning.\n", 38 | "\n", 39 | "To run the code cell, press `SHIFT`+`ENTER`." 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 2, 45 | "metadata": {}, 46 | "outputs": [ 47 | { 48 | "data": { 49 | "image/jpeg": "", 50 | "text/html": [ 51 | "\n", 52 | " \n", 59 | " " 60 | ], 61 | "text/plain": [ 62 | "" 63 | ] 64 | }, 65 | "execution_count": 2, 66 | "metadata": {}, 67 | "output_type": "execute_result" 68 | } 69 | ], 70 | "source": [ 71 | "from IPython.display import YouTubeVideo\n", 72 | "YouTubeVideo(\"cfj6yaYE86U\",width=640,height=360)\n" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "✅ **QUESTION:** Answer the following:\n", 80 | "1. What does train a model mean?\n", 81 | "1. 
What are the differences between attribute, feature, and observation?\n", 82 | "1. What is the difference between Supervised and Unsupervised machine learning?" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | " Double click this cell and put your answer to the above questions below:\n", 90 | "\n", 91 | "1. Your answer\n", 92 | "2. Your answer\n", 93 | "3. Your answer" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 3, 99 | "metadata": {}, 100 | "outputs": [ 101 | { 102 | "data": { 103 | "image/jpeg": "", 104 | "text/html": [ 105 | "\n", 106 | " \n", 113 | " " 114 | ], 115 | "text/plain": [ 116 | "" 117 | ] 118 | }, 119 | "execution_count": 3, 120 | "metadata": {}, 121 | "output_type": "execute_result" 122 | } 123 | ], 124 | "source": [ 125 | "from IPython.display import YouTubeVideo\n", 126 | "YouTubeVideo(\"-_XYmr4vkwc\",width=640,height=360)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "✅ **QUESTION:** What is the difference between Regression and Classification models?" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | " Put your answer to the above question here" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": { 146 | "tags": [] 147 | }, 148 | "source": [ 149 | "----\n", 150 | "\n", 151 | "\n", 152 | "# 2. Machine learning workflow\n", 153 | "\n", 154 | "Below are the major steps in a machine learning project. Please go through them and bring questions.\n", 155 | "\n", 156 | "|Step|Purpose|Notes|\n", 157 | "|---|---|---|\n", 158 | "|1|State the question and frame it as a machine learning problem|Is it a supervised or unsupervised learning problem? 
There are other major types (semi-supervised and reinforcement learning).|\n", 159 | "|2|Collect data and conduct exploratory data analysis (EDA)|Apply four types of EDA.|\n", 160 | "|3|Split training and testing data|Separate out testing data for evaluation purposes.|\n", 161 | "|4|Engineer features| - Format data, impute missing values, normalize/scale features.
 - Transform feature values, select informative features.
- Create combinatorial features (e.g., the interaction terms for basic linear model).
 - Do dimensionality reduction to reduce the number of features.|\n", 162 | "|5|Select models| - There are many models/algorithms/classifiers making different assumptions about how modeling should be done. Thus, it is important to try a wide range of them.
 - For each model/algorithm, we need to tune __hyperparameters__, i.e., parameters that are set manually before training rather than learned from the data.
- Cross-validate model, i.e., separate data into training/validation subsets, use the training subset to train a model and score the model with the validation subset. Then split the training/validation subset again but differently, train and score. Repeat the process $k$ times.|\n", 163 | "|6|Repeat 2-5|Work to iteratively improve the models.|\n", 164 | "|7|Evaluate the best performing model|- After a best performing model is identified based on the training data, the model is applied to the testing set.
 - Various model performance metrics can be used to assess how good the model is.|\n", 165 | "|8|Interpret the model|- Dissect the model to better understand how it works.
- Identify most informative features for the best performing model.
 - Assess the reasons behind false predictions to further improve the model.|\n", 166 | "|9|Deploy model|Apply model to new data and make predictions.|" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "----\n", 174 | "\n", 175 | "# 3. Setting up for the workshop" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "## 3.1 Setting up required packages\n", 183 | "\n", 184 | "✅ **DO THIS:** In Jupyter, you can execute a terminal command by adding `!` in front of the command. Let's use this method to install the required packages:\n", 185 | "\n", 186 | "- Create a new conda environment called `ml_workshop`.\n", 187 | "- Activate the environment.\n", 188 | "- Install `sklearn`, `pandas`, `matplotlib`, `seaborn`, `imblearn`, `numpy`, `shap`, `ipywidgets`, `tqdm` by following [the workshop instructions](https://github.com/ShiuLab/ML_workshop#24-install-software-packages)." 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": 2, 194 | "metadata": {}, 195 | "outputs": [], 196 | "source": [ 197 | "# Test your install\n", 198 | "import sklearn, pandas, matplotlib, seaborn, imblearn, numpy, shap, tqdm" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "__Note__: If you encounter `ModuleNotFoundError`:\n", 206 | "- Close the terminal window.\n", 207 | "- Close your browser with the Jupyter Lab instance.\n", 208 | "- Open the terminal window, and follow the earlier instructions for opening this notebook.\n", 209 | "- Re-run the above code cell.\n", 210 | "\n", 211 | "If that still doesn't work, don't fret; we will see if we can fix this before the workshop."
212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "## 3.2 Where the data is from\n", 219 | "\n", 220 | "We will be using a dataset `enzyme_gene.csv` from a [PNAS paper](https://pubmed.ncbi.nlm.nih.gov/30674669/). Let's use this dataset to illustrate how an end-to-end machine learning project is done." 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "✅ **QUESTION:** Read the abstract for the paper. In the space below, state the biological problem as a machine learning problem." 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | " Put your answer to the above question here" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "---------\n", 242 | "### Congratulations, we're done for now. Will see you in the workshop!" 243 | ] 244 | } 245 | ], 246 | "metadata": { 247 | "anaconda-cloud": {}, 248 | "kernelspec": { 249 | "display_name": "Python 3.10.4 ('bert_finetune': conda)", 250 | "language": "python", 251 | "name": "python3" 252 | }, 253 | "language_info": { 254 | "codemirror_mode": { 255 | "name": "ipython", 256 | "version": 3 257 | }, 258 | "file_extension": ".py", 259 | "mimetype": "text/x-python", 260 | "name": "python", 261 | "nbconvert_exporter": "python", 262 | "pygments_lexer": "ipython3", 263 | "version": "3.10.4" 264 | }, 265 | "vscode": { 266 | "interpreter": { 267 | "hash": "323c618d0395b34183a36199d7c8eddbd4e55d51aee9dabbfbc9809db817fb2b" 268 | } 269 | } 270 | }, 271 | "nbformat": 4, 272 | "nbformat_minor": 4 273 | } 274 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Shiu Lab 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and 
associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /ML_workshop-guide.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/ML_workshop-guide.pptx -------------------------------------------------------------------------------- /ML_workshop-part_b-in_class.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "---\n", 8 | "# __Machine Learning Workshop Part-B__" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": { 14 | "tags": [] 15 | }, 16 | "source": [ 17 | "## ___Learning objectives___\n", 18 | "\n", 19 | "At the end of the exercise, you should be able to:\n", 20 | "- Conduct an end-to-end classification analysis given data\n", 21 | "- Interpret model with example global and local 
interpretation methods." 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "# First things first: test your install\n", 31 | "import sklearn, pandas, matplotlib, seaborn, imblearn, numpy, shap, tqdm\n", 32 | "\n", 33 | "# If you encounter an error, talk to the instructors and/or your neighbor ASAP" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## ___Outline___\n", 41 | "\n", 42 | "- Session 1\n", 43 | " - Intro to machine learning\n", 44 | " - [Part A review](#review)\n", 45 | " - [Step 1. Define ML problem](#step1)\n", 46 | " - [Step 2. Exploratory data analysis](#step2)\n", 47 | " - [Step 3: Split train/test](#step3)\n", 48 | "- Session 2\n", 49 | " - [Step 4: Feature engineering](#step4)\n", 50 | " - [Step 5: Select model](#step5)\n", 51 | " - [Step 6: Repeat 2-5](#step6)\n", 52 | "- Session 3\n", 53 | " - [Step 7: Evaluate model](#step7)\n", 54 | " - [Step 8: Interpret model](#step8)\n", 55 | " - Workshop wrap-up" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": { 61 | "tags": [] 62 | }, 63 | "source": [ 64 | "----\n", 65 | "\n", 66 | "\n", 67 | "## ___Part A review___\n", 68 | "\n", 69 | "✅ **QUESTION:** In __Part A__, we have gone through the major steps. __In the next 5 minutes__, discuss with your neighbors: What is your understanding of each step? Is there any step or term that is confusing to you? " 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "### ⛺ **PAUSE: once you finish, please turn your attention to the instructor. **" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "----\n", 84 | "\n", 85 | "## ___Step 1. 
Define ML problem___\n", 86 | "\n", 87 | "___Problem statement___: \n", 88 | "- Given:\n", 89 | " - Known genes in general metabolism (GM) or specialized metabolism (SM)\n", 90 | " - Various features of genes\n", 91 | " - E.g., expression levels, functional category they belong to \n", 92 | "- How can we use them to:\n", 93 | " - Distinguish genes involved in GM from those involved in SM?\n" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "✅ **QUESTION:** __In the next two minutes__, discuss with your neighbors: given a list of features in [this table](feature_list.csv), what features should we use to best distinguish GM and SM genes?" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": {}, 106 | "source": [ 107 | "### ⛺ **PAUSE: once you finish, please turn your attention to the instructor. **" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "----\n", 115 | "\n", 116 | "## __Step 2. Exploratory data analysis (EDA)__" 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "✅ **DO THIS:** Run the following cell to load the data. Note that we set a random seed (`rand_seed`). This lets us reproduce the random data generated along the way, so others can repeat the same analysis and get similar, if not identical, results."
124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": null, 129 | "metadata": {}, 130 | "outputs": [], 131 | "source": [ 132 | "import pandas as pd\n", 133 | "\n", 134 | "rand_seed = 20240507\n", 135 | "\n", 136 | "# Load the dataset from the file enzyme_gene.csv as a Pandas DataFrame where the\n", 137 | "# `Gene` column is read in as index.\n", 138 | "enzyme_gene = pd.read_csv(\"enzyme_gene.csv\", index_col=0)\n", 139 | "\n", 140 | "# Get all feature names: every column after the first (`Label`) column\n", 141 | "features = enzyme_gene.columns[1:] \n", 142 | "\n", 143 | "# Print out 4 randomly sampled instances. No random_state is passed, so the\n", 144 | "# rows shown will differ from run to run.\n", 145 | "enzyme_gene.sample(4)" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "✅ **QUESTION:** __In the next minute__, discuss with your neighbors: which features do you think will be the best for distinguishing GM and SM genes? For what the feature names mean, see [this spreadsheet](https://docs.google.com/spreadsheets/d/1VjnlJgKsGC8GNNds7VtbYfjAQpjETkZNI6AlF9c2DAY/edit?usp=sharing). Also, put your name down for the feature you think will be the best!" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "### ⛺ **PAUSE: once you finish, please turn your attention to the instructor. **" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "### ___2.1 Univariate EDA___" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "✅ **DO THIS:** Earlier, we used `enzyme_gene.sample(4)` to get a few samples from the `enzyme_gene` dataframe. Replace `sample(4)` with `describe()` and see what information becomes available."
174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": {}, 180 | "outputs": [], 181 | "source": [ 182 | "# Put your code here\n", 183 | "# If you are not sure how to proceed, look at the answer in the following cell.\n" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "##ANSWER##\n", 193 | "# Generate summary statistics for each feature\n", 194 | "enzyme_gene.describe()\n", 195 | "##ANSWER##" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "✅ **DO THIS:** Tell us more about `enzyme_gene` by using `shape` and `nunique()`:" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "# put your code here\n", 212 | "\n", 213 | "# Display the # of rows and columns\n", 214 | "print(\"\\nShape:\", enzyme_gene.shape)\n", 215 | "\n", 216 | "# Display the # of unique values in each feature\n", 217 | "print(\"\\n### Number unique:\\n\", enzyme_gene.nunique())" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "✅ **DO THIS:** Run the following to determine how many entries have `GM`, `SM`, and `Unknown` labels, respectively." 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "metadata": {}, 231 | "outputs": [], 232 | "source": [ 233 | "print(enzyme_gene[\"Label\"].value_counts())" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "✅ **DO THIS:** Determine the number of `null` entries in each column and print them out."
241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": null, 246 | "metadata": {}, 247 | "outputs": [], 248 | "source": [ 249 | "enzyme_gene.isnull().sum()" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "### ___2.2 Univariate graphical EDA___" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "✅ **DO THIS:** Plot the histograms of all features." 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "metadata": {}, 270 | "outputs": [], 271 | "source": [ 272 | "# The 1st \":\" is to get all rows. The \"1:\" part is to get the 2nd column and on.\n", 273 | "feature_values = enzyme_gene.iloc[:, 1:]\n", 274 | "\n", 275 | "# Draw histogram\n", 276 | "hist = feature_values.hist(figsize=(12,12), bins=50)\n" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "✅ **QUESTION:** __In the next 1 minute__, discuss with your neighbors: did you see any issues with the dataset based on the above analyses?" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "### ⛺ **PAUSE: once you finish, please turn your attention to the instructor. 
**" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "### ___2.3 Multi-variate non-graphical EDA___" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "✅ **DO THIS:** Determine pairwise Spearman's rank correlations of all features:" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": null, 310 | "metadata": {}, 311 | "outputs": [], 312 | "source": [ 313 | "# Calculate Spearman's rank correlations for all feature pairs\n", 314 | "corr = feature_values.corr(method = 'spearman')\n", 315 | "corr" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "metadata": {}, 321 | "source": [ 322 | "### ___2.4 Multi-variate graphical EDA___" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "✅ **DO THIS:** Plot the pairwise Spearman's rank correlations of all features as a heatmap using Seaborn." 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "metadata": {}, 336 | "outputs": [], 337 | "source": [ 338 | "import seaborn as sns\n", 339 | "import matplotlib.pyplot as plt\n", 340 | "\n", 341 | "plt.figure(figsize=(7,6))\n", 342 | "sns.heatmap(corr, cmap=\"coolwarm\")" 343 | ] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "metadata": {}, 348 | "source": [ 349 | "✅ **QUESTION:** __In the next 1 minute__, discuss with your neighbors: did you see any issues with the dataset based on the univariate analyses?" 350 | ] 351 | }, 352 | { 353 | "cell_type": "markdown", 354 | "metadata": {}, 355 | "source": [ 356 | "### ⛺ **PAUSE: once you finish, please turn your attention to the instructor. 
**" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "---\n", 364 | "\n", 365 | "## __Step 3: Split train/test__\n", 366 | "\n", 367 | "The best practice is always to set aside testing data __as the FIRST STEP__ or as early as possible. This way, the testing data is truly independent from any of the modeling process aside from the processing steps needed. Nonetheless, there may be things you need to take care of first before splitting the data. In this example, we need to get rid of unwanted instances before splitting the data." 368 | ] 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "metadata": {}, 373 | "source": [ 374 | "\n", 375 | "### ___3.1 Deal with unwanted instances___\n", 376 | "\n", 377 | "✅ **DO THIS:** Filter data so only instances with `SM` and `GM` labels are kept." 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": null, 383 | "metadata": {}, 384 | "outputs": [], 385 | "source": [ 386 | "labels = ['GM', 'SM']\n", 387 | "label_column = enzyme_gene['Label']\n", 388 | "label_column_filter = label_column.isin(labels)\n", 389 | "\n", 390 | "# enzyme_gene dataframe with only GM and SM (a copy, so it can be modified later)\n", 391 | "enzyme_gene_fil = enzyme_gene[label_column_filter].copy()\n", 392 | "\n", 393 | "# Count the occurrences of unique values\n", 394 | "enzyme_gene_fil['Label'].value_counts()" 395 | ] 396 | }, 397 | { 398 | "cell_type": "markdown", 399 | "metadata": {}, 400 | "source": [ 401 | "✅ **DO THIS:** For classification tasks, class values are typically integers instead of text (like SM or GM here). So, we will convert `GM` and `SM` to 0 and 1, respectively. \n", 402 | "\n", 403 | "Write code at the end to show that the encoding is working."
404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": null, 409 | "metadata": {}, 410 | "outputs": [], 411 | "source": [ 412 | "# import the preprocessing functions\n", 413 | "from sklearn import preprocessing\n", 414 | "\n", 415 | "# Create a LabelEncoder object: this is simply a software tool that turns \n", 416 | "# (encodes) texts into 0 or 1 (labels) in this case.\n", 417 | "le = preprocessing.LabelEncoder()\n", 418 | "\n", 419 | "# Send the Label column of enzyme_gene_fil dataframe to the LabelEncoder so\n", 420 | "# it can fit (i.e., learn) how to encode the labels.\n", 421 | "le.fit(enzyme_gene_fil.Label)\n", 422 | "\n", 423 | "# Now, use the fitted (learned) encoder to transform texts to labels\n", 424 | "enzyme_gene_fil['Label'] = le.transform(enzyme_gene_fil.Label)\n", 425 | "\n", 426 | "# Write code below to show that the Label encoding is working: 0s and 1s\n", 427 | "\n", 428 | "\n", 429 | "# If you are not sure how to proceed, look at the answer in the following cell." 430 | ] 431 | }, 432 | { 433 | "cell_type": "code", 434 | "execution_count": null, 435 | "metadata": {}, 436 | "outputs": [], 437 | "source": [ 438 | "##ANSWER##\n", 439 | "enzyme_gene_fil['Label'].value_counts()\n", 440 | "##ANSWER##" 441 | ] 442 | }, 443 | { 444 | "cell_type": "markdown", 445 | "metadata": {}, 446 | "source": [ 447 | "### ⛺ **PAUSE: once you finish, please turn your attention to the instructor. **" 448 | ] 449 | }, 450 | { 451 | "cell_type": "markdown", 452 | "metadata": {}, 453 | "source": [ 454 | "### ___3.2 Split training/testing sets___\n", 455 | "\n", 456 | "✅ **DO THIS:** Let's split the training and testing data. \n", 457 | "\n", 458 | "Please comment on the lines as indicated."
459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": null, 464 | "metadata": {}, 465 | "outputs": [], 466 | "source": [ 467 | "from sklearn.model_selection import train_test_split\n", 468 | "\n", 469 | "# Split dataset into training and testing sets\n", 470 | "train, test = train_test_split(\n", 471 | " enzyme_gene_fil, # The data to split\n", 472 | " test_size=0.2, # Proportion of data for testing\n", 473 | " stratify=enzyme_gene_fil.Label, # Make sure proportions of 0/1\n", 474 | " # labels are similar between\n", 475 | " # training and testing sets\n", 476 | " random_state=rand_seed)\n", 477 | "\n", 478 | "# Print out proportions of different labels in the training data\n", 479 | "print(train['Label'].value_counts()/train.shape[0])\n", 480 | "\n", 481 | "# Print out proportions of different labels in the testing data\n", 482 | "print(test['Label'].value_counts()/test.shape[0])" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": {}, 488 | "source": [ 489 | "✅ **QUESTION:** In __the next 2 minutes__, discuss with your neighbor: what's the point of splitting training and testing data again? How big should the testing data be?" 490 | ] 491 | }, 492 | { 493 | "cell_type": "markdown", 494 | "metadata": {}, 495 | "source": [ 496 | "### ⛺ **PAUSE: once you finish, please turn your attention to the instructor. **" 497 | ] 498 | }, 499 | { 500 | "cell_type": "markdown", 501 | "metadata": {}, 502 | "source": [ 503 | "---\n", 504 | "\n", 505 | "## __Step 4: Feature engineering__\n", 506 | "\n", 507 | "Feature engineering involves processing, transforming, selecting, and combining features in ways that will improve the model. " 508 | ] 509 | }, 510 | { 511 | "cell_type": "markdown", 512 | "metadata": {}, 513 | "source": [ 514 | "### ___4.1 Deal with missing data___\n", 515 | "\n", 516 | "We will deal with this in just one way. __In reality__, you need to try multiple approaches to see how you can get the best results."
517 | ] 518 | }, 519 | { 520 | "cell_type": "markdown", 521 | "metadata": {}, 522 | "source": [ 523 | "✅ **DO THIS:** First let's remind ourselves how many instances there are after we get rid of `Unknown`." 524 | ] 525 | }, 526 | { 527 | "cell_type": "code", 528 | "execution_count": null, 529 | "metadata": {}, 530 | "outputs": [], 531 | "source": [ 532 | "enzyme_gene_fil.shape" 533 | ] 534 | }, 535 | { 536 | "cell_type": "markdown", 537 | "metadata": {}, 538 | "source": [ 539 | "✅ **DO THIS:** Let's drop any rows with >25% missing values and see how many instances are still there." 540 | ] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "execution_count": null, 545 | "metadata": {}, 546 | "outputs": [], 547 | "source": [ 548 | "# ask which values are null.\n", 549 | "row_na = enzyme_gene_fil.isnull()\n", 550 | "row_na" 551 | ] 552 | }, 553 | { 554 | "cell_type": "code", 555 | "execution_count": null, 556 | "metadata": {}, 557 | "outputs": [], 558 | "source": [ 559 | "# count the missing values (null, NA, also called NaN: Not a Number) of each row\n", 560 | "row_na_num = train.isnull().sum(axis=1)\n", 561 | "row_na_num[:5]" 562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": null, 567 | "metadata": {}, 568 | "outputs": [], 569 | "source": [ 570 | "num_feat = train.shape[1] - 1 # number of features in the data\n", 571 | "rows_to_keep = row_na_num/num_feat < 0.25 # rows with <25% missing values\n", 572 | "train_keep = train[rows_to_keep] # training data with rows to keep \n", 573 | "train_keep['Label'].value_counts()" 574 | ] 575 | }, 576 | { 577 | "cell_type": "markdown", 578 | "metadata": {}, 579 | "source": [ 580 | "✅ **DO THIS:** A lot of data is removed, but this is much better than just dropping any row with missing values. Next, let's try to impute the missing values with `KNNImputer`."
581 | ] 582 | }, 583 | { 584 | "cell_type": "code", 585 | "execution_count": null, 586 | "metadata": {}, 587 | "outputs": [], 588 | "source": [ 589 | "from sklearn.impute import KNNImputer\n", 590 | "\n", 591 | "# Create an imputer object to impute our data\n", 592 | "# n_neighbors is the number of neighbors used to estimate the missing values.\n", 593 | "imputer = KNNImputer(n_neighbors=5)\n", 594 | "\n", 595 | "# Train the imputer with training data\n", 596 | "imputer.fit(train_keep)\n", 597 | "\n", 598 | "# Transform missing values into imputed values, hence train_keep_imp (imputed)\n", 599 | "train_keep_imp = imputer.transform(train_keep)\n", 600 | "\n", 601 | "# The thing with KNNImputer is it returns train_keep_imp as a NumPy array, so\n", 602 | "# we don't have the column names any more. Because I really want to know what \n", 603 | "# these columns are, let's turn this back into a DataFrame with column names.\n", 604 | "train_keep_imp = pd.DataFrame(train_keep_imp, columns=train.columns)\n", 605 | "train_keep_imp.sample(4)" 606 | ] 607 | }, 608 | { 609 | "cell_type": "markdown", 610 | "metadata": {}, 611 | "source": [ 612 | "✅ **CAUTION:** There is a hyperparameter here: `n_neighbors`. We set it to 5 here but __in reality__, multiple values need to be evaluated. Also, you DO NOT impute labels. We did not exclude the label column because we made sure early on that it has no missing values, so imputation will not change it." 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 | "✅ **DO THIS:** Follow what we have done for the training data. Write code that will impute the testing set:\n", 620 | "1. Drop any rows with > 25% missing values.\n", 621 | "2. Impute missing values with KNNImputer.\n", 622 | "3. Check that there are no missing values."
623 | ] 624 | }, 625 | { 626 | "cell_type": "code", 627 | "execution_count": null, 628 | "metadata": {}, 629 | "outputs": [], 630 | "source": [ 631 | "# put your code here\n", 632 | "\n", 633 | "\n", 634 | "# I encourage you to try to figure this out. If you get stuck, look at the \n", 635 | "# answer below and comment on what each line does." 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": null, 641 | "metadata": {}, 642 | "outputs": [], 643 | "source": [ 644 | "##ANSWER##\n", 645 | "# Sum the number of NAs in each row of the testing set\n", 646 | "row_na_num = test.isnull().sum(axis=1)\n", 647 | "\n", 648 | "# Get the number of features\n", 649 | "num_feat = test.shape[1] - 1\n", 650 | "\n", 651 | "# Determine which rows have below threshold (25%) NAs\n", 652 | "row_na_below_threshold = row_na_num/num_feat < 0.25\n", 653 | "\n", 654 | "# Get the rows we want to keep\n", 655 | "test_keep = test[row_na_below_threshold]\n", 656 | "\n", 657 | "# Impute missing values with the imputer fitted on the training data\n", 658 | "test_keep_imp = imputer.transform(test_keep)\n", 659 | "test_keep_imp = pd.DataFrame(test_keep_imp, columns=test.columns)\n", 660 | "test_keep_imp.isnull().sum()\n", 661 | "##ANSWER##" 662 | ] 663 | }, 664 | { 665 | "cell_type": "markdown", 666 | "metadata": {}, 667 | "source": [ 668 | "✅ **QUESTION:** In __the next 2 minutes__, discuss with your neighbor: note that here the testing data is not used to `fit` the imputer. Instead, the imputer fitted with training data is used to `transform` the test set. Why is that?" 669 | ] 670 | }, 671 | { 672 | "cell_type": "markdown", 673 | "metadata": {}, 674 | "source": [ 675 | "### ⛺ **PAUSE: when you finish, please turn your attention to the instructor. 
**" 676 | ] 677 | }, 678 | { 679 | "cell_type": "markdown", 680 | "metadata": {}, 681 | "source": [ 682 | "### ___4.2 Deal with data imbalance___" 683 | ] 684 | }, 685 | { 686 | "cell_type": "markdown", 687 | "metadata": {}, 688 | "source": [ 689 | "✅ **DO THIS:** Here we will use a hybrid approach:\n", 690 | "- Up-sample the minority class so it has twice as many instances __AND__ \n", 691 | "- Downsample the majority class so it is the same number as the minority." 692 | ] 693 | }, 694 | { 695 | "cell_type": "code", 696 | "execution_count": null, 697 | "metadata": {}, 698 | "outputs": [], 699 | "source": [ 700 | "train_keep_imp.head()" 701 | ] 702 | }, 703 | { 704 | "cell_type": "code", 705 | "execution_count": null, 706 | "metadata": {}, 707 | "outputs": [], 708 | "source": [ 709 | "from collections import Counter\n", 710 | "from imblearn.over_sampling import SMOTE\n", 711 | "from imblearn.under_sampling import RandomUnderSampler\n", 712 | "from imblearn.pipeline import Pipeline\n", 713 | "\n", 714 | "# training features\n", 715 | "X_train = train_keep_imp.iloc[:,1:] \n", 716 | "\n", 717 | "# training labels\n", 718 | "y_train = train_keep_imp.iloc[:,0] \n", 719 | "\n", 720 | "# This will be used in many other occasions.\n", 721 | "feat_names = X_train.columns \n", 722 | "\n", 723 | "# summarize class distribution\n", 724 | "counter = Counter(y_train)\n", 725 | "print(\"Before:\", counter)\n", 726 | "\n", 727 | "# Over-sample minority, under-sample majority\n", 728 | "over = SMOTE(sampling_strategy=0.4)\n", 729 | "under = RandomUnderSampler(sampling_strategy=1)\n", 730 | "steps = [('o', over), ('u', under)]\n", 731 | "pipeline = Pipeline(steps=steps)\n", 732 | "\n", 733 | "# transform the dataset\n", 734 | "X_train_bal, y_train = pipeline.fit_resample(X_train, y_train)\n", 735 | "\n", 736 | "# summarize the new class distribution\n", 737 | "counter = Counter(y_train)\n", 738 | "print(\"After :\", counter)" 739 | ] 740 | }, 741 | { 742 | "cell_type": "markdown", 
743 | "metadata": {}, 744 | "source": [ 745 | "✅ **QUESTION:** In __the next 2 minutes__, discuss with your neighbor: Resampling, particularly upsampling, should __only__ be applied to the training data. The testing set __should not__ be changed in this step. Why is that?" 746 | ] 747 | }, 748 | { 749 | "cell_type": "markdown", 750 | "metadata": {}, 751 | "source": [ 752 | "### ⛺ **PAUSE: once you finish, please turn your attention to the instructor. **" 753 | ] 754 | }, 755 | { 756 | "cell_type": "markdown", 757 | "metadata": {}, 758 | "source": [ 759 | "### ___4.3 Deal with data scaling___\n", 760 | "\n", 761 | "✅ **DO THIS:** Based on your exploratory data analysis, you probably noticed that the data ranges differ widely. Before we work on scaling the data, let's see how the data ranges differ:" 762 | ] 763 | }, 764 | { 765 | "cell_type": "code", 766 | "execution_count": null, 767 | "metadata": {}, 768 | "outputs": [], 769 | "source": [ 770 | "X_train_bal.describe()" 771 | ] 772 | }, 773 | { 774 | "cell_type": "markdown", 775 | "metadata": {}, 776 | "source": [ 777 | "✅ **DO THIS:** In the cell below, let's use `RobustScaler` to scale the balanced training data feature values (`X_train_bal`).\n", 778 | "\n", 779 | "__DO NOT__ apply scaling to the labels (`y`).\n", 780 | "\n", 781 | "Call the scaled features `X_train_scale`."
782 | ] 783 | }, 784 | { 785 | "cell_type": "code", 786 | "execution_count": null, 787 | "metadata": {}, 788 | "outputs": [], 789 | "source": [ 790 | "from sklearn.preprocessing import RobustScaler\n", 791 | "\n", 792 | "# initialize a scaler\n", 793 | "scaler = RobustScaler()\n", 794 | "\n", 795 | "# fit the scaler with training features\n", 796 | "scaler.fit(X_train_bal)\n", 797 | "\n", 798 | "# transform the training feature values with the fitted scaler\n", 799 | "X_train_scale = scaler.transform(X_train_bal)\n", 800 | "X_train_scale = pd.DataFrame(X_train_scale, columns=X_train.columns)\n", 801 | "X_train_scale.describe()" 802 | ] 803 | }, 804 | { 805 | "cell_type": "markdown", 806 | "metadata": {}, 807 | "source": [ 808 | "✅ **DO THIS:** Provide code below to process testing data so it is scaled the same way.\n", 809 | "- Create `X_test` and `y_test` with `test_keep_imp`.\n", 810 | "- Transform (but __do not__ fit) `X_test` with the `RobustScaler`." 811 | ] 812 | }, 813 | { 814 | "cell_type": "code", 815 | "execution_count": null, 816 | "metadata": {}, 817 | "outputs": [], 818 | "source": [ 819 | "# put your code here\n", 820 | "\n", 821 | "\n", 822 | "# If you don't feel comfortable doing this, check out the answer below and\n", 823 | "# comment on what each line is doing." 824 | ] 825 | }, 826 | { 827 | "cell_type": "code", 828 | "execution_count": null, 829 | "metadata": {}, 830 | "outputs": [], 831 | "source": [ 832 | "# put your code here\n", 833 | "\n", 834 | "##ANSWER##\n", 835 | "X_test = test_keep_imp.iloc[:,1:]\n", 836 | "y_test = test_keep_imp.iloc[:,0]\n", 837 | "\n", 838 | "X_test_scale = scaler.transform(X_test)\n", 839 | "X_test_scale = pd.DataFrame(X_test_scale, columns=X_test.columns)\n", 840 | "##ANSWER##" 841 | ] 842 | }, 843 | { 844 | "cell_type": "markdown", 845 | "metadata": {}, 846 | "source": [ 847 | "✅ **CAUTION:** We __did not__ deal with non-normal distributions or co-linear features. In practice, they need to be dealt with. 
To make the exercise simpler, we did not include categorical variables. For starters, please check out the following after class:\n", 848 | "- [This post on data transformation](https://www.analyticsvidhya.com/blog/2020/07/types-of-feature-transformation-and-scaling/).\n", 849 | "- [This post on collinearity](https://towardsdatascience.com/multicollinearity-in-data-science-c5f6c0fe6edf).\n", 850 | "- [This post on how to work with categorical variables](https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/).\n", 851 | "- More generally, [here is a good article on feature engineering](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)." 852 | ] 853 | }, 854 | { 855 | "cell_type": "markdown", 856 | "metadata": {}, 857 | "source": [ 858 | "### ⛺ **PAUSE: please turn your attention to the instructor.**" 859 | ] 860 | }, 861 | { 862 | "cell_type": "markdown", 863 | "metadata": {}, 864 | "source": [ 865 | "---\n", 866 | "\n", 867 | "## __Step 5: Select model__" 868 | ] 869 | }, 870 | { 871 | "cell_type": "markdown", 872 | "metadata": {}, 873 | "source": [ 874 | "### ___5.1 Random forest___\n", 875 | "\n", 876 | "✅ **DO THIS:** Here we will not go into details on how the RandomForest algorithm works. There are many good tutorials and blog posts on RandomForest (e.g., [this one](https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/)) and I encourage you to look into them. The major steps are commented in the code below." 
877 | ] 878 | }, 879 | { 880 | "cell_type": "code", 881 | "execution_count": null, 882 | "metadata": {}, 883 | "outputs": [], 884 | "source": [ 885 | "from sklearn.ensemble import RandomForestClassifier\n", 886 | "from sklearn.model_selection import GridSearchCV\n", 887 | "from sklearn.model_selection import train_test_split\n", 888 | "\n", 889 | "# Create a function for running RandomForest\n", 890 | "def run_randomforest(X_train, y_train):\n", 891 | "    # Python dictionary specifying the hyperparameters to be tested:\n", 892 | "    # 2x2x4x1 = 16 combinations\n", 893 | "    param_grid = {'n_estimators': [200, 500],\n", 894 | "                  'max_features': ['sqrt', 'log2'],\n", 895 | "                  'max_depth' : [3,5,7,9],\n", 896 | "                  'criterion' :['entropy']}\n", 897 | "\n", 898 | "    # Initialize a random forest classifier (rfc) with a random seed\n", 899 | "    rfc = RandomForestClassifier(random_state=rand_seed)\n", 900 | "\n", 901 | "    # Initialize a grid search object that will search through each of the 16\n", 902 | "    # hyperparameter combinations. For each combination, five-fold\n", 903 | "    # cross-validation (cv) is done, so 16x5 = 80 random forest classifiers\n", 904 | "    # will be built in total.\n", 905 | "    rfc_gs = GridSearchCV(\n", 906 | "        rfc,\n", 907 | "        param_grid,\n", 908 | "        cv=5, # cross validation folds\n", 909 | "        verbose=2, # print progress messages\n", 910 | "        scoring='roc_auc', # find model with the best ROC-AUC\n", 911 | "        n_jobs=8) # number of concurrent jobs, you need to\n", 912 | "                  # adjust this based on the number of CPU cores\n", 913 | "                  # available on your machine.\n", 914 | "\n", 915 | "    # Pass the training feature and label data to the grid search object and\n", 916 | "    # start fitting (training) models\n", 917 | "    rfc_gs.fit(X_train, y_train)\n", 918 | "\n", 919 | "    # Return the fitted grid search object\n", 920 | "    return rfc_gs" 921 | ] 922 | }, 923 | { 924 | "cell_type": "code", 925 | "execution_count": null, 926 | "metadata": {}, 927 | "outputs": [], 928 | "source": [ 929 | "# Call the run_randomforest function defined above\n", 930 | "rfc_gs = run_randomforest(X_train_scale, y_train)" 931 | ] 932 | }, 933 | { 934 | "cell_type": "markdown", 935 | "metadata": {}, 936 | "source": [ 937 | "✅ **DO THIS:** The grid search object (`rfc_gs`) holds a whole bunch of useful information. Run the following cells." 
938 | ] 939 | }, 940 | { 941 | "cell_type": "code", 942 | "execution_count": null, 943 | "metadata": {}, 944 | "outputs": [], 945 | "source": [ 946 | "# The best model (also called estimator)\n", 947 | "best_model = rfc_gs.best_estimator_\n", 948 | "\n", 949 | "# Note that the best hyperparameters are also reported\n", 950 | "best_model" 951 | ] 952 | }, 953 | { 954 | "cell_type": "code", 955 | "execution_count": null, 956 | "metadata": {}, 957 | "outputs": [], 958 | "source": [ 959 | "# The best ROC-AUC score averaged across CV folds for the best model\n", 960 | "print(rfc_gs.best_score_)" 961 | ] 962 | }, 963 | { 964 | "cell_type": "markdown", 965 | "metadata": {}, 966 | "source": [ 967 | "✅ **DO THIS:** Although we finished the run rather quickly here, a typical model fitting process can take hours or even days! Thus, the models should be saved so you can reuse them in the future. Run the following to save the best estimator. " 968 | ] 969 | }, 970 | { 971 | "cell_type": "code", 972 | "execution_count": null, 973 | "metadata": {}, 974 | "outputs": [], 975 | "source": [ 976 | "import pickle\n", 977 | "\n", 978 | "filename = \"model_randomforest_gridsearch.save\"\n", 979 | "\n", 980 | "pickle.dump(rfc_gs.best_estimator_, open(filename, 'wb'))" 981 | ] 982 | }, 983 | { 984 | "cell_type": "markdown", 985 | "metadata": {}, 986 | "source": [ 987 | "✅ **QUESTION:** In __the next 2 min__, discuss with your neighbors: what just happened here? Can you describe what you have accomplished in this step and what was done?" 988 | ] 989 | }, 990 | { 991 | "cell_type": "markdown", 992 | "metadata": {}, 993 | "source": [ 994 | "### ⛺ **PAUSE: once you finish, please turn your attention to the instructor. 
**" 995 | ] 996 | }, 997 | { 998 | "cell_type": "markdown", 999 | "metadata": {}, 1000 | "source": [ 1001 | "### ___5.2 Support Vector Classifier (SVC)___\n", 1002 | "\n", 1003 | "✅ **DO THIS:** There are [many other supervised learning algorithms in Scikit-Learn](https://scikit-learn.org/stable/supervised_learning.html). Let's use [Support Vector Machine](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC). \n", 1004 | "\n", 1005 | "Provide comment on the indicated lines." 1006 | ] 1007 | }, 1008 | { 1009 | "cell_type": "code", 1010 | "execution_count": null, 1011 | "metadata": {}, 1012 | "outputs": [], 1013 | "source": [ 1014 | "# Train a SVM classification model\n", 1015 | "from sklearn.model_selection import GridSearchCV\n", 1016 | "from sklearn.svm import SVC\n", 1017 | "\n", 1018 | "# COMMENT: What does this do?\n", 1019 | "#\n", 1020 | "param_grid = {'C': [1, 10, 1e2],\n", 1021 | " 'gamma': [0.0001, 0.001, 0.01, 0.1], \n", 1022 | " 'kernel': ['linear', 'rbf']}\n", 1023 | "\n", 1024 | "# COMMENT: What does this do?\n", 1025 | "# \n", 1026 | "svc = SVC()\n", 1027 | "\n", 1028 | "# COMMENT: What does this do?\n", 1029 | "# \n", 1030 | "svc_gs = GridSearchCV(svc, param_grid, cv=5, verbose=2, scoring='roc_auc',\n", 1031 | " n_jobs=8)\n", 1032 | "\n", 1033 | "# COMMENT: What does this do?\n", 1034 | "# \n", 1035 | "svc_gs.fit(X_train_scale, y_train)\n", 1036 | "\n", 1037 | "# COMMENT: What does this do?\n", 1038 | "# \n", 1039 | "filename = \"model_svc_gridsearch.save\"\n", 1040 | "pickle.dump(svc_gs.best_estimator_, open(filename, 'wb'))\n", 1041 | "\n", 1042 | "# COMMENT: What does these do?\n", 1043 | "# \n", 1044 | "print(svc_gs.best_params_)\n", 1045 | "print(svc_gs.best_score_)\n", 1046 | "\n", 1047 | "##COMMENT ANSWERS##\n", 1048 | "# Set up the hyperparameter combinations: 3x4x2 = 24 runs\n", 1049 | "# Intialize a support vector classifier\n", 1050 | "# Initiate a grid search object with cross validation\n", 1051 | "# 
Search for the best hyperparameters with training data\n", 1052 | "# Save the best model as a file\n", 1053 | "# Print out the best parameters and best scores" 1054 | ] 1055 | }, 1056 | { 1057 | "cell_type": "markdown", 1058 | "metadata": {}, 1059 | "source": [ 1060 | "✅ **CAUTION:** We only tried two algorithms here; you should try many more.\n", 1061 | "\n", 1062 | "In addition, beyond the hyperparameters associated with the algorithms, there are other things to tune here:\n", 1063 | "- Cross validation methods: There are quite a number of approaches. See [this](https://scikit-learn.org/stable/modules/cross_validation.html) for examples.\n", 1064 | "- Searching parameters: Grid search is but one approach. Two other popular methods are [randomized search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) and [Bayesian optimization](https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html). These should also be tested." 1065 | ] 1066 | }, 1067 | { 1068 | "cell_type": "markdown", 1069 | "metadata": {}, 1070 | "source": [ 1071 | "✅ **QUESTION:** In __the next 2 min__, discuss with your neighbors: based on the above two runs, can you decide which algorithm is better? Why or why not?" 1072 | ] 1073 | }, 1074 | { 1075 | "cell_type": "markdown", 1076 | "metadata": {}, 1077 | "source": [ 1078 | "### ⛺ **PAUSE: once you finish, please turn your attention to the instructor.**" 1079 | ] 1080 | }, 1081 | { 1082 | "cell_type": "markdown", 1083 | "metadata": {}, 1084 | "source": [ 1085 | "---\n", 1086 | "\n", 1087 | "## __Step 6. Repeat Steps 2-5__\n", 1088 | "\n", 1089 | "In a typical ML project, after you have explored different algorithms to find the best __initial__ model, it is time to go back to tweak everything to see if you can do even better. 
Things don't just end here!\n", 1090 | "\n", 1091 | "An important step here is __feature selection__, a part of feature engineering that involves selecting the most important features to rebuild your models." 1092 | ] 1093 | }, 1094 | { 1095 | "cell_type": "markdown", 1096 | "metadata": { 1097 | "slideshow": { 1098 | "slide_type": "slide" 1099 | } 1100 | }, 1101 | "source": [ 1102 | "### ___6.1 Get feature importance using trained model___" 1103 | ] 1104 | }, 1105 | { 1106 | "cell_type": "markdown", 1107 | "metadata": { 1108 | "slideshow": { 1109 | "slide_type": "subslide" 1110 | } 1111 | }, 1112 | "source": [ 1113 | "✅ **DO THIS:** Use permutation importance scores computed with the trained Random Forest model to choose the top features.\n" 1114 | ] 1115 | }, 1116 | { 1117 | "cell_type": "code", 1118 | "execution_count": null, 1119 | "metadata": {}, 1120 | "outputs": [], 1121 | "source": [ 1122 | "from sklearn.inspection import permutation_importance\n", 1123 | "\n", 1124 | "# Specify the best model (estimator) from our RandomForest run\n", 1125 | "rfc = rfc_gs.best_estimator_\n", 1126 | "\n", 1127 | "# Calculate permutation importance of each feature\n", 1128 | "result = permutation_importance(\n", 1129 | "    rfc, X_train_scale, y_train, n_repeats=10, random_state=42, n_jobs=8)\n", 1130 | "\n", 1131 | "# COMMENTS: What can you find in the result?\n", 1132 | "#\n", 1133 | "#\n", 1134 | "result" 1135 | ] 1136 | }, 1137 | { 1138 | "cell_type": "code", 1139 | "execution_count": null, 1140 | "metadata": {}, 1141 | "outputs": [], 1142 | "source": [ 1143 | "##COMMENT ANSWER##\n", 1144 | "# importances_mean: mean importance value of each feature\n", 1145 | "# importances_std: importance standard deviation of each feature\n", 1146 | "# importances: the importance scores of each feature for each repeat " 1147 | ] 1148 | }, 1149 | { 1150 | "cell_type": "code", 1151 | "execution_count": null, 1152 | "metadata": {}, 1153 | "outputs": [], 1154 | "source": [ 1155 | "# sort the permutation 
importance based on mean values\n", 1156 | "sorted_idx = result.importances_mean.argsort()[::-1]\n", 1157 | "sorted_idx" 1158 | ] 1159 | }, 1160 | { 1161 | "cell_type": "code", 1162 | "execution_count": null, 1163 | "metadata": {}, 1164 | "outputs": [], 1165 | "source": [ 1166 | "# Get the importance values in order of the sorted_idx\n", 1167 | "importance_values = result.importances[sorted_idx].T" 1168 | ] 1169 | }, 1170 | { 1171 | "cell_type": "code", 1172 | "execution_count": null, 1173 | "metadata": {}, 1174 | "outputs": [], 1175 | "source": [ 1176 | "# Get the feature names based on the sorted index\n", 1177 | "ordered_feature_label = X_train_scale.columns[sorted_idx]\n", 1178 | "ordered_feature_label" 1179 | ] 1180 | }, 1181 | { 1182 | "cell_type": "code", 1183 | "execution_count": null, 1184 | "metadata": {}, 1185 | "outputs": [], 1186 | "source": [ 1187 | "# Plot the permutation importance results\n", 1188 | "fig, ax = plt.subplots(figsize=(6,6))\n", 1189 | "ax.boxplot(importance_values, \n", 1190 | " vert=False, \n", 1191 | " labels=ordered_feature_label)\n", 1192 | "ax.set_title(\"Permutation Importances (training set)\")\n", 1193 | "fig.tight_layout()\n", 1194 | "plt.show()" 1195 | ] 1196 | }, 1197 | { 1198 | "cell_type": "markdown", 1199 | "metadata": {}, 1200 | "source": [ 1201 | "✅ **QUESTION:** In __the next 2 min__, discuss with your neighbors: how would you interpret the figure above? Is this as what you have expected when you hypothesize the most important feature?" 1202 | ] 1203 | }, 1204 | { 1205 | "cell_type": "markdown", 1206 | "metadata": {}, 1207 | "source": [ 1208 | "### ⛺ **PAUSE: once you finish, please turn your attention to the instructor. 
**" 1209 | ] 1210 | }, 1211 | { 1212 | "cell_type": "markdown", 1213 | "metadata": { 1214 | "slideshow": { 1215 | "slide_type": "slide" 1216 | } 1217 | }, 1218 | "source": [ 1219 | "### ___6.2 Retrain model using top 10 features___" 1220 | ] 1221 | }, 1222 | { 1223 | "cell_type": "markdown", 1224 | "metadata": { 1225 | "slideshow": { 1226 | "slide_type": "subslide" 1227 | } 1228 | }, 1229 | "source": [ 1230 | "✅ **DO THIS:** A simpler model is always better, because it is easier to understand and because it tends not to be overfitted. Let's use the top 10 features to train a new RandomForest model and see how well it does.\n" 1231 | ] 1232 | }, 1233 | { 1234 | "cell_type": "code", 1235 | "execution_count": null, 1236 | "metadata": { 1237 | "slideshow": { 1238 | "slide_type": "fragment" 1239 | } 1240 | }, 1241 | "outputs": [], 1242 | "source": [ 1243 | "# Get top 10 feature names\n", 1244 | "feat_top10 = ordered_feature_label[:10]\n", 1245 | "\n", 1246 | "# Get training data with only the top 10 features\n", 1247 | "X_train_top10 = X_train_scale[feat_top10]\n", 1248 | "X_train_top10.shape" 1249 | ] 1250 | }, 1251 | { 1252 | "cell_type": "code", 1253 | "execution_count": null, 1254 | "metadata": { 1255 | "slideshow": { 1256 | "slide_type": "fragment" 1257 | } 1258 | }, 1259 | "outputs": [], 1260 | "source": [ 1261 | "# Do the same for testing data\n", 1262 | "X_test_top10 = X_test_scale[feat_top10]\n", 1263 | "X_test_top10.shape" 1264 | ] 1265 | }, 1266 | { 1267 | "cell_type": "code", 1268 | "execution_count": null, 1269 | "metadata": { 1270 | "slideshow": { 1271 | "slide_type": "fragment" 1272 | } 1273 | }, 1274 | "outputs": [], 1275 | "source": [ 1276 | "# Train RandomForest model via grid search\n", 1277 | "rfc_gs_top10 = run_randomforest(X_train_top10, y_train)" 1278 | ] 1279 | }, 1280 | { 1281 | "cell_type": "code", 1282 | "execution_count": null, 1283 | "metadata": {}, 1284 | "outputs": [], 1285 | "source": [ 1286 | "# Save model\n", 1287 | "filename = 
\"model_randomforest_gridsearch_top10feat.save\"\n", 1288 | "\n", 1289 | "pickle.dump(rfc_gs_top10.best_estimator_, open(filename, 'wb'))" 1290 | ] 1291 | }, 1292 | { 1293 | "cell_type": "code", 1294 | "execution_count": null, 1295 | "metadata": {}, 1296 | "outputs": [], 1297 | "source": [ 1298 | "# Get model score\n", 1299 | "rfc_gs_top10.best_score_" 1300 | ] 1301 | }, 1302 | { 1303 | "cell_type": "markdown", 1304 | "metadata": {}, 1305 | "source": [ 1306 | "✅ **QUESTION:** In __the next 2 min__, discuss with your neighbors: compare this result to the model with all features, should we just use the top 10? Why and why not?" 1307 | ] 1308 | }, 1309 | { 1310 | "cell_type": "markdown", 1311 | "metadata": {}, 1312 | "source": [ 1313 | "### ⛺ **PAUSE: once you finish, please turn your attention to the instructor. **" 1314 | ] 1315 | }, 1316 | { 1317 | "cell_type": "markdown", 1318 | "metadata": {}, 1319 | "source": [ 1320 | "---\n", 1321 | "\n", 1322 | "## __Step 7. Evaluate model with the testing set__\n", 1323 | "\n", 1324 | "Assume that we are done with model building and have a final model that we cannot improve further. Then it is time to use the testing set to evaluate the model." 1325 | ] 1326 | }, 1327 | { 1328 | "cell_type": "markdown", 1329 | "metadata": {}, 1330 | "source": [ 1331 | "### ___7.1 Using the optimal model to predict testing set___" 1332 | ] 1333 | }, 1334 | { 1335 | "cell_type": "markdown", 1336 | "metadata": {}, 1337 | "source": [ 1338 | "✅ **DO THIS:** Random Forest is a little better than SVC. And because using 10 features leads to a model that perofrm almost as well as the one with more features, we will pick the \"optimal\" model to be the random forest model using only 10 features." 
1339 | ] 1340 | }, 1341 | { 1342 | "cell_type": "code", 1343 | "execution_count": null, 1344 | "metadata": {}, 1345 | "outputs": [], 1346 | "source": [ 1347 | "filename2 = \"model_randomforest_gridsearch_top10feat.save\"\n", 1348 | "\n", 1349 | "# load model from file\n", 1350 | "rfc_loaded_top10 = pickle.load(open(filename2, 'rb')) # model using top 10\n", 1351 | "\n", 1352 | "# predict testing data labels with the model using top 10 features\n", 1353 | "y_test_pred = rfc_loaded_top10.predict(X_test_top10)" 1354 | ] 1355 | }, 1356 | { 1357 | "cell_type": "code", 1358 | "execution_count": null, 1359 | "metadata": {}, 1360 | "outputs": [], 1361 | "source": [ 1362 | "# Take a look at the 1st 40 predictions\n", 1363 | "print(y_test_pred[:40])" 1364 | ] 1365 | }, 1366 | { 1367 | "cell_type": "code", 1368 | "execution_count": null, 1369 | "metadata": {}, 1370 | "outputs": [], 1371 | "source": [ 1372 | "# Also take a look at the 1st 40 TRUE values\n", 1373 | "print(numpy.array(y_test[:40]))" 1374 | ] 1375 | }, 1376 | { 1377 | "cell_type": "markdown", 1378 | "metadata": { 1379 | "tags": [] 1380 | }, 1381 | "source": [ 1382 | "✅ **QUESTION:** Based on the above results? Do you feel that the model is doing well?" 1383 | ] 1384 | }, 1385 | { 1386 | "cell_type": "markdown", 1387 | "metadata": {}, 1388 | "source": [ 1389 | "### ⛺ **PAUSE: once you finish, please turn your attention to the instructor. **" 1390 | ] 1391 | }, 1392 | { 1393 | "cell_type": "markdown", 1394 | "metadata": {}, 1395 | "source": [ 1396 | "### ___7.2 Confusion matrix___" 1397 | ] 1398 | }, 1399 | { 1400 | "cell_type": "markdown", 1401 | "metadata": {}, 1402 | "source": [ 1403 | "✅ **DO THIS:** Let's generate __confusion matrices__ for the predicted results from the model with all features and the one with just the top 10." 
1404 | ] 1405 | }, 1406 | { 1407 | "cell_type": "code", 1408 | "execution_count": null, 1409 | "metadata": {}, 1410 | "outputs": [], 1411 | "source": [ 1412 | "from sklearn.metrics import confusion_matrix\n", 1413 | "from sklearn.metrics import ConfusionMatrixDisplay\n", 1414 | "\n", 1415 | "# Get confusion matrix\n", 1416 | "cm_top10 = confusion_matrix(y_test, y_test_pred)" 1417 | ] 1418 | }, 1419 | { 1420 | "cell_type": "code", 1421 | "execution_count": null, 1422 | "metadata": {}, 1423 | "outputs": [], 1424 | "source": [ 1425 | "# Plot the confusion matrix\n", 1426 | "cm_display = ConfusionMatrixDisplay(cm_top10).plot()" 1427 | ] 1428 | }, 1429 | { 1430 | "cell_type": "markdown", 1431 | "metadata": {}, 1432 | "source": [ 1433 | "The confusion matrix above tells us that:\n", 1434 | "- Number of true negatives ($tn$) = 292\n", 1435 | "- Number of false positives ($fp$) = 75\n", 1436 | "- Number of false negatives ($fn$) = 17\n", 1437 | "- Number of true positives ($tp$) = 46" 1438 | ] 1439 | }, 1440 | { 1441 | "cell_type": "markdown", 1442 | "metadata": { 1443 | "tags": [] 1444 | }, 1445 | "source": [ 1446 | "✅ **QUESTION:** In __the next 1 min__, discuss with your neighbors: based on the above results, do you feel that the model is doing well? Why or why not?" 1447 | ] 1448 | }, 1449 | { 1450 | "cell_type": "markdown", 1451 | "metadata": {}, 1452 | "source": [ 1453 | "### ⛺ **PAUSE: once you finish, please turn your attention to the instructor.**" 1454 | ] 1455 | }, 1456 | { 1457 | "cell_type": "markdown", 1458 | "metadata": {}, 1459 | "source": [ 1460 | "### ___7.3 Classification report___" 1461 | ] 1462 | }, 1463 | { 1464 | "cell_type": "markdown", 1465 | "metadata": {}, 1466 | "source": [ 1467 | "✅ **CAUTION:** The metrics printed out are defined below. Unfortunately, we __do not have time__ to go through these. 
But for you to run an ML project, these, including ROC-AUC, are __foundational knowledge and you need to know the advantages and disadvantages of using them__. [Here is an excellent summary](https://machinelearningmastery.com/metrics-evaluate-machine-learning-algorithms-python/) of important metrics for evaluating ML models.\n", 1468 | "\n", 1469 | "|Metric|Formula|\n", 1470 | "|---|---|\n", 1471 | "|Precision|$p = tp/(tp + fp)$|\n", 1472 | "|Recall|$r = tp / (tp + fn)$|\n", 1473 | "|F1 score|$f1 = 2(p\\times r)/(p + r)$|\n", 1474 | "|Accuracy|$(tp+tn)/(tp+tn+fp+fn)$|\n", 1475 | "|Macro average|$0.5\\times score_{\\text{class0}} + 0.5\\times score_{\\text{class1}}$|\n", 1476 | "|Weighted average|$P_{\\text{class0}}\\times score_{\\text{class0}} + P_{\\text{class1}}\\times score_{\\text{class1}}$
 $P$: proportion of a class.|\n", 1477 | "\n", 1478 | "Support: number of instances in each class (not a performance metric)\n" 1479 | ] 1480 | }, 1481 | { 1482 | "cell_type": "markdown", 1483 | "metadata": {}, 1484 | "source": [ 1485 | "✅ **DO THIS:** Let's also generate a classification report to get a few other performance metrics." 1486 | ] 1487 | }, 1488 | { 1489 | "cell_type": "code", 1490 | "execution_count": null, 1491 | "metadata": {}, 1492 | "outputs": [], 1493 | "source": [ 1494 | "from sklearn.metrics import classification_report\n", 1495 | "\n", 1496 | "# Set class names\n", 1497 | "targets = [\"GM\", \"SM\"]\n", 1498 | "\n", 1499 | "report = classification_report(y_test, y_test_pred, target_names=targets)\n", 1500 | "print(report)" 1501 | ] 1502 | }, 1503 | { 1504 | "cell_type": "markdown", 1505 | "metadata": {}, 1506 | "source": [ 1507 | "### ⛺ **PAUSE: once you finish, please turn your attention to the instructor.**" 1508 | ] 1509 | }, 1510 | { 1511 | "cell_type": "markdown", 1512 | "metadata": {}, 1513 | "source": [ 1514 | "### ___7.4 Graphics that help with evaluation___" 1515 | ] 1516 | }, 1517 | { 1518 | "cell_type": "markdown", 1519 | "metadata": {}, 1520 | "source": [ 1521 | "✅ **DO THIS:** Here we provide two examples: the ROC curve and the precision-recall curve. The red dashed line indicates how a naive classifier would fare with __random guesses__. We only plot these curves for the model with top 10 features." 
1522 | ] 1523 | }, 1524 | { 1525 | "cell_type": "code", 1526 | "execution_count": null, 1527 | "metadata": {}, 1528 | "outputs": [], 1529 | "source": [ 1530 | "from sklearn.metrics import RocCurveDisplay\n", 1531 | "from sklearn.metrics import PrecisionRecallDisplay\n", 1532 | "\n", 1533 | "def plot_curves(X, y, estimator, background):\n", 1534 | "\n", 1535 | "    fig, axs = plt.subplots(1, 2, figsize=(10,5))\n", 1536 | "\n", 1537 | "    # ROC curve\n", 1538 | "    RocCurveDisplay.from_estimator(estimator, X, y, ax=axs[0])\n", 1539 | "    # Plot ROC background (random guess)\n", 1540 | "    axs[0].plot([0, 1], [0, 1],'r--')\n", 1541 | "\n", 1542 | "    # Precision-recall curve\n", 1543 | "    PrecisionRecallDisplay.from_estimator(estimator, X, y, ax=axs[1])\n", 1544 | "    axs[1].legend(loc='upper right')\n", 1545 | "    # Plot PR-curve background (random guess)\n", 1546 | "    axs[1].plot([0, 1], [background, background],'r--')\n", 1547 | "    axs[1].set_ylim(-0.05,1.05)\n", 1548 | "\n", 1549 | "    plt.show()\n", 1550 | "\n", 1551 | "num_class0 = y_test[y_test==0].shape[0]\n", 1552 | "num_class1 = y_test[y_test==1].shape[0]\n", 1553 | "background = num_class1/(num_class0+num_class1)\n", 1554 | "print(background)\n", 1555 | "\n", 1556 | "plot_curves(X_test_top10, y_test, rfc_loaded_top10, background)" 1557 | ] 1558 | }, 1559 | { 1560 | "cell_type": "markdown", 1561 | "metadata": { 1562 | "tags": [] 1563 | }, 1564 | "source": [ 1565 | "✅ **QUESTION:** The red lines show how we expect a random model to perform. For the plot on the right-hand side, the line is set by the `background` value defined in the cell above. Why would it be considered as `background` (i.e., random guess) for the right-hand plot?" 1566 | ] 1567 | }, 1568 | { 1569 | "cell_type": "markdown", 1570 | "metadata": {}, 1571 | "source": [ 1572 | "✅ **DO THIS:** Let's also get these curves for training data." 
1573 | ] 1574 | }, 1575 | { 1576 | "cell_type": "code", 1577 | "execution_count": null, 1578 | "metadata": {}, 1579 | "outputs": [], 1580 | "source": [ 1581 | "num_class0 = y_train[y_train==0].shape[0]\n", 1582 | "num_class1 = y_train[y_train==1].shape[0]\n", 1583 | "background = num_class1/(num_class0+num_class1)\n", 1584 | "print(background)\n", 1585 | "\n", 1586 | "plot_curves(X_train_top10, y_train, rfc_loaded_top10, background)" 1587 | ] 1588 | }, 1589 | { 1590 | "cell_type": "markdown", 1591 | "metadata": { 1592 | "tags": [] 1593 | }, 1594 | "source": [ 1595 | "✅ **QUESTION:** In __the next 2 min__, discuss with your neighbor: why is the model performance for the training data so much better than that for the testing set?" 1596 | ] 1597 | }, 1598 | { 1599 | "cell_type": "markdown", 1600 | "metadata": {}, 1601 | "source": [ 1602 | "### ⛺ **PAUSE: after your discussion, please turn your attention to the instructor.**" 1603 | ] 1604 | }, 1605 | { 1606 | "cell_type": "markdown", 1607 | "metadata": {}, 1608 | "source": [ 1609 | "---\n", 1610 | "\n", 1611 | "## __Step 8. Interpret model__" 1612 | ] 1613 | }, 1614 | { 1615 | "cell_type": "markdown", 1616 | "metadata": {}, 1617 | "source": [ 1618 | "### ___8.1 Global interpretation using SHAP___" 1619 | ] 1620 | }, 1621 | { 1622 | "cell_type": "markdown", 1623 | "metadata": {}, 1624 | "source": [ 1625 | "✅ **DO THIS:** Run the code below to get [SHAP (SHapley Additive exPlanations)](https://shap.readthedocs.io/en/latest/index.html) values that use game theory to determine how important each feature is in contributing to a prediction. There are MANY, MANY things you can do with SHAP and we will first figure out which features are more important than the others." 
1626 | ] 1627 | }, 1628 | { 1629 | "cell_type": "code", 1630 | "execution_count": null, 1631 | "metadata": {}, 1632 | "outputs": [], 1633 | "source": [ 1634 | "help(rfc_loaded_top10)" 1635 | ] 1636 | }, 1637 | { 1638 | "cell_type": "code", 1639 | "execution_count": null, 1640 | "metadata": {}, 1641 | "outputs": [], 1642 | "source": [ 1643 | "rfc_loaded_top10.feature_importances_ " 1644 | ] 1645 | }, 1646 | { 1647 | "cell_type": "code", 1648 | "execution_count": null, 1649 | "metadata": {}, 1650 | "outputs": [], 1651 | "source": [ 1652 | "X_train_top10.shape" 1653 | ] 1654 | }, 1655 | { 1656 | "cell_type": "code", 1657 | "execution_count": null, 1658 | "metadata": {}, 1659 | "outputs": [], 1660 | "source": [ 1661 | "import shap\n", 1662 | "\n", 1663 | "# Get a TreeExplainer object using the model we have created\n", 1664 | "explainer = shap.TreeExplainer(rfc_loaded_top10)\n", 1665 | "\n", 1666 | "# SHAP values for the training dataset\n", 1667 | "shap_values = explainer.shap_values(X_train_top10)\n", 1668 | "\n", 1669 | "# SHAP values for the positive class (label=1)\n", 1670 | "shap_for_label1 = shap_values[:,:,1]\n", 1671 | "print(shap_for_label1.shape)\n", 1672 | "\n", 1673 | "# Generate a summary plot\n", 1674 | "shap.summary_plot(shap_for_label1, X_train_top10, sort=True, plot_type=\"bar\")" 1675 | ] 1676 | }, 1677 | { 1678 | "cell_type": "markdown", 1679 | "metadata": {}, 1680 | "source": [ 1681 | "We can see that `Fam_size` is the most important. Because we have only two classes, if a feature is important for classifying one class, it must also be important for the other class." 1682 | ] 1683 | }, 1684 | { 1685 | "cell_type": "markdown", 1686 | "metadata": {}, 1687 | "source": [ 1688 | "### ___8.2 SHAP values of different instances___" 1689 | ] 1690 | }, 1691 | { 1692 | "cell_type": "markdown", 1693 | "metadata": {}, 1694 | "source": [ 1695 | "✅ **DO THIS:** Another way to look at the SHAP values is by focusing on a particular class. 
In the example below, we focus on how different features contribute to the predictions of label=1 (SM):" 1696 | ] 1697 | }, 1698 | { 1699 | "cell_type": "code", 1700 | "execution_count": null, 1701 | "metadata": {}, 1702 | "outputs": [], 1703 | "source": [ 1704 | "# plot the SHAP values and color them based on feature values\n", 1705 | "shap.summary_plot(shap_for_label1, X_train_top10, sort=True)" 1706 | ] 1707 | }, 1708 | { 1709 | "cell_type": "markdown", 1710 | "metadata": { 1711 | "tags": [] 1712 | }, 1713 | "source": [ 1714 | "For each feature $x$, two SHAP values are generated for each instance: one for $x$'s contribution to class $0$ and the other for its contribution to class $1$. In the plot above, we are only looking at the contribution to class $1$ and each dot is an instance.\n", 1715 | "\n", 1716 | "Looking at `Fam_size` (family size), instances with higher feature values (i.e., in larger families) also tend to have higher positive SHAP values (i.e., a positive contribution to being in class 1).\n", 1717 | "\n", 1718 | "✅ **QUESTION:** In __the next 2 min__, discuss with your neighbor how to interpret the SHAP value distribution of the `Func_likelihood` feature, where higher feature values correlate with lower SHAP values." 1719 | ] 1720 | }, 1721 | { 1722 | "cell_type": "markdown", 1723 | "metadata": {}, 1724 | "source": [ 1725 | "### ⛺ **PAUSE: after your discussion, please turn your attention to the instructor. 
**" 1726 | ] 1727 | }, 1728 | { 1729 | "cell_type": "markdown", 1730 | "metadata": {}, 1731 | "source": [ 1732 | "-----" 1733 | ] 1734 | } 1735 | ], 1736 | "metadata": { 1737 | "kernelspec": { 1738 | "display_name": "Python 3.10.4 ('bert_finetune': conda)", 1739 | "language": "python", 1740 | "name": "python3" 1741 | }, 1742 | "language_info": { 1743 | "codemirror_mode": { 1744 | "name": "ipython", 1745 | "version": 3 1746 | }, 1747 | "file_extension": ".py", 1748 | "mimetype": "text/x-python", 1749 | "name": "python", 1750 | "nbconvert_exporter": "python", 1751 | "pygments_lexer": "ipython3", 1752 | "version": "3.12.3" 1753 | }, 1754 | "vscode": { 1755 | "interpreter": { 1756 | "hash": "323c618d0395b34183a36199d7c8eddbd4e55d51aee9dabbfbc9809db817fb2b" 1757 | } 1758 | } 1759 | }, 1760 | "nbformat": 4, 1761 | "nbformat_minor": 4 1762 | } 1763 | -------------------------------------------------------------------------------- /ML_workshop-part_c_xai.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "---\n", 8 | "# __Part-c: eXplainable AI/ML (XAI)__" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": { 14 | "tags": [] 15 | }, 16 | "source": [ 17 | "# Learning objectives\n", 18 | "\n", 19 | "At the end of the exercise, you should be able to:\n", 20 | "- Explain why it is important to explain your model.\n", 21 | "- Interpret model with example global and local interpretaion methods." 
 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "# First things first: test your install\n", 31 | "import sklearn, pandas, matplotlib, seaborn, imblearn, numpy, shap, tqdm\n", 32 | "\n", 33 | "# If you encounter an error, talk to the instructors and/or your neighbor ASAP" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Outline\n", 41 | "\n", 42 | "- [5 min] Intro (PowerPoint)\n", 43 | "- [[10 min] Background](#background)\n", 44 | "- [[05 min] Step 8: Interpret model](#step7)" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": { 50 | "tags": [] 51 | }, 52 | "source": [ 53 | "----\n", 54 | "\n", 55 | "\n", 56 | "# __Background__" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "\n", 64 | "## _Machine learning workflow_" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "\n", 72 | "|Step|Purpose|Notes|\n", 73 | "|---|---|---|\n", 74 | "|1|State the question and frame it as a machine learning problem|Is it a supervised or unsupervised learning problem? There are other major types (semi-supervised and reinforcement learning).|\n", 75 | "|2|Collect data and perform exploratory data analysis (EDA)|Apply four types of EDA.|\n", 76 | "|3|Split training and testing data|Separate out testing data for evaluation purposes.|\n", 77 | "|4|Engineer features| - Format data, impute missing values, normalize/scale features.
- Transform feature values and select informative features.
- Create combinatorial features (e.g., the interaction terms for a basic linear model).
- Do dimensionality reduction to reduce the number of features.|\n", 78 | "|5|Select models| - There are many models/algorithms/classifiers making different assumptions about how modeling should be done. Thus, it is important to go through a wide range of them.
- And for each model/algorithm, we need to tune __hyperparameters__, i.e., parameters that are set manually.
- Cross-validate the model, i.e., split the data into training/validation subsets, train a model on the training subset, and score it on the validation subset. Then split the data into training/validation subsets again but differently, train, and score. Repeat the process $k$ times.|\n", 79 | "|6|Repeat 2-5|Work to iteratively improve the models.|\n", 80 | "|7|Evaluate the best performing model|- After the best-performing model is identified based on the training data, the model is applied to the testing set.
- Various model performance metrics can be used to assess how good the model is.|\n", 81 | "|8|Interpret the model|- Dissect the model to better understand how it works.
- Identify the most informative features for the best performing model.
- Assess the reasons behind false predictions to further improve the model.|\n", 82 | "|9|Deploy model|Apply the model to new data and make predictions.|" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "## _What is interpretability_" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "## _Why interpret the model_" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "Blah blah" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "----\n", 111 | "\n", 112 | "# __Step 1. Define ML problem__\n", 113 | "\n", 114 | "___Problem statement___: \n", 115 | "- Following this published paper:\n", 116 | " - [Robust predictions of specialized metabolism genes through machine learning](https://www.pnas.org/doi/abs/10.1073/pnas.1817074116)\n", 117 | "- Given:\n", 118 | " - Known genes in general metabolism (GM) or specialized metabolism (SM)\n", 119 | " - Various features of genes\n", 120 | " - E.g., expression levels, functional category they belong to \n", 121 | "- How can we use them to:\n", 122 | " - Distinguish genes involved in GM from those involved in SM?\n" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "----\n", 130 | "\n", 131 | "## __Step 2-7. From exploratory data analysis to model evaluation__\n", 132 | "\n", 133 | "We will skip all these steps to cut to the chase. Here we will use a model that has already been built for model interpretation. If you'd like to learn more about how the model was built, see [our ML workshop](https://github.com/ShiuLab/ML_workshop) materials." 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "✅ **DO THIS:** Run the following cell to load the model." 
 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "---\n", 155 | "\n", 156 | "## __Step 8. Interpret model__" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "### ___8.1 Global interpretation using SHAP___" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "✅ **DO THIS:** Run the code below to get [SHAP (SHapley Additive exPlanations)](https://shap.readthedocs.io/en/latest/index.html) values, which use game theory to determine how important each feature is in contributing to a prediction. There are MANY, MANY things you can do with SHAP and we will first figure out which features are more important than the others." 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "import shap\n", 180 | "shap.initjs()\n", 181 | "\n", 182 | "# Get a TreeExplainer object using the model we have created\n", 183 | "explainer = shap.TreeExplainer(rfc_loaded_top10)\n", 184 | "\n", 185 | "# Use the TreeExplainer to get SHAP values for the training data\n", 186 | "shap_values = explainer.shap_values(X_train_top10)\n", 187 | "\n", 188 | "# Generate a summary plot\n", 189 | "shap.summary_plot(shap_values, X_train_top10, sort=True)" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "We can see that `Fam_size` is the most important feature. Because we have only two classes, if a feature is important for classifying one class, it must also be important for the other class." 
 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "### ___8.2 SHAP values of different instances___" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "✅ **DO THIS:** Another way to look at the SHAP values is by focusing on a particular class. In the example below, we focus on how different features contribute to the predictions of label=1 (SM):" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [ 219 | "shap_for_label1 = shap_values[1]\n", 220 | "\n", 221 | "# plot the SHAP values and color them by feature value\n", 222 | "shap.summary_plot(shap_for_label1, X_train_top10, sort=True)" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": { 228 | "tags": [] 229 | }, 230 | "source": [ 231 | "For each feature $x$, two SHAP values are generated for each instance: one for $x$'s contribution to class $0$ and the other for its contribution to class $1$. In the plot above, we are only looking at the contribution to class $1$ and each dot is an instance.\n", 232 | "\n", 233 | "Look at `Fam_size` (family size): instances with higher feature values (i.e., genes in larger families) also tend to have higher positive SHAP values (i.e., a positive contribution to being predicted as class 1).\n", 234 | "\n", 235 | "✅ **QUESTION:** In __the next 2 min__, discuss with your neighbor and interpret what the SHAP value distribution of the `Func_likelihood` feature means, where higher feature values correlate with lower SHAP values." 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "### ⛺ **PAUSE: after your discussion, please turn your attention to the instructor.**" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "-----\n", 250 | "### Congratulations, we're done!" 
 251 | ] 252 | } 253 | ], 254 | "metadata": { 255 | "kernelspec": { 256 | "display_name": "Python 3.10.3 ('sklearn': conda)", 257 | "language": "python", 258 | "name": "python3" 259 | }, 260 | "language_info": { 261 | "codemirror_mode": { 262 | "name": "ipython", 263 | "version": 3 264 | }, 265 | "file_extension": ".py", 266 | "mimetype": "text/x-python", 267 | "name": "python", 268 | "nbconvert_exporter": "python", 269 | "pygments_lexer": "ipython3", 270 | "version": "3.10.3" 271 | }, 272 | "vscode": { 273 | "interpreter": { 274 | "hash": "b31353e0b5417ab1909bc62ced18dcd0f4fa5d6ceab54c18c132f16ad2bde54b" 275 | } 276 | } 277 | }, 278 | "nbformat": 4, 279 | "nbformat_minor": 4 280 | } 281 | -------------------------------------------------------------------------------- /ML_workshop-xai.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/ML_workshop-xai.pptx -------------------------------------------------------------------------------- /feature_list.csv: -------------------------------------------------------------------------------- 1 | Feature name,Description 2 | Func_likelihood,The likelihood that the gene in question is functional as derived from a model distinguishing functional genes vs. 
pseudogenes 3 | Fam_size,The size of the gene family the gene in question belongs to 4 | Max_id_paralog,"The maximum identity of the gene in question to its ""siblings"" (paralogs or duplicates)" 5 | WGD_alpha,Whether the gene in question still has a sibling (paralog) derived from the alpha whole-genome duplication event 6 | WGD_beta_gamma,Whether the gene in question still has a sibling (paralog) derived from the beta or gamma whole-genome duplication event 7 | Dup_recent,Whether the gene in question has a sibling that is closely related (synonymous substitution rate < 0.05) 8 | Dup_tandem,Whether the gene in question is tandemly duplicated 9 | Singleton,Whether the gene in question has no sibling 10 | Max_PCC_GM_abiotic,The maximum Pearson's correlation between the abiotic stress transcriptional responses of the gene in question and those of general metabolism genes 11 | Max_PCC_SM_abiotic,The maximum Pearson's correlation between the abiotic stress transcriptional responses of the gene in question and those of specialized metabolism genes 12 | Max_PCC_GM_biotic,The maximum Pearson's correlation between the biotic stress transcriptional responses of the gene in question and those of general metabolism genes 13 | Max_PCC_SM_biotic,The maximum Pearson's correlation between the biotic stress transcriptional responses of the gene in question and those of specialized metabolism genes 14 | Max_PCC_GM_hormone,The maximum Pearson's correlation between the hormone-treatment transcriptional responses of the gene in question and those of general metabolism genes 15 | Max_PCC_SM_hormone,The maximum Pearson's correlation between the hormone-treatment transcriptional responses of the gene in question and those of specialized metabolism genes 16 | Expr_med_dev,The median expression level of the gene in question among expression data of different tissues and/or developmental stages 17 | Expr_max_dev,The maximum expression level of the gene in question among expression data of different tissues and/or 
developmental stages 18 | Expr_breadth_dev,The breadth of expression level of the gene in question among expression data of different tissues and/or developmental stages 19 | -------------------------------------------------------------------------------- /feature_list.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/feature_list.xlsx -------------------------------------------------------------------------------- /img/img_clone_repository.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/img/img_clone_repository.png -------------------------------------------------------------------------------- /img/img_what_ml_is.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/img/img_what_ml_is.png -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | # Machine learning workshop 2 | 3 | ![alt text](./img/img_what_ml_is.png) 4 | 5 | ## 1. Workshop information 6 | 7 | ### What this is about 8 | 9 | More and more experimental data are available that have fueled groundbreaking discoveries. Beyond the original intent of the experiments, these data can be used to discover even more. This is where machine learning (ML) comes in. We can use computers to learn from data and generate models that can predict a biological phenomenon of interest - e.g., whether knocking out a gene will be lethal, or which genetic variants can meaningfully predict a phenotype of interest. 
To help you learn more about ML, we created this workshop, which introduces the following topics: 10 | 11 | - What is machine learning and why is it useful? 12 | - How does machine learning work? 13 | - What are some example machine learning applications in biology? 14 | - How can we feed data into machine learning tools to make discoveries? 15 | - What are the best practices when doing machine learning? 16 | - Where to go to learn more? 17 | 18 | The workshop will include presentations, discussions, and hands-on sections for those who can complete the pre-workshop components without issue. 19 | 20 | ### Who we are 21 | 22 | * Our lab [website](https://shiulab.github.io/) 23 | 24 | ### What kinds of materials we are sharing 25 | 26 | - Workshop Jupyter notebooks 27 | - An example dataset to run an ML project based on [this paper](https://pubmed.ncbi.nlm.nih.gov/30674669/). 28 | 29 | ## 2. Instructions for running the notebook 30 | 31 | ### What's needed 32 | 33 | The workshop example is provided as a [Jupyter notebook](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html). It is a document generated by the [Jupyter Lab or Jupyter Notebook applications](https://jupyter.org/install.html). A notebook can contain both computer code in popular languages such as Python and R, and text in the form of paragraphs, equations, figures, links, etc. 34 | 35 | To follow what we have shown in the workshop, you need the following: 36 | * Git: for you to "clone" this repository to your computer to play with. 37 | * Jupyter Lab: the application to view, edit, and execute code in the notebook. 38 | * Scikit-Learn and others: the software packages Jupyter Lab relies on to run the code. 39 | 40 | ### Get some background 41 | 42 | Please make sure you: 43 | 1. 
Have a look at the [pre-workshop notebook through GitHub](https://github.com/ShiuLab/ML_workshop/blob/master/ML_workshop-part_a-preparation.ipynb) and take some notes on the questions asked. 44 | 2. Watch [this video](https://www.youtube.com/watch?v=cfj6yaYE86U), and [this video](https://www.youtube.com/watch?v=-_XYmr4vkwc) on getting Jupyter notebook to run. 45 | 46 | ### Get the notebook and data 47 | 48 | You can download the notebooks and data from this repository, preferably by setting the following up: 49 | - `git`, a version control software (i.e., a tool to keep track of updates to code) widely used by folks writing software in any language. 50 | - [GitHub](https://github.com/), a code hosting platform that uses `git` for version control and collaboration (i.e., many people can work on the same code). 51 | 52 | If you don't have git and/or a GitHub account, do the following: 53 | 1. Create a [GitHub Account](https://github.com/join) 54 | 2. Download and install [GitHub Desktop](https://desktop.github.com/) 55 | * Or you can use [Git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) if you are familiar with version control and the command-line interface. Note that the following info is for using GitHub Desktop. 56 | 3. Clone the ML_workshop repository by following [these instructions](https://docs.github.com/en/desktop/contributing-and-collaborating-using-github-desktop/cloning-and-forking-repositories-from-github-desktop) and the following screenshot. 57 | * __Note:__ You can specify where the repository goes on your computer. We suggest leaving it as the default and remembering where it is - we need it later. 58 | 59 | ![alt text](./img/img_clone_repository.png) 60 | 61 | 4. Navigate to the location where the cloned repository is and confirm that it is there. 
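
If you prefer to check the clone from Python rather than a file browser (once Python is installed, see the Anaconda section), the confirmation in step 4 could be sketched as below. This is only a minimal sketch under stated assumptions: `missing_files` and `EXPECTED_FILES` are hypothetical names made up for illustration, the file names come from the top level of this repository, and the `Documents/GitHub/ML_workshop` location is a guess that you should replace with the folder you chose when cloning.

```python
from pathlib import Path

# Files expected at the top level of the cloned repository
# (names taken from this repository's file listing).
EXPECTED_FILES = [
    "readme.md",
    "ML_workshop-part_a-preparation.ipynb",
    "ML_workshop-part_b-in_class.ipynb",
    "enzyme_gene.csv",
    "feature_list.csv",
]


def missing_files(repo_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not found in repo_dir."""
    repo = Path(repo_dir).expanduser()
    return [name for name in expected if not (repo / name).is_file()]


if __name__ == "__main__":
    # Hypothetical location: replace with the folder you chose when cloning.
    repo_dir = Path.home() / "Documents" / "GitHub" / "ML_workshop"
    missing = missing_files(repo_dir)
    if missing:
        print(f"Missing from {repo_dir}: {missing}")
    else:
        print(f"All expected files were found in {repo_dir}")
```

An empty list from `missing_files` suggests the clone is where you expect it; a non-empty list usually means the path is wrong rather than that the clone failed.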
 62 | 63 | ### Install Anaconda 64 | 65 | [Anaconda](https://www.anaconda.com/) is a free and open-source distribution of the programming languages Python and R and is a widely used platform for computational and data science applications. 66 | 67 | 1. Download the Python 3.X version of [Anaconda](https://www.anaconda.com/products/individual#Downloads). 68 | 2. Install Anaconda using the [instructions](https://docs.anaconda.com/anaconda/install/). 69 | 3. Open your terminal in [Mac](https://support.apple.com/guide/terminal/open-or-quit-terminal-apd5265185d-f365-44cb-8b09-71a064a42125/mac) or [PC](https://www.wikihow.com/Open-Terminal-in-Windows) 70 | * __Note__: For PC, you need to open the terminal by "Running as Administrator". If you are not familiar with this, see [this post](https://www.itechtics.com/run-programs-administrator/) for more info. 71 | 72 | 4. Issue the following command to make sure the Anaconda installation is complete: 73 | ``` 74 | conda list 75 | ``` 76 | The above command allows you to see what software packages have been installed. 77 | 78 | ### Install software packages 79 | 80 | [Conda](https://docs.conda.io/en/latest/) is a package/environment management system. It deals with installing software packages on your computer. It also creates and manages virtual environments, where each environment holds a specific set of software for a general category of tasks. 81 | 82 | 1. Create an `ml_workshop` environment and activate it: 83 | ``` 84 | conda create -n ml_workshop python 85 | ``` 86 | ``` 87 | conda activate ml_workshop 88 | ``` 89 | * __Note__: When prompted with `Proceed`, type `y`. 90 | 91 | 2. Install software packages and their dependencies: 92 | ``` 93 | conda install jupyterlab ipykernel ipywidgets matplotlib pandas scikit-learn seaborn shap tqdm 94 | ``` 95 | ``` 96 | pip install imbalanced-learn 97 | ``` 98 | 99 | ### Open the notebooks 100 | 101 | 1. 
Run Jupyter Lab 102 | 103 | * If you use Linux or macOS: 104 | 105 | ``` 106 | jupyter lab 107 | ``` 108 | * If you use a PC and your GitHub folder is on the __C:/__ drive, do: 109 | ``` 110 | jupyter lab --notebook-dir=C:/ 111 | ``` 112 | * If you use a PC and your GitHub folder is on the __D:/__ drive, do: 113 | ``` 114 | jupyter lab --notebook-dir=D:/ 115 | ``` 116 | 117 | 2. In the Jupyter Lab window that opens, on the left panel, navigate to __ML_workshop__, the directory where the cloned GitHub repository is stored. 118 | 119 | 3. Open `ML_workshop-part_a-preparation.ipynb` 120 | 121 | 4. Run each code cell by pressing ```SHIFT + ENTER```. 122 | 123 | -------------------------------------------------------------------------------- /workshop_outline.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ShiuLab/ML_workshop/fba900d44807f76d16d86c3c5b9a2172bbafa22b/workshop_outline.xlsx --------------------------------------------------------------------------------