├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── examples ├── .ipynb_checkpoints │ └── featurewiz_classification-checkpoint.ipynb ├── Best_Pipeline_Featurewiz.ipynb ├── FeatureWiz_Interaction_Target_Feature_Engineering_Example.ipynb ├── FeatureWiz_Test.ipynb ├── Featurewiz_Medium_Blogpost.ipynb ├── Featurewiz_on_2000_variables.ipynb ├── affairs_multiclass.csv ├── boston.csv ├── car_sales.csv ├── cross_validate.py ├── featurewiz_autoencoders_demo.ipynb ├── featurewiz_classification.ipynb ├── featurewiz_regression_multi_target.ipynb ├── heart.csv └── winequality.csv ├── featurewiz ├── __init__.py ├── __version__.py ├── auto_encoders.py ├── blagging.py ├── classify_method.py ├── databunch.py ├── encoders.py ├── featurewiz.py ├── ml_models.py ├── my_encoders.py ├── settings.py ├── stacking_models.py └── sulov_method.py ├── images ├── MRMR.png ├── SULOV.jpg ├── feather_example.jpg ├── feature_engg.png ├── feature_engg_old.jpg ├── featurewiz_background.jpg ├── featurewiz_logo.jpg ├── featurewiz_logos.png ├── featurewiz_logos_old.png ├── featurewiz_mrmr.png └── xgboost.jpg ├── old_README.md ├── requirements.txt ├── setup.py └── updates.md /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | We as members, contributors, and leaders pledge to make participation in our 6 | community a harassment-free experience for everyone, regardless of age, body 7 | size, visible or invisible disability, ethnicity, sex characteristics, gender 8 | identity and expression, level of experience, education, socio-economic status, 9 | nationality, personal appearance, race, caste, color, religion, or sexual 10 | identity and orientation. 11 | 12 | We pledge to act and interact in ways that contribute to an open, welcoming, 13 | diverse, inclusive, and healthy community. 14 | 15 | ## Our Standards 16 | 17 | Examples of behavior that contributes to a positive environment for our 18 | community include: 19 | 20 | * Demonstrating empathy and kindness toward other people 21 | * Being respectful of differing opinions, viewpoints, and experiences 22 | * Giving and gracefully accepting constructive feedback 23 | * Accepting responsibility, apologizing to those affected by our mistakes, 24 | and learning from the experience 25 | * Focusing on what is best not just for us as individuals, but for the overall 26 | community 27 | 28 | Examples of unacceptable behavior include: 29 | 30 | * The use of sexualized language or imagery, and sexual attention or advances of 31 | any kind 32 | * Trolling, insulting or derogatory comments, and personal or political attacks 33 | * Public or private harassment 34 | * Publishing others' private information, such as a physical or email address, 35 | without their explicit permission 36 | * Other conduct that could reasonably be considered inappropriate in a 37 | professional setting 38 | 39 | ## Enforcement Responsibilities 40 | 41 | Community leaders are responsible for clarifying and enforcing our standards of 42 | acceptable behavior and will take appropriate and fair corrective action in 43 | response to any behavior that they deem inappropriate, threatening, offensive, 44 | or harmful. 
45 | 46 | Community leaders have the right and responsibility to remove, edit, or reject 47 | comments, commits, code, wiki edits, issues, and other contributions that are 48 | not aligned to this Code of Conduct, and will communicate reasons for moderation 49 | decisions when appropriate. 50 | 51 | ## Scope 52 | 53 | This Code of Conduct applies within all community spaces and also applies when 54 | an individual is officially representing the community in public spaces. 55 | Examples of representing our community include using an official e-mail address, 56 | posting via an official social media account, or acting as an appointed 57 | representative at an online or offline event. 58 | 59 | ## Enforcement 60 | 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 62 | reported to the community leaders responsible for enforcement. 63 | All complaints will be reviewed and investigated promptly and fairly. 64 | 65 | All community leaders are obligated to respect the privacy and security of the 66 | reporter of any incident. 67 | 68 | ## Enforcement Guidelines 69 | 70 | Community leaders will follow these Community Impact Guidelines in determining 71 | the consequences for any action they deem in violation of this Code of Conduct: 72 | 73 | ### 1. Correction 74 | 75 | **Community Impact**: Use of inappropriate language or other behavior deemed 76 | unprofessional or unwelcome in the community. 77 | 78 | **Consequence**: A private, written warning from community leaders, providing 79 | clarity around the nature of the violation and an explanation of why the 80 | behavior was inappropriate. A public apology may be requested. 81 | 82 | ### 2. Warning 83 | 84 | **Community Impact**: A violation through a single incident or series of 85 | actions. 86 | 87 | **Consequence**: A warning with consequences for continued behavior. No 88 | interaction with the people involved, including unsolicited interaction with 89 | those enforcing the Code of Conduct, for a specified period of time. This 90 | includes avoiding interactions in community spaces as well as external channels 91 | like social media. Violating these terms may lead to a temporary or permanent 92 | ban. 93 | 94 | ### 3. Temporary Ban 95 | 96 | **Community Impact**: A serious violation of community standards, including 97 | sustained inappropriate behavior. 98 | 99 | **Consequence**: A temporary ban from any sort of interaction or public 100 | communication with the community for a specified period of time. No public or 101 | private interaction with the people involved, including unsolicited interaction 102 | with those enforcing the Code of Conduct, is allowed during this period. 103 | Violating these terms may lead to a permanent ban. 104 | 105 | ### 4. Permanent Ban 106 | 107 | **Community Impact**: Demonstrating a pattern of violation of community 108 | standards, including sustained inappropriate behavior, harassment of an 109 | individual, or aggression toward or disparagement of classes of individuals. 110 | 111 | **Consequence**: A permanent ban from any sort of public interaction within the 112 | community. 113 | 114 | ## Attribution 115 | 116 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 117 | version 2.1, available at 118 | [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1]. 119 | 120 | Community Impact Guidelines were inspired by 121 | [Mozilla's code of conduct enforcement ladder][Mozilla CoC]. 
122 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | 3 | We welcome contributions from anyone beginner or advanced. Please before working on some features: 4 | 5 | * Search through the past issues, your concern may have been raised by others in the past. Check through the 6 | closed issues as well. 7 | * If there is no open issue for your feature request, please open one up to coordinate all collaborators. 8 | * Write your feature. 9 | * Submit a pull request on this repo with: 10 | * A brief description 11 | * **Detail of the expected change(s) in behavior** 12 | * How to test it (if it's not obvious) 13 | 14 | Ask someone to test it. 15 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 
47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # featurewiz 2 | 🔥 FeatureWiz, the ultimate feature selection library is powered by the renowned Minimum Redundancy Maximum Relevance (MRMR) algorithm. Learn more about it below. 3 | 4 | ![banner](images/featurewiz_logos.png) 5 | 6 | # Table of Contents 7 | 23 | 24 | ## Latest Update (Jan 2025) 25 |
    26 |
  1. featurewiz is now upgraded to version 0.6. 27 | This version and above run on Python 3.12 or greater and on pandas 2.0. 28 | - This is a huge upgrade for those working in Colab, Kaggle and other up-to-date kernels. 29 | - Please check the `requirements.txt` file to know which package versions are recommended.
  30 |
31 | 32 | ## Citation 33 | If you use featurewiz in your research project or paper, please use the following format for citations:

34 | "Seshadri, Ram (2020). GitHub - AutoViML/featurewiz: Use advanced feature engineering strategies and select the best features from your data set fast with a single line of code. source code: https://github.com/AutoViML/featurewiz"

35 | Current citations for featurewiz 36 | 37 | [Google Scholar citations for featurewiz](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C31&q=featurewiz&btnG=) 38 | 39 | ## Highlights 40 | `featurewiz` is the best feature selection library for boosting your machine learning performance with minimal effort and maximum relevance using the famous MRMR algorithm. 41 | 42 | ### What Makes FeatureWiz Stand Out? 🔍 43 | ✔️ Automatically select the most relevant features without specifying a number 44 | 🚀 Provides the fastest and best implementation of the MRMR algorithm 45 | 🎯 Provides a built-in transformer (lazytransform library) that converts all features to numeric 46 | 📚 Includes deep learning models such as Variational Auto Encoders to capture complex interactions in your data 47 | 📝 Provides feature engineering in addition to feature selection - all with one single API call! 48 | 49 | ### Simple tips for success using featurewiz 💡 50 | 📈 First create additional features using the feature engg module 51 | 🌐 Compare featurewiz against other feature selection methods for best performance 52 | ⚖️ Avoid overfitting by cross-validating your results as shown here 53 | 🎯 Try adding auto-encoders for additional features that may help boost performance 54 | 55 | ### Feature Engineering 56 | Create new features effortlessly with a single line of code! featurewiz enables you to generate hundreds of interaction, group-by, target-encoded features and higher order features, eliminating the need for expert-level knowledge to create your own features. Now you can create even deep learning based features such as Variational Auto Encoders to capture complex interactions hidden among your features. See the latest page for more information on this amazing feature. 57 | 58 | ### What is MRMR? 59 | featurewiz provides one of the best automatic feature selection algorithms, MRMR, as described by wikipedia in this page as follows: "The MRMR feature selection algorithm has been found to be more powerful than other feature selection algorithms such as Boruta". 60 | 61 | In addition, other researchers have compared MRMR against multiple feature selection algorithms and found MRMR to be the best. 62 | 63 | ![feature_mrmr](images/featurewiz_mrmr.png) 64 | 65 | ### How does MRMR feature selection work?🔍 66 | After creating new features, featurewiz uses the MRMR algorithm to answer crucial questions: Which features are important? Are they redundant or mutually-correlated? Will your model suffer from or benefit from adding all features? To answer these questions, featurewiz uses two crucial steps in MRMR: 67 | 68 | ⚙️ The SULOV Algorithm: SULOV means "Searching for Uncorrelated List of Variables". It is a fast algorithm that removes mutually correlated features so that you're left with only the most non-redundant (un-correlated) features. It uses the Mutual Information Score to accomplish this feat. 69 | 70 | ⚙️ Recursive XGBoost: Second, featurewiz uses XGBoost's feature importance scores by selecting smaller and smaller feature sets repeatedly to identify the most relevant features for your task among all the variables remaining after SULOV algorithm. 
 71 | 72 | ### Advanced Feature Engineering Options 73 | 74 | featurewiz extends traditional feature selection to the realm of deep learning using Auto Encoders, including Denoising Auto Encoders (DAEs), Variational Auto Encoders (VAEs), CNNs (Convolutional Neural Networks) and GANs (Generative Adversarial Networks) for additional feature extraction, especially on imbalanced datasets. Just set the `auto_encoders` option to 'VAE_ADD' or 'DAE_ADD' to create these additional features (see the example below). 75 | 76 | VAE-model-flowchart 77 | 78 | In addition, we include: 79 |
  • A variety of category encoders like HashingEncoder, SumEncoder, PolynomialEncoder, BackwardDifferenceEncoder, OneHotEncoder, HelmertEncoder, OrdinalEncoder, and BaseNEncoder.
  80 |
  • The ability to add interaction features (e.g., x1*x2, x2*x3, x1^2), polynomial features (X**2, X**3), group-by features, and target-encoding features.
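For instance, the snippet below sketches how the auto-encoder option fits into the usual fit/transform flow. It mirrors the constructor shown in `examples/featurewiz_autoencoders_demo.ipynb`; `X_train`, `y_train` and `X_test` are placeholders for your own pandas data, and TensorFlow/Keras must be installed for the auto-encoder options.

```
from featurewiz import FeatureWiz

# 'VAE_ADD' keeps your original features and adds VAE-extracted ones on top
fwiz = FeatureWiz(feature_engg='', auto_encoders='VAE_ADD', ae_options={},
                  category_encoders='auto', transform_target=True,
                  add_missing=False, imbalanced=False, verbose=0)
X_train_aug, y_train_enc = fwiz.fit_transform(X_train, y_train)  # original + auto-encoder features
X_test_aug = fwiz.transform(X_test)                              # same transformation applied to test
```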
  81 | 82 | ### Examples and Updates 83 | - featurewiz is well-documented, and it comes with a number of examples 84 | - featurewiz is actively maintained, and it is regularly updated with new features and bug fixes 85 | 86 | ## Workings 87 | `featurewiz` has two major modules to transform your Data Science workflow:

    88 | 1. Feature Engineering Module 89 | 90 | ![old_feature_engg](images/feature_engg_old.jpg) 91 | 92 |

  • Advanced Feature Creation: use Deep Learning based Auto Encoders and GAN's to extract features to add to your data. These powerful capabilities will help you in solving your toughest problems.
  93 |
  • Options for Enhancement: Use "interactions", "groupby", or "target" flags to enable advanced feature engineering techniques.
  94 |
  • Kaggle-Ready: Designed to meet the high standards of feature engineering required in competitive data science, like Kaggle.
  95 |
  • Efficient and User-Friendly: Generate and sift through thousands of features, selecting only the most impactful ones for your model.

  96 | 97 | ![feature_engg](images/feature_engg.png) 98 | 99 | 2. Feature Selection Module 100 |
  • MRMR Algorithm: Employs Minimum Redundancy Maximum Relevance (MRMR) for effective feature selection.
  101 |
  • SULOV Method: Stands for 'Searching for Uncorrelated List of Variables', ensuring low redundancy and high relevance in feature selection.
  102 |
  • Addressing Key Questions: Helps interpret new features, assess their importance, and evaluate the model's performance with these features.
  103 |
  • Optimal Feature Subset: Uses Recursive XGBoost in combination with SULOV to identify the most critical features, reducing overfitting and improving model interpretability.
  104 | 105 | #### Comparing featurewiz to Boruta: 106 | Featurewiz uses what is known as a `Minimal Optimal` algorithm such as MRMR, while Boruta uses an `All-Relevant` approach. To understand how featurewiz's MRMR approach differs from Boruta's 'All-Relevant' approach, study the chart below. It shows how the SULOV algorithm performs MRMR feature selection, which yields a smaller feature set than the larger one Boruta produces. 107 | 108 | One of the weaknesses of Boruta is that it retains redundant (highly correlated) features, which can hamper model performance, while featurewiz does not. 109 | 110 | ![Learn More About MRMR](images/MRMR.png) 111 | 112 | Transform your feature engineering and selection process with featurewiz - the tool that brings expert-level capabilities to your fingertips! 113 | 114 | ## Working 115 | `featurewiz` performs feature selection in 2 steps. Each step is explained below. 116 | The working of the `SULOV` algorithm is as follows: 117 |
      118 |
    1. Find all the pairs of highly correlated variables exceeding a correlation threshold (say absolute(0.7)).
    119 |
    2. Then find their MIS score (Mutual Information Score) to the target variable. MIS is a non-parametric scoring method, so it is suitable for all kinds of variables and targets.
    120 |
    3. Now take each pair of correlated variables (with a Pearson coefficient higher than the threshold above) and eliminate the feature with the lower MIS score from the pair. Do this repeatedly with each pair until no feature pair is left to analyze.
    121 |
    4. What's left after this step are the features with the highest Information Scores and the least Pearson correlation with each other.
    122 |
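To make these steps concrete, here is a simplified, illustrative sketch of the SULOV idea in plain pandas and scikit-learn. It is not featurewiz's actual implementation; it assumes a purely numeric DataFrame `X` with no missing values and a classification target `y` (swap in `mutual_info_regression` for regression targets).

```
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def sulov_sketch(X: pd.DataFrame, y, corr_threshold: float = 0.7):
    # Steps 1-2: mutual information of every feature with the target
    mis = dict(zip(X.columns, mutual_info_classif(X, y, random_state=0)))
    corr = X.corr().abs()
    removed = set()
    # Step 3: for every highly correlated pair, drop the feature with the lower MIS score
    for i, f1 in enumerate(X.columns):
        for f2 in X.columns[i + 1:]:
            if corr.loc[f1, f2] > corr_threshold:
                removed.add(f1 if mis[f1] < mis[f2] else f2)
    # Step 4: what remains is high-information and low-redundancy
    return [f for f in X.columns if f not in removed]
```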
    123 | 124 | ![sulov](images/SULOV.jpg) 125 | 126 | The working of the Recursive XGBoost is as follows: 127 | Once SULOV has selected variables that have high mutual information scores with the least correlation among them, featurewiz uses XGBoost to repeatedly find the best features among the remaining variables after SULOV. 128 |
      129 |
    1. Select all the variables in the data set and split the full data into train and valid sets.
    130 |
    2. Find the top X features (X could be 10) on the train set, using the valid set for early stopping (to prevent over-fitting).
    131 |
    3. Then take the next set of variables and find the top X among them.
    132 |
    4. Do this 5 times. Combine all selected features and de-duplicate them.
    133 |
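The sketch below illustrates this loop with plain XGBoost: score each chunk of the remaining features by importance and keep the top X from each pass. It is only an approximation of featurewiz's routine (which also uses a validation split with early stopping); it assumes a numeric DataFrame `X` and an already label-encoded target `y`.

```
import pandas as pd
from xgboost import XGBClassifier  # use XGBRegressor for regression targets

def recursive_xgboost_sketch(X: pd.DataFrame, y, top_x: int = 10, rounds: int = 5):
    selected, cols = [], list(X.columns)
    chunk = max(1, -(-len(cols) // rounds))  # ceiling division so every feature is visited
    for i in range(rounds):
        subset = cols[i * chunk:(i + 1) * chunk]
        if not subset:
            break
        model = XGBClassifier(n_estimators=100, verbosity=0).fit(X[subset], y)
        top = pd.Series(model.feature_importances_, index=subset).nlargest(top_x)
        selected.extend(top.index)
    return list(dict.fromkeys(selected))  # combine all selected features and de-duplicate
```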
    134 | 135 | ![xgboost](images/xgboost.jpg) 136 | 137 | ## Tips 138 | Here are some additional tips for ML engineers and data scientists when using featurewiz: 139 |
      140 |
    1. How to cross-validate your results: When you use featurewiz, we automatically perform multiple rounds of feature selection using permutations on the number of columns. However, you can also perform feature selection using permutations of rows, as shown in `examples/cross_validate.py`. 141 |
    2. Use multiple feature selection tools: It is a good idea to use multiple feature selection tools and compare the results. This will help you to get a better understanding of which features are most important for your data.
    142 |
    3. Don't forget to use Auto Encoders!: Autoencoders are like skilled artists who can draw a quick sketch of a complex picture. They learn to capture the essence of the data and then recreate it with as few strokes as possible. This process helps in understanding and compressing data efficiently.
    143 |
    4. Don't overfit your model: It is important to avoid overfitting your model to the training data. Overfitting occurs when your model learns the noise in the training data, rather than the underlying signal. To avoid overfitting, you can use regularization techniques, such as lasso or elasticnet.
    144 |
    5. Start with a small number of features: When you are first starting out, it is a good idea to start with a small number of features. This will help you to avoid overfitting your model. As you become more experienced, you can experiment with adding more features.
    145 |
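As a quick illustration of tip 2, the sketch below compares featurewiz's selection with scikit-learn's `SelectKBest` and prints the overlap. It assumes an already-fitted `FeatureWiz` object named `fwiz` (so `fwiz.features` is available) and numeric `X_train`/`y_train`; the choice of `k=15` is arbitrary.

```
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# A second, independent selector to compare against featurewiz's choice
skb = SelectKBest(score_func=mutual_info_classif, k=min(15, X_train.shape[1]))
skb.fit(X_train, y_train)
skb_features = list(X_train.columns[skb.get_support()])

overlap = set(fwiz.features) & set(skb_features)
print('featurewiz picked', len(fwiz.features), '| SelectKBest picked', len(skb_features),
      '| overlap:', sorted(overlap))
```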
    146 | 147 | ## Install 148 | 149 | **Prerequisites:** 150 |
      151 |
    1. featurewiz is built using xgboost, dask, numpy, pandas and matplotlib. It should run on most Python 3 Anaconda installations. You won't have to import any special libraries other than "dask", "XGBoost" and "networkx". Optionally, it uses LightGBM for fast modeling, which it installs automatically.
    151 |
    2. We use the "networkx" library for charts and interpretability.
      But if you don't have these libraries, featurewiz will install them for you automatically.
    152 |
    154 | 155 | In Kaggle notebooks, you need to install featurewiz like this (otherwise there will be errors): 156 | ``` 157 | !pip install featurewiz 158 | !pip install Pillow==9.0.0 159 | !pip install xlrd --ignore-installed --no-deps 160 | !pip install "executing>0.10.0" 161 | ``` 162 | 163 | To install from source: 164 | 165 | ``` 166 | cd <any_directory_you_choose> 167 | git clone git@github.com:AutoViML/featurewiz.git 168 | # or download and unzip https://github.com/AutoViML/featurewiz/archive/master.zip 169 | conda create -n <your_env_name> python=3.7 anaconda 170 | conda activate <your_env_name> # ON WINDOWS: `source activate <your_env_name>` 171 | cd featurewiz 172 | pip install -r requirements.txt 173 | ``` 174 | 175 | ## Good News: You can install featurewiz on Colab and Kaggle easily in 2 steps! 176 | As of June 2022, thanks to [arturdaraujo](https://github.com/arturdaraujo), featurewiz is now available on conda-forge. You can try:
    177 | 178 | ``` 179 | conda install -c conda-forge featurewiz 180 | ``` 181 | 182 | ### If the above conda install fails, you can try installing featurewiz this way: 183 | #### Install featurewiz using git+
    184 | 185 | ``` 186 | !pip install git+https://github.com/AutoViML/featurewiz.git 187 | ``` 188 | 189 | ## Usage 190 | 191 | There are two ways to use featurewiz. 192 |
      193 |
    1. The first way is the new way, where you use scikit-learn's `fit and transform` syntax. It also includes the `lazytransformer` library that I created to transform datetime, NLP and categorical variables into numeric variables automatically. We recommend that you use it as the main syntax for all your future needs.
    2. 194 | 195 | ``` 196 | from featurewiz import FeatureWiz 197 | fwiz = FeatureWiz(feature_engg = '', nrows=None, transform_target=True, scalers="std", 198 | category_encoders="auto", add_missing=False, verbose=0, imbalanced=False, 199 | ae_options={}) 200 | X_train_selected, y_train = fwiz.fit_transform(X_train, y_train) 201 | X_test_selected = fwiz.transform(X_test) 202 | ### get list of selected features ### 203 | fwiz.features 204 | ``` 205 | 206 |
    3. The second way is the old way: this was the original syntax of featurewiz, and it is still used by thousands of researchers in the field, so it will continue to be maintained. However, it may be discontinued at any time without notice. You can use it if you prefer it.
    4. 207 | 208 | ``` 209 | import featurewiz as fwiz 210 | outputs = fwiz.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose=2, sep=',', 211 | header=0, test_data='',feature_engg='', category_encoders='', 212 | dask_xgboost_flag=False, nrows=None, skip_sulov=False, skip_xgboost=False) 213 | ``` 214 | 215 | `outputs` is a tuple: There will always be two objects in output. It can vary: 216 | - In the first case, it can be `features` and `trainm`: features is a list (of selected features) and trainm is the transformed dataframe (if you sent in train only) 217 | - In the second case, it can be `trainm` and `testm`: It can be two transformed dataframes when you send in both test and train but with selected features. 218 | 219 | In both cases, the features and dataframes are ready for you to do further modeling. 220 | 221 | Featurewiz works on any multi-class, multi-label data Set. So you can have as many target labels as you want. 222 | You don't have to tell Featurewiz whether it is a Regression or Classification problem. It will decide that automatically. 223 | 224 | ## API 225 | 226 | **Input Arguments for NEW syntax** 227 | 228 | Parameters 229 | ---------- 230 | corr_limit : float, default=0.90 231 | The correlation limit to consider for feature selection. Features with correlations 232 | above this limit may be excluded. 233 | 234 | verbose : int, default=0 235 | Level of verbosity in output messages. 236 | 237 | feature_engg : str or list, default='' 238 | Specifies the feature engineering methods to apply, such as 'interactions', 'groupby', 239 | and 'target'. 240 | 241 | auto_encoders : str or list, default='' 242 | Five new options have been added recently to `auto_encoders` (starting in version 0.5.0): `DAE`, `VAE`, `DAE_ADD`, `VAE_ADD`, `CNN`, `CNN_ADD` and `GAN`. These are deep learning auto encoders (using tensorflow and keras) that can extract the most important patterns in your data and either replace your features or add them as extra features to your data. Try them for your toughest ML problems! See the notebooks folder for examples. 243 | 244 | ae_options : dict, default={} 245 | You can provide a dictionary for tuning auto encoders above. Supported auto encoders include 'dae', 246 | 'vae', and 'gan'. You must use the `help` function to see how to send a dict to each auto encoder. You can also check out this Auto Encoder demo notebook 247 | 248 | category_encoders : str or list, default='' 249 | Encoders for handling categorical variables. Supported encoders include 'onehot', 250 | 'ordinal', 'hashing', 'count', 'catboost', 'target', 'glm', 'sum', 'woe', 'bdc', 251 | 'loo', 'base', 'james', 'helmert', 'label', 'auto', etc. 252 | 253 | add_missing : bool, default=False 254 | If True, adds indicators for missing values in the dataset. 255 | 256 | dask_xgboost_flag : bool, default=False 257 | If set to True, enables the use of Dask for parallel computing with XGBoost. 258 | 259 | nrows : int or None, default=None 260 | Limits the number of rows to process. 261 | 262 | skip_sulov : bool, default=False 263 | If True, skips the application of the Super Learning Optimized (SULO) method in 264 | feature selection. 265 | 266 | skip_xgboost : bool, default=False 267 | If True, bypasses the recursive XGBoost feature selection. 268 | 269 | transform_target : bool, default=False 270 | When True, transforms the target variable(s) into numeric format if they are not 271 | already. 
272 | 273 | scalers : str or None, default=None 274 | Specifies the scaler to use for feature scaling. Available options include 275 | 'std', 'standard', 'minmax', 'max', 'robust', 'maxabs'. 276 | 277 | imbalanced : True or False, default=False 278 | Specifies whether to use SMOTE technique for imbalanced datasets. 279 | 280 | **Input Arguments for old syntax** 281 | 282 | - `dataname`: could be a datapath+filename or a dataframe. It will detect whether your input is a filename or a dataframe and load it automatically. 283 | - `target`: name of the target variable in the data set. 284 | - `corr_limit`: if you want to set your own threshold for removing variables as highly correlated, then give it here. The default is 0.9 which means variables less than -0.9 and greater than 0.9 in pearson's correlation will be candidates for removal. 285 | - `verbose`: This has 3 possible states: 286 | - `0` - limited output. Great for running this silently and getting fast results. 287 | - `1` - verbose. Great for knowing how results were and making changes to flags in input. 288 | - `2` - more charts such as SULOV and output. Great for finding out what happens under the hood for SULOV method. 289 | - `test_data`: This is only applicable to the old syntax if you want to transform both train and test data at the same time in the same way. `test_data` could be the name of a datapath+filename or a dataframe. featurewiz will detect whether your input is a filename or a dataframe and load it automatically. Default is empty string. 290 | - `dask_xgboost_flag`: default False. If you want to use dask with your data, then set this to True. 291 | - `feature_engg`: You can let featurewiz select its best encoders for your data set by setting this flag 292 | for adding feature engineering. There are three choices. You can choose one, two, or all three. 293 | - `interactions`: This will add interaction features to your data such as x1*x2, x2*x3, x1**2, x2**2, etc. 294 | - `groupby`: This will generate Group By features to your numeric vars by grouping all categorical vars. 295 | - `target`: This will encode and transform all your categorical features using certain target encoders.
      296 | Default is empty string (which means no additional features) 297 | - `add_missing`: default is False. This is a new flag: the `add_missing` flag will add a new column for missing values for all your variables in your dataset. This will help you catch missing values as an added signal. 298 | - `category_encoders`: default is "auto". Instead, you can choose your own category encoders from the list below. 299 | We recommend you do not use more than two of these. Featurewiz will automatically select only two if you have more than two in your list. You can set "auto" for our own choice or the empty string "" (which means no encoding of your categorical features)
      These descriptions are derived from the excellent category_encoders python library. Please check it out! 300 | - `HashingEncoder`: HashingEncoder is a multivariate hashing implementation with configurable dimensionality/precision. The advantage of this encoder is that it does not maintain a dictionary of observed categories. Consequently, the encoder does not grow in size and accepts new values during data scoring by design. 301 | - `SumEncoder`: SumEncoder is a Sum contrast coding for the encoding of categorical features. 302 | - `PolynomialEncoder`: PolynomialEncoder is a Polynomial contrast coding for the encoding of categorical features. 303 | - `BackwardDifferenceEncoder`: BackwardDifferenceEncoder is a Backward difference contrast coding for encoding categorical variables. 304 | - `OneHotEncoder`: OneHotEncoder is the traditional Onehot (or dummy) coding for categorical features. It produces one feature per category, each being a binary. 305 | - `HelmertEncoder`: HelmertEncoder uses the Helmert contrast coding for encoding categorical features. 306 | - `OrdinalEncoder`: OrdinalEncoder uses Ordinal encoding to designate a single column of integers to represent the categories in your data. Integers however start in the same order in which the categories are found in your dataset. If you want to change the order, just sort the column and send it in for encoding. 307 | - `FrequencyEncoder`: FrequencyEncoder is a count encoding technique for categorical features. For a given categorical feature, it replaces the names of the categories with the group counts of each category. 308 | - `BaseNEncoder`: BaseNEncoder encodes the categories into arrays of their base-N representation. A base of 1 is equivalent to one-hot encoding (not really base-1, but useful), and a base of 2 is equivalent to binary encoding. N=number of actual categories is equivalent to vanilla ordinal encoding. 309 | - `TargetEncoder`: TargetEncoder performs Target encoding for categorical features. It supports the following kinds of targets: binary and continuous. For multi-class targets, it uses a PolynomialWrapper. 310 | - `CatBoostEncoder`: CatBoostEncoder performs CatBoost coding for categorical features. It supports the following kinds of targets: binary and continuous. For polynomial target support, it uses a PolynomialWrapper. This is very similar to leave-one-out encoding, but calculates the values “on-the-fly”. Consequently, the values naturally vary during the training phase and it is not necessary to add random noise. 311 | - `WOEEncoder`: WOEEncoder uses the Weight of Evidence technique for categorical features. It supports only one kind of target: binary. For polynomial target support, it uses a PolynomialWrapper. It cannot be used for Regression. 312 | - `JamesSteinEncoder`: JamesSteinEncoder uses the James-Stein estimator. It supports 2 kinds of targets: binary and continuous. For polynomial target support, it uses PolynomialWrapper. 313 | For feature value i, James-Stein estimator returns a weighted average of: 314 | The mean target value for the observed feature value i. 315 | The mean target value (regardless of the feature value). 316 | - `nrows`: default `None`. You can set the number of rows to read from your datafile if it is too large to fit into either dask or pandas. But you won't have to if you use dask. 317 | - `skip_sulov`: default `False`. You can set the flag to skip the SULOV method if you want. 318 | - `skip_xgboost`: default `False`. 
You can set the flag to skip the Recursive XGBoost method if you want. 319 | 320 | **Output values for old syntax** This applies only to the old syntax. 321 | - `outputs`: Output is always a tuple. We can call our outputs in that tuple as `out1` and `out2` below. 322 | - `out1` and `out2`: If you sent in just one dataframe or filename as input, you will get: 323 | - 1. `features`: It will be a list (of selected features) and 324 | - 2. `trainm`: It will be a dataframe (if you sent in a file or dataname as input) 325 | - `out1` and `out2`: If you sent in two files or dataframes (train and test), you will get: 326 | - 1. `trainm`: a modified train dataframe with engineered and selected features from dataname and 327 | - 2. `testm`: a modified test dataframe with engineered and selected features from test_data. 328 | 329 | ## Additional 330 | To learn more about how featurewiz works under the hood, watch this [video](https://www.youtube.com/embed/ZiNutwPcAU0) 331 | 332 | ![background](images/featurewiz_background.jpg) 333 | 334 | featurewiz was designed for selecting High Performance variables with the fewest steps. 335 | In most cases, featurewiz builds models with 20%-99% fewer features than your original data set with nearly the same or slightly lower performance (this is based on my trials. Your experience may vary).
      336 | 337 | featurewiz is every Data Scientist's feature wizard that will:
        338 |
      1. Automatically pre-process data: you can send in your entire dataframe "as is" and featurewiz will classify and label-encode categorical variables to help XGBoost process them. It automatically classifies variables as numeric, categorical, NLP or date-time so that it can use them correctly for modeling.
        339 |
      2. Perform feature engineering automatically: creating "interaction" variables, adding "group-by" features, or "target-encoding" categorical variables is difficult, and sifting through those hundreds of new features is painstaking work usually left to "experts". Now, with featurewiz, you can use deep learning to extract features with the click of a mouse. This is very helpful when you have imbalanced classes or thousands of features to deal with. However, be careful with this option: you can very easily spend a lot of time tuning these neural networks. 340 |
      3. Perform feature reduction automatically. When you have small data sets and you know your domain well, it is easy to perhaps do EDA and identify which variables are important. But when you have a very large data set with hundreds if not thousands of variables, selecting the best features from your model can mean the difference between a bloated and highly complex model or a simple model with the fewest and most information-rich features. featurewiz uses XGBoost repeatedly to perform feature selection. You must try it on your large data sets and compare!
        341 |
      4. Explain the SULOV method graphically using the networkx library, so you can automatically see which variables are highly correlated with which others, and which of those have high or low mutual information scores. Just set verbose = 2 to see the graph.
        342 |
      5. Build a fast XGBoost or LightGBM model using the features selected by featurewiz. There is a function called "simple_lightgbm_model" which you can use to build a fast model. It is a new module, so check it out.
        343 |
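As a follow-up to point 5, here is a generic baseline sketch on featurewiz-selected features. It does not call the `simple_lightgbm_model` helper (whose exact signature is not documented here); it only assumes `lightgbm` is installed and that `X_train_selected`, `X_test_selected`, `y_train` and `y_test` come from the FeatureWiz workflow shown in the Usage section.

```
from lightgbm import LGBMClassifier  # or LGBMRegressor for regression problems

# Fit a quick baseline on only the selected features and score it on the holdout set
model = LGBMClassifier(n_estimators=200, random_state=0)
model.fit(X_train_selected, y_train)
print('Holdout accuracy:', model.score(X_test_selected, y_test))
```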
      344 | 345 | *** Special thanks to fellow open source Contributors ***:
      346 |
        347 |
      1. Alex Lekov (https://github.com/Alex-Lekov/AutoML_Alex/tree/master/automl_alex) for his DataBunch and encoders modules which are used by the tool (although with some modifications).
      348 |
      2. Category Encoders library in Python: This is an amazing library. Make sure you read all about the encoders that featurewiz uses here: https://contrib.scikit-learn.org/category_encoders/index.html
      349 |
      350 | 351 | ## Maintainers 352 | 353 | * [@AutoViML](https://github.com/AutoViML) 354 | 355 | ## Contributing 356 | 357 | See [the contributing file](CONTRIBUTING.md)! 358 | 359 | PRs accepted. 360 | 361 | ## License 362 | 363 | Apache License 2.0 © 2020 Ram Seshadri 364 | 365 | ## DISCLAIMER 366 | This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose. 367 | 368 | 369 | [page]: examples/cross_validate.py 370 | -------------------------------------------------------------------------------- /examples/cross_validate.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from sklearn.model_selection import train_test_split 3 | from sklearn.linear_model import LogisticRegression 4 | from featurewiz import FeatureWiz 5 | 6 | # Load the dataset into a pandas dataframe 7 | df = pd.read_csv(trainfile, sep=sep) 8 | 9 | # Define your target variable 10 | target = target 11 | 12 | # Split the data into training and testing sets 13 | X_train, X_test, y_train, y_test = train_test_split(df.drop(target, axis=1), df[target], test_size=0.2, random_state=42) 14 | 15 | # Define the number of rounds 16 | num_rounds = 3 17 | 18 | # Perform multiple rounds of feature selection using rows 19 | selected_features = [] 20 | for i in range(num_rounds): 21 | # Split the training set into a new training set and a validation set 22 | X_new_train, X_val, y_new_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=i) 23 | 24 | # Use Featurewiz to select the best features on the new training set 25 | fwiz = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', 26 | dask_xgboost_flag=False, nrows=None, verbose=0) 27 | X_new_train_selected = fwiz.fit_transform(X_new_train, y_new_train) 28 | X_new_val_selected = fwiz.transform(X_val) 29 | 30 | # Evaluate the performance of the model on the validation set with the selected features 31 | model = LogisticRegression() 32 | model.fit(X_new_train_selected, y_new_train) 33 | accuracy = model.score(X_new_val_selected, y_val) 34 | 35 | # Print the accuracy of the model on the validation set 36 | print(f'Round {i+1}: Validation accuracy is {accuracy:.2f}.') 37 | 38 | # Get the selected features from Featurewiz and add them to a list 39 | selected_features.append(fwiz.features) 40 | fwiz_all = fwiz.lazy 41 | ### this saves the lazy transformer from featurewiz for next round ### 42 | 43 | # Find the most common set of features (most stable) and use them to train a logistic regression model 44 | common_features = list(set(selected_features[0]).intersection(*selected_features)) 45 | print('Common most stable features:', len(common_features), 'features are:\n', common_features) 46 | #### Now transform your features to all-numeric using featurewiz' lazy transformer ### 47 | X_train_selected_all = fwiz_all.transform(X_train) 48 | X_test_selected_all = fwiz_all.transform(X_test) 49 | 50 | # Evaluate the performance of the model on each round and compare it to the final accuracy with common features 51 | accuracies = [] 52 | for i in range(num_rounds): 53 | model_round = LogisticRegression() 54 | model_round.fit(X_train_selected_all[selected_features[i]], y_train) 55 | accuracy_round = model_round.score(X_test_selected_all[selected_features[i]], y_test) 56 | accuracies.append(accuracy_round) 57 | 58 | model_final = LogisticRegression() 59 | 
 model_final.fit(X_train_selected_all[common_features], y_train) 60 | accuracy_final = model_final.score(X_test_selected_all[common_features], y_test) 61 | print('Individual accuracy from', len(accuracies), 'rounds is:', accuracies) 62 | print('Average accuracy from', num_rounds, 'rounds =', sum(accuracies)/len(accuracies), '\nvs. final accuracy with common features:', accuracy_final)
      \n", 62 | "\n", 75 | "\n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | "
      fixed acidityvolatile aciditycitric acidresidual sugarchloridesfree sulfur dioxidetotal sulfur dioxidedensitypHsulphatesalcoholqualityred_wine
      07.40.700.001.90.07611.034.00.99783.510.569.451
      17.80.880.002.60.09825.067.00.99683.200.689.851
      27.80.760.042.30.09215.054.00.99703.260.659.851
      311.20.280.561.90.07517.060.00.99803.160.589.861
      47.40.700.001.90.07611.034.00.99783.510.569.451
      \n", 177 | "
      " 178 | ], 179 | "text/plain": [ 180 | " fixed acidity volatile acidity citric acid residual sugar chlorides \\\n", 181 | "0 7.4 0.70 0.00 1.9 0.076 \n", 182 | "1 7.8 0.88 0.00 2.6 0.098 \n", 183 | "2 7.8 0.76 0.04 2.3 0.092 \n", 184 | "3 11.2 0.28 0.56 1.9 0.075 \n", 185 | "4 7.4 0.70 0.00 1.9 0.076 \n", 186 | "\n", 187 | " free sulfur dioxide total sulfur dioxide density pH sulphates \\\n", 188 | "0 11.0 34.0 0.9978 3.51 0.56 \n", 189 | "1 25.0 67.0 0.9968 3.20 0.68 \n", 190 | "2 15.0 54.0 0.9970 3.26 0.65 \n", 191 | "3 17.0 60.0 0.9980 3.16 0.58 \n", 192 | "4 11.0 34.0 0.9978 3.51 0.56 \n", 193 | "\n", 194 | " alcohol quality red_wine \n", 195 | "0 9.4 5 1 \n", 196 | "1 9.8 5 1 \n", 197 | "2 9.8 5 1 \n", 198 | "3 9.8 6 1 \n", 199 | "4 9.4 5 1 " 200 | ] 201 | }, 202 | "execution_count": 3, 203 | "metadata": {}, 204 | "output_type": "execute_result" 205 | } 206 | ], 207 | "source": [ 208 | "trainfile = 'c:/users/ram/documents/ram/data_sets/kaggle/diabetes.csv'\n", 209 | "datapath = '../Ram/Data_Sets/'\n", 210 | "filename = 'winequality.csv'\n", 211 | "#filename = 'affairs.csv'\n", 212 | "trainfile = datapath+filename\n", 213 | "sep = ','\n", 214 | "dft = pd.read_csv(trainfile,sep=sep)\n", 215 | "#dft.drop(['affairs','affair'],axis=1, inplace=True)\n", 216 | "print(dft.shape)\n", 217 | "dft.head()" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 4, 223 | "id": "603bf23c", 224 | "metadata": {}, 225 | "outputs": [ 226 | { 227 | "data": { 228 | "text/plain": [ 229 | "7" 230 | ] 231 | }, 232 | "execution_count": 4, 233 | "metadata": {}, 234 | "output_type": "execute_result" 235 | } 236 | ], 237 | "source": [ 238 | "target = 'quality'\n", 239 | "#target = 'affair_multiclass'\n", 240 | "modeltype = 'Multi_Classification'\n", 241 | "preds = [x for x in list(dft) if x not in target]\n", 242 | "dft[target].nunique()" 243 | ] 244 | }, 245 | { 246 | "cell_type": "raw", 247 | "id": "1ba875d4", 248 | "metadata": {}, 249 | "source": [ 250 | "from sklearn.datasets import make_classification, make_regression\n", 251 | "from sklearn.model_selection import train_test_split\n", 252 | "from sklearn.metrics import accuracy_score, mean_squared_error\n", 253 | "if modeltype == 'Regression':\n", 254 | " X, y = make_regression(n_samples=10000, noise=1000, n_features=8, random_state=0)\n", 255 | "else:\n", 256 | " X, y = make_classification(n_samples=10000, n_classes=5, n_features=8, n_informative=4, random_state=0)\n", 257 | "# split dataset into train and test sets\n", 258 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=99)" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": 5, 264 | "id": "1494931d", 265 | "metadata": {}, 266 | "outputs": [ 267 | { 268 | "name": "stdout", 269 | "output_type": "stream", 270 | "text": [ 271 | "(5197, 12) (1300, 12)\n" 272 | ] 273 | } 274 | ], 275 | "source": [ 276 | "from sklearn.model_selection import train_test_split\n", 277 | "from featurewiz import FE_kmeans_resampler\n", 278 | "if modeltype == 'Regression':\n", 279 | " X_train, X_test, y_train, y_test = train_test_split(dft[preds], dft[target], test_size=0.20, random_state=1,)\n", 280 | " X_train_over, y_train_over = FE_kmeans_resampler(X_train, y_train, target, smote='',verbose=0)\n", 281 | " print(X_train_over.shape, X_test.shape)\n", 282 | " #train, test = pd.concat([X_train_over, pd.Series(y_train_over,name=target)], axis=1), pd.concat([X_test, y_test], axis=1)\n", 283 | " train, test = train_test_split(dft, 
test_size=0.20, random_state=42)\n", 284 | "else:\n", 285 | " X_train, X_test, y_train, y_test = train_test_split(dft[preds], dft[target], test_size=0.20, \n", 286 | " stratify=dft[target],\n", 287 | " random_state=42)\n", 288 | " train, test = train_test_split(dft, test_size=0.20, random_state=42,\n", 289 | " stratify=dft[target]\n", 290 | " )\n", 291 | "print(X_train.shape, X_test.shape)" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 6, 297 | "id": "6f387d3e", 298 | "metadata": {}, 299 | "outputs": [ 300 | { 301 | "name": "stdout", 302 | "output_type": "stream", 303 | "text": [ 304 | "featurewiz is given 0.9 as correlation limit...\n", 305 | " Skipping feature engineering since no feature_engg input...\n", 306 | " final list of category encoders given: ['onehot', 'label']\n", 307 | " You need to pip install tensorflow>= 2.5 in order to use this Autoencoder option.\n", 308 | "Since Auto Encoders are selected for feature extraction,\n", 309 | " Recursive XGBoost is also skipped...\n", 310 | "CNNAutoEncoder()\n", 311 | " AE dictionary given: dict_items([])\n", 312 | " final list of scalers given: [minmax]\n" 313 | ] 314 | } 315 | ], 316 | "source": [ 317 | "scaler = FeatureWiz(feature_engg = '', nrows=None, transform_target=True,\n", 318 | " \t\tcategory_encoders=\"auto\", auto_encoders='CNN_ADD', ae_options={},\n", 319 | " \t\tadd_missing=False, imbalanced=False, verbose=0)" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 7, 325 | "id": "a8d5fedd", 326 | "metadata": {}, 327 | "outputs": [ 328 | { 329 | "name": "stdout", 330 | "output_type": "stream", 331 | "text": [ 332 | "Loaded input data. Shape = (5197, 12)\n", 333 | "#### Starting featurewiz transform for train data ####\n", 334 | " Single_Label Multi_Classification problem \n", 335 | "Shape of dataset: (5197, 12). 
Now we classify variables into different types...\n", 336 | "Time taken to define data pipeline = 1 second(s)\n", 337 | "No model input given...\n", 338 | "Lazy Transformer Pipeline created...\n", 339 | " transformed target from object type to numeric\n", 340 | " Time taken to fit dataset = 1 second(s)\n", 341 | " Time taken to transform dataset = 1 second(s)\n", 342 | " Shape of transformed dataset: (5197, 12)\n", 343 | " No hyperparam selection since GAN or CNN is selected for auto_encoders...\n", 344 | "Fitting and transforming CNNAutoEncoder for dataset...\n", 345 | "Epoch 1/100\n", 346 | "130/130 [==============================] - 2s 7ms/step - loss: 0.0133 - val_loss: 0.0038 - lr: 0.0010\n", 347 | "Epoch 2/100\n", 348 | "130/130 [==============================] - 1s 5ms/step - loss: 0.0029 - val_loss: 0.0022 - lr: 0.0010\n", 349 | "Epoch 3/100\n", 350 | "130/130 [==============================] - 1s 5ms/step - loss: 0.0019 - val_loss: 0.0015 - lr: 0.0010\n", 351 | "Epoch 4/100\n", 352 | "130/130 [==============================] - 1s 6ms/step - loss: 0.0012 - val_loss: 0.0011 - lr: 0.0010\n", 353 | "Epoch 5/100\n", 354 | "130/130 [==============================] - 1s 6ms/step - loss: 9.0416e-04 - val_loss: 8.1538e-04 - lr: 0.0010\n", 355 | "Epoch 6/100\n", 356 | "130/130 [==============================] - 1s 6ms/step - loss: 6.8004e-04 - val_loss: 5.5630e-04 - lr: 0.0010\n", 357 | "Epoch 7/100\n", 358 | "130/130 [==============================] - 1s 6ms/step - loss: 4.7319e-04 - val_loss: 4.5762e-04 - lr: 0.0010\n", 359 | "Epoch 8/100\n", 360 | "130/130 [==============================] - 1s 5ms/step - loss: 4.0537e-04 - val_loss: 4.0449e-04 - lr: 0.0010\n", 361 | "Epoch 9/100\n", 362 | "130/130 [==============================] - 1s 5ms/step - loss: 3.7104e-04 - val_loss: 3.6171e-04 - lr: 0.0010\n", 363 | "Epoch 10/100\n", 364 | "130/130 [==============================] - 1s 5ms/step - loss: 3.4615e-04 - val_loss: 3.7683e-04 - lr: 0.0010\n", 365 | "Epoch 11/100\n", 366 | "130/130 [==============================] - 1s 5ms/step - loss: 3.2180e-04 - val_loss: 3.3386e-04 - lr: 0.0010\n", 367 | "Epoch 12/100\n", 368 | "130/130 [==============================] - 1s 6ms/step - loss: 3.1295e-04 - val_loss: 3.1308e-04 - lr: 0.0010\n", 369 | "Epoch 13/100\n", 370 | "130/130 [==============================] - 1s 6ms/step - loss: 2.9671e-04 - val_loss: 3.3688e-04 - lr: 0.0010\n", 371 | "Epoch 14/100\n", 372 | "130/130 [==============================] - 1s 6ms/step - loss: 2.4843e-04 - val_loss: 2.6806e-04 - lr: 5.0000e-04\n", 373 | "Epoch 15/100\n", 374 | "130/130 [==============================] - 1s 5ms/step - loss: 2.4102e-04 - val_loss: 2.5663e-04 - lr: 5.0000e-04\n", 375 | "Epoch 16/100\n", 376 | "130/130 [==============================] - 1s 5ms/step - loss: 2.3189e-04 - val_loss: 2.6313e-04 - lr: 5.0000e-04\n", 377 | "Epoch 17/100\n", 378 | "130/130 [==============================] - 1s 5ms/step - loss: 2.3202e-04 - val_loss: 2.5388e-04 - lr: 5.0000e-04\n", 379 | "Epoch 18/100\n", 380 | "130/130 [==============================] - 1s 5ms/step - loss: 2.2195e-04 - val_loss: 2.3446e-04 - lr: 5.0000e-04\n", 381 | "Epoch 19/100\n", 382 | "130/130 [==============================] - 1s 5ms/step - loss: 2.1615e-04 - val_loss: 2.3867e-04 - lr: 5.0000e-04\n", 383 | "Epoch 20/100\n", 384 | "130/130 [==============================] - 1s 5ms/step - loss: 1.9608e-04 - val_loss: 2.2372e-04 - lr: 2.5000e-04\n", 385 | "Epoch 21/100\n", 386 | "130/130 [==============================] - 1s 6ms/step - loss: 
1.9390e-04 - val_loss: 2.1614e-04 - lr: 2.5000e-04\n", 387 | "Epoch 22/100\n", 388 | "130/130 [==============================] - 1s 6ms/step - loss: 1.8848e-04 - val_loss: 2.0742e-04 - lr: 2.5000e-04\n", 389 | "Epoch 23/100\n", 390 | "130/130 [==============================] - 1s 5ms/step - loss: 1.8496e-04 - val_loss: 2.1073e-04 - lr: 2.5000e-04\n", 391 | "Epoch 24/100\n", 392 | "130/130 [==============================] - 1s 5ms/step - loss: 1.8005e-04 - val_loss: 2.0087e-04 - lr: 2.5000e-04\n", 393 | "Epoch 25/100\n", 394 | "130/130 [==============================] - 1s 5ms/step - loss: 1.7124e-04 - val_loss: 1.9250e-04 - lr: 1.2500e-04\n", 395 | "Epoch 26/100\n", 396 | "130/130 [==============================] - 1s 5ms/step - loss: 1.6716e-04 - val_loss: 1.8936e-04 - lr: 1.2500e-04\n", 397 | "Epoch 27/100\n", 398 | "130/130 [==============================] - 1s 5ms/step - loss: 1.6401e-04 - val_loss: 1.8330e-04 - lr: 1.2500e-04\n", 399 | "Epoch 28/100\n", 400 | "130/130 [==============================] - 1s 5ms/step - loss: 1.6035e-04 - val_loss: 1.8186e-04 - lr: 1.2500e-04\n", 401 | "Epoch 29/100\n", 402 | "130/130 [==============================] - 1s 5ms/step - loss: 1.5901e-04 - val_loss: 1.8252e-04 - lr: 1.2500e-04\n", 403 | "Epoch 30/100\n", 404 | "130/130 [==============================] - 1s 6ms/step - loss: 1.5409e-04 - val_loss: 1.7704e-04 - lr: 1.0000e-04\n", 405 | "Epoch 31/100\n", 406 | "130/130 [==============================] - 1s 5ms/step - loss: 1.5029e-04 - val_loss: 1.6895e-04 - lr: 1.0000e-04\n", 407 | "Epoch 32/100\n", 408 | "130/130 [==============================] - 1s 5ms/step - loss: 1.4684e-04 - val_loss: 1.6386e-04 - lr: 1.0000e-04\n", 409 | "Epoch 33/100\n", 410 | "130/130 [==============================] - 1s 5ms/step - loss: 1.4408e-04 - val_loss: 1.6665e-04 - lr: 1.0000e-04\n", 411 | "Epoch 34/100\n", 412 | "130/130 [==============================] - 1s 5ms/step - loss: 1.4066e-04 - val_loss: 1.5804e-04 - lr: 1.0000e-04\n", 413 | "Epoch 35/100\n", 414 | "130/130 [==============================] - 1s 5ms/step - loss: 1.3626e-04 - val_loss: 1.5064e-04 - lr: 1.0000e-04\n", 415 | "Epoch 36/100\n", 416 | "130/130 [==============================] - 1s 5ms/step - loss: 1.3172e-04 - val_loss: 1.4678e-04 - lr: 1.0000e-04\n", 417 | "Epoch 37/100\n", 418 | "130/130 [==============================] - 1s 5ms/step - loss: 1.2683e-04 - val_loss: 1.4821e-04 - lr: 1.0000e-04\n", 419 | "Epoch 38/100\n", 420 | "130/130 [==============================] - 1s 5ms/step - loss: 1.2296e-04 - val_loss: 1.3887e-04 - lr: 1.0000e-04\n", 421 | "Epoch 39/100\n", 422 | "130/130 [==============================] - 1s 5ms/step - loss: 1.1663e-04 - val_loss: 1.2789e-04 - lr: 1.0000e-04\n", 423 | "Epoch 40/100\n", 424 | "130/130 [==============================] - 1s 5ms/step - loss: 1.1185e-04 - val_loss: 1.2739e-04 - lr: 1.0000e-04\n", 425 | "Epoch 41/100\n", 426 | "130/130 [==============================] - 1s 5ms/step - loss: 1.0750e-04 - val_loss: 1.1749e-04 - lr: 1.0000e-04\n", 427 | "Epoch 42/100\n", 428 | "130/130 [==============================] - 1s 5ms/step - loss: 1.0330e-04 - val_loss: 1.1409e-04 - lr: 1.0000e-04\n", 429 | "Epoch 43/100\n", 430 | "130/130 [==============================] - 1s 7ms/step - loss: 9.8775e-05 - val_loss: 1.0765e-04 - lr: 1.0000e-04\n", 431 | "Epoch 44/100\n", 432 | "130/130 [==============================] - 1s 6ms/step - loss: 9.4918e-05 - val_loss: 1.0886e-04 - lr: 1.0000e-04\n", 433 | "Epoch 45/100\n", 434 | "130/130 [==============================] 
- 1s 6ms/step - loss: 9.0940e-05 - val_loss: 1.0206e-04 - lr: 1.0000e-04\n", 435 | "Epoch 46/100\n", 436 | "130/130 [==============================] - 1s 6ms/step - loss: 8.8166e-05 - val_loss: 9.9367e-05 - lr: 1.0000e-04\n", 437 | "Epoch 47/100\n", 438 | "130/130 [==============================] - 1s 6ms/step - loss: 8.5611e-05 - val_loss: 9.5914e-05 - lr: 1.0000e-04\n", 439 | "Epoch 48/100\n", 440 | "130/130 [==============================] - 1s 6ms/step - loss: 8.2108e-05 - val_loss: 9.6259e-05 - lr: 1.0000e-04\n", 441 | "Epoch 49/100\n", 442 | "130/130 [==============================] - 1s 6ms/step - loss: 8.0729e-05 - val_loss: 9.1706e-05 - lr: 1.0000e-04\n", 443 | "Epoch 50/100\n", 444 | "130/130 [==============================] - 1s 6ms/step - loss: 7.8945e-05 - val_loss: 8.6837e-05 - lr: 1.0000e-04\n", 445 | "Epoch 51/100\n", 446 | "130/130 [==============================] - 1s 6ms/step - loss: 7.7156e-05 - val_loss: 8.8305e-05 - lr: 1.0000e-04\n", 447 | "Epoch 52/100\n", 448 | "130/130 [==============================] - 1s 6ms/step - loss: 7.5093e-05 - val_loss: 8.8565e-05 - lr: 1.0000e-04\n", 449 | "Epoch 53/100\n", 450 | "130/130 [==============================] - 1s 5ms/step - loss: 7.5755e-05 - val_loss: 8.5557e-05 - lr: 1.0000e-04\n", 451 | "Epoch 54/100\n", 452 | "130/130 [==============================] - 1s 5ms/step - loss: 7.2456e-05 - val_loss: 8.1475e-05 - lr: 1.0000e-04\n", 453 | "Epoch 55/100\n", 454 | "130/130 [==============================] - 1s 5ms/step - loss: 7.1072e-05 - val_loss: 7.8665e-05 - lr: 1.0000e-04\n", 455 | "Epoch 56/100\n", 456 | "130/130 [==============================] - 1s 5ms/step - loss: 7.1659e-05 - val_loss: 8.1373e-05 - lr: 1.0000e-04\n", 457 | "Epoch 57/100\n", 458 | "130/130 [==============================] - 1s 6ms/step - loss: 6.9972e-05 - val_loss: 8.0773e-05 - lr: 1.0000e-04\n", 459 | "Epoch 58/100\n", 460 | "130/130 [==============================] - 1s 6ms/step - loss: 6.8407e-05 - val_loss: 7.8222e-05 - lr: 1.0000e-04\n", 461 | "Epoch 59/100\n", 462 | "130/130 [==============================] - 1s 5ms/step - loss: 6.5964e-05 - val_loss: 7.4572e-05 - lr: 1.0000e-04\n", 463 | "Epoch 60/100\n", 464 | "130/130 [==============================] - 1s 5ms/step - loss: 6.6795e-05 - val_loss: 8.0847e-05 - lr: 1.0000e-04\n", 465 | "Epoch 61/100\n" 466 | ] 467 | }, 468 | { 469 | "name": "stdout", 470 | "output_type": "stream", 471 | "text": [ 472 | "130/130 [==============================] - 1s 5ms/step - loss: 6.5861e-05 - val_loss: 7.6759e-05 - lr: 1.0000e-04\n", 473 | "Epoch 62/100\n", 474 | "130/130 [==============================] - 1s 5ms/step - loss: 6.4737e-05 - val_loss: 7.4082e-05 - lr: 1.0000e-04\n", 475 | "Epoch 63/100\n", 476 | "130/130 [==============================] - 1s 5ms/step - loss: 6.4065e-05 - val_loss: 7.2013e-05 - lr: 1.0000e-04\n", 477 | "Epoch 64/100\n", 478 | "130/130 [==============================] - 1s 5ms/step - loss: 6.3586e-05 - val_loss: 7.1381e-05 - lr: 1.0000e-04\n", 479 | "Epoch 65/100\n", 480 | "130/130 [==============================] - 1s 5ms/step - loss: 6.2723e-05 - val_loss: 6.9830e-05 - lr: 1.0000e-04\n", 481 | "Epoch 66/100\n", 482 | "130/130 [==============================] - 1s 5ms/step - loss: 6.1998e-05 - val_loss: 7.1838e-05 - lr: 1.0000e-04\n", 483 | "Epoch 67/100\n", 484 | "130/130 [==============================] - 1s 5ms/step - loss: 6.0689e-05 - val_loss: 6.7790e-05 - lr: 1.0000e-04\n", 485 | "Epoch 68/100\n", 486 | "130/130 [==============================] - 1s 5ms/step - loss: 6.0923e-05 
- val_loss: 6.7505e-05 - lr: 1.0000e-04\n", 487 | "Epoch 69/100\n", 488 | "130/130 [==============================] - 1s 5ms/step - loss: 6.0179e-05 - val_loss: 6.8082e-05 - lr: 1.0000e-04\n", 489 | "Epoch 70/100\n", 490 | "130/130 [==============================] - 1s 5ms/step - loss: 5.8860e-05 - val_loss: 7.0600e-05 - lr: 1.0000e-04\n", 491 | "Epoch 71/100\n", 492 | "130/130 [==============================] - 1s 5ms/step - loss: 5.9904e-05 - val_loss: 6.6730e-05 - lr: 1.0000e-04\n", 493 | "Epoch 72/100\n", 494 | "130/130 [==============================] - 1s 5ms/step - loss: 5.7617e-05 - val_loss: 6.6053e-05 - lr: 1.0000e-04\n", 495 | "Epoch 73/100\n", 496 | "130/130 [==============================] - 1s 5ms/step - loss: 5.8018e-05 - val_loss: 6.3887e-05 - lr: 1.0000e-04\n", 497 | "Epoch 74/100\n", 498 | "129/130 [============================>.] - ETA: 0s - loss: 5.6845e-05Restoring model weights from the end of the best epoch: 64.\n", 499 | "130/130 [==============================] - 1s 5ms/step - loss: 5.6692e-05 - val_loss: 6.5505e-05 - lr: 1.0000e-04\n", 500 | "Epoch 00074: early stopping\n", 501 | "Shape of transformed data due to auto encoder = (5197, 24)\n", 502 | " Single_Label Multi_Classification problem \n", 503 | "Starting SULOV with 24 features...\n", 504 | " there are no null values in dataset...\n", 505 | " there are no null values in target column...\n", 506 | "Completed SULOV. 12 features selected\n", 507 | " time taken to run entire featurewiz = 57 second(s)\n", 508 | "Recursive XGBoost selected 12 features...\n" 509 | ] 510 | } 511 | ], 512 | "source": [ 513 | "# Load and preprocess your dataset\n", 514 | "# Assuming X_train and y_train are your training data and labels\n", 515 | "X_train_selected, y_train = scaler.fit_transform(X_train, y_train)" 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": 8, 521 | "id": "1d287a58", 522 | "metadata": {}, 523 | "outputs": [ 524 | { 525 | "name": "stdout", 526 | "output_type": "stream", 527 | "text": [ 528 | "#### Starting featurewiz transform for test data ####\n", 529 | "Loaded input data. 
Shape = (1300, 12)\n", 530 | "#### Starting lazytransform for test data ####\n", 531 | " Time taken to transform dataset = 1 second(s)\n", 532 | " Shape of transformed dataset: (1300, 12)\n", 533 | "Shape of transformed data due to auto encoder = (1300, 24)\n", 534 | "Returning dataframe with 12 features \n" 535 | ] 536 | } 537 | ], 538 | "source": [ 539 | "### Since you modified y_train to numeric, you must do same for y_test\n", 540 | "X_test_selected = scaler.transform(X_test)\n", 541 | "if scaler.lazy.yformer:\n", 542 | " y_test = scaler.lazy.yformer.transform(y_test)" 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": 9, 548 | "id": "e1a62c58", 549 | "metadata": {}, 550 | "outputs": [ 551 | { 552 | "name": "stdout", 553 | "output_type": "stream", 554 | "text": [ 555 | "Bal accu 36%\n", 556 | "ROC AUC = 0.84\n", 557 | " precision recall f1-score support\n", 558 | "\n", 559 | " 0 0.00 0.00 0.00 6\n", 560 | " 1 0.71 0.12 0.20 43\n", 561 | " 2 0.73 0.69 0.71 428\n", 562 | " 3 0.65 0.79 0.71 567\n", 563 | " 4 0.69 0.56 0.62 216\n", 564 | " 5 0.93 0.36 0.52 39\n", 565 | " 6 0.00 0.00 0.00 1\n", 566 | "\n", 567 | " accuracy 0.68 1300\n", 568 | " macro avg 0.53 0.36 0.39 1300\n", 569 | "weighted avg 0.69 0.68 0.67 1300\n", 570 | "\n", 571 | "final average balanced accuracy score = 0.36\n" 572 | ] 573 | } 574 | ], 575 | "source": [ 576 | "import numpy as np\n", 577 | "from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor\n", 578 | "from sklearn.utils import class_weight\n", 579 | "from sklearn.metrics import accuracy_score, classification_report\n", 580 | "from featurewiz import get_class_distribution\n", 581 | "from xgboost import XGBClassifier, XGBRFRegressor\n", 582 | "# Updating the Random Forest Classifier with the corrected class weights\n", 583 | "if modeltype == 'Regression':\n", 584 | " #rf_classifier = RandomForestRegressor(random_state=42)\n", 585 | " rf_classifier = XGBRFRegressor(n_estimators=300, random_state=99)\n", 586 | "else:\n", 587 | " # Correctly computing class weights for the classes present in the training set\n", 588 | " class_weights_dict_corrected = get_class_distribution(y_train)\n", 589 | " rf_classifier = RandomForestClassifier(n_estimators=100, class_weight=class_weights_dict_corrected, random_state=42)\n", 590 | "\n", 591 | "\n", 592 | "# Fitting the classifier on the training data\n", 593 | "rf_classifier.fit(X_train_selected, y_train)\n", 594 | "\n", 595 | "# Predicting on the test set\n", 596 | "y_pred = rf_classifier.predict(X_test_selected)\n", 597 | "\n", 598 | "if modeltype == 'Regression':\n", 599 | " print_regression_metrics(y_test, y_pred, verbose=1)\n", 600 | "else:\n", 601 | " # Evaluating the classifier\n", 602 | " y_probas = rf_classifier.predict_proba(X_test_selected)\n", 603 | " print_classification_metrics(y_test, y_pred, y_probas, verbose=1)" 604 | ] 605 | }, 606 | { 607 | "cell_type": "code", 608 | "execution_count": null, 609 | "id": "1dd98599", 610 | "metadata": {}, 611 | "outputs": [], 612 | "source": [] 613 | }, 614 | { 615 | "cell_type": "code", 616 | "execution_count": null, 617 | "id": "d71a2530", 618 | "metadata": {}, 619 | "outputs": [], 620 | "source": [] 621 | } 622 | ], 623 | "metadata": { 624 | "kernelspec": { 625 | "display_name": "Python 3", 626 | "language": "python", 627 | "name": "python3" 628 | }, 629 | "language_info": { 630 | "codemirror_mode": { 631 | "name": "ipython", 632 | "version": 3 633 | }, 634 | "file_extension": ".py", 635 | "mimetype": "text/x-python", 636 | "name": 
"python", 637 | "nbconvert_exporter": "python", 638 | "pygments_lexer": "ipython3", 639 | "version": "3.8.5" 640 | } 641 | }, 642 | "nbformat": 4, 643 | "nbformat_minor": 5 644 | } 645 | -------------------------------------------------------------------------------- /examples/heart.csv: -------------------------------------------------------------------------------- 1 | age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target 2 | 63,1,3,145,233,1,0,150,0,2.3,0,0,1,1 3 | 37,1,2,130,250,0,1,187,0,3.5,0,0,2,1 4 | 41,0,1,130,204,0,0,172,0,1.4,2,0,2,1 5 | 56,1,1,120,236,0,1,178,0,0.8,2,0,2,1 6 | 57,0,0,120,354,0,1,163,1,0.6,2,0,2,1 7 | 57,1,0,140,192,0,1,148,0,0.4,1,0,1,1 8 | 56,0,1,140,294,0,0,153,0,1.3,1,0,2,1 9 | 44,1,1,120,263,0,1,173,0,0,2,0,3,1 10 | 52,1,2,172,199,1,1,162,0,0.5,2,0,3,1 11 | 57,1,2,150,168,0,1,174,0,1.6,2,0,2,1 12 | 54,1,0,140,239,0,1,160,0,1.2,2,0,2,1 13 | 48,0,2,130,275,0,1,139,0,0.2,2,0,2,1 14 | 49,1,1,130,266,0,1,171,0,0.6,2,0,2,1 15 | 64,1,3,110,211,0,0,144,1,1.8,1,0,2,1 16 | 58,0,3,150,283,1,0,162,0,1,2,0,2,1 17 | 50,0,2,120,219,0,1,158,0,1.6,1,0,2,1 18 | 58,0,2,120,340,0,1,172,0,0,2,0,2,1 19 | 66,0,3,150,226,0,1,114,0,2.6,0,0,2,1 20 | 43,1,0,150,247,0,1,171,0,1.5,2,0,2,1 21 | 69,0,3,140,239,0,1,151,0,1.8,2,2,2,1 22 | 59,1,0,135,234,0,1,161,0,0.5,1,0,3,1 23 | 44,1,2,130,233,0,1,179,1,0.4,2,0,2,1 24 | 42,1,0,140,226,0,1,178,0,0,2,0,2,1 25 | 61,1,2,150,243,1,1,137,1,1,1,0,2,1 26 | 40,1,3,140,199,0,1,178,1,1.4,2,0,3,1 27 | 71,0,1,160,302,0,1,162,0,0.4,2,2,2,1 28 | 59,1,2,150,212,1,1,157,0,1.6,2,0,2,1 29 | 51,1,2,110,175,0,1,123,0,0.6,2,0,2,1 30 | 65,0,2,140,417,1,0,157,0,0.8,2,1,2,1 31 | 53,1,2,130,197,1,0,152,0,1.2,0,0,2,1 32 | 41,0,1,105,198,0,1,168,0,0,2,1,2,1 33 | 65,1,0,120,177,0,1,140,0,0.4,2,0,3,1 34 | 44,1,1,130,219,0,0,188,0,0,2,0,2,1 35 | 54,1,2,125,273,0,0,152,0,0.5,0,1,2,1 36 | 51,1,3,125,213,0,0,125,1,1.4,2,1,2,1 37 | 46,0,2,142,177,0,0,160,1,1.4,0,0,2,1 38 | 54,0,2,135,304,1,1,170,0,0,2,0,2,1 39 | 54,1,2,150,232,0,0,165,0,1.6,2,0,3,1 40 | 65,0,2,155,269,0,1,148,0,0.8,2,0,2,1 41 | 65,0,2,160,360,0,0,151,0,0.8,2,0,2,1 42 | 51,0,2,140,308,0,0,142,0,1.5,2,1,2,1 43 | 48,1,1,130,245,0,0,180,0,0.2,1,0,2,1 44 | 45,1,0,104,208,0,0,148,1,3,1,0,2,1 45 | 53,0,0,130,264,0,0,143,0,0.4,1,0,2,1 46 | 39,1,2,140,321,0,0,182,0,0,2,0,2,1 47 | 52,1,1,120,325,0,1,172,0,0.2,2,0,2,1 48 | 44,1,2,140,235,0,0,180,0,0,2,0,2,1 49 | 47,1,2,138,257,0,0,156,0,0,2,0,2,1 50 | 53,0,2,128,216,0,0,115,0,0,2,0,0,1 51 | 53,0,0,138,234,0,0,160,0,0,2,0,2,1 52 | 51,0,2,130,256,0,0,149,0,0.5,2,0,2,1 53 | 66,1,0,120,302,0,0,151,0,0.4,1,0,2,1 54 | 62,1,2,130,231,0,1,146,0,1.8,1,3,3,1 55 | 44,0,2,108,141,0,1,175,0,0.6,1,0,2,1 56 | 63,0,2,135,252,0,0,172,0,0,2,0,2,1 57 | 52,1,1,134,201,0,1,158,0,0.8,2,1,2,1 58 | 48,1,0,122,222,0,0,186,0,0,2,0,2,1 59 | 45,1,0,115,260,0,0,185,0,0,2,0,2,1 60 | 34,1,3,118,182,0,0,174,0,0,2,0,2,1 61 | 57,0,0,128,303,0,0,159,0,0,2,1,2,1 62 | 71,0,2,110,265,1,0,130,0,0,2,1,2,1 63 | 54,1,1,108,309,0,1,156,0,0,2,0,3,1 64 | 52,1,3,118,186,0,0,190,0,0,1,0,1,1 65 | 41,1,1,135,203,0,1,132,0,0,1,0,1,1 66 | 58,1,2,140,211,1,0,165,0,0,2,0,2,1 67 | 35,0,0,138,183,0,1,182,0,1.4,2,0,2,1 68 | 51,1,2,100,222,0,1,143,1,1.2,1,0,2,1 69 | 45,0,1,130,234,0,0,175,0,0.6,1,0,2,1 70 | 44,1,1,120,220,0,1,170,0,0,2,0,2,1 71 | 62,0,0,124,209,0,1,163,0,0,2,0,2,1 72 | 54,1,2,120,258,0,0,147,0,0.4,1,0,3,1 73 | 51,1,2,94,227,0,1,154,1,0,2,1,3,1 74 | 29,1,1,130,204,0,0,202,0,0,2,0,2,1 75 | 51,1,0,140,261,0,0,186,1,0,2,0,2,1 76 | 43,0,2,122,213,0,1,165,0,0.2,1,0,2,1 77 | 
55,0,1,135,250,0,0,161,0,1.4,1,0,2,1 78 | 51,1,2,125,245,1,0,166,0,2.4,1,0,2,1 79 | 59,1,1,140,221,0,1,164,1,0,2,0,2,1 80 | 52,1,1,128,205,1,1,184,0,0,2,0,2,1 81 | 58,1,2,105,240,0,0,154,1,0.6,1,0,3,1 82 | 41,1,2,112,250,0,1,179,0,0,2,0,2,1 83 | 45,1,1,128,308,0,0,170,0,0,2,0,2,1 84 | 60,0,2,102,318,0,1,160,0,0,2,1,2,1 85 | 52,1,3,152,298,1,1,178,0,1.2,1,0,3,1 86 | 42,0,0,102,265,0,0,122,0,0.6,1,0,2,1 87 | 67,0,2,115,564,0,0,160,0,1.6,1,0,3,1 88 | 68,1,2,118,277,0,1,151,0,1,2,1,3,1 89 | 46,1,1,101,197,1,1,156,0,0,2,0,3,1 90 | 54,0,2,110,214,0,1,158,0,1.6,1,0,2,1 91 | 58,0,0,100,248,0,0,122,0,1,1,0,2,1 92 | 48,1,2,124,255,1,1,175,0,0,2,2,2,1 93 | 57,1,0,132,207,0,1,168,1,0,2,0,3,1 94 | 52,1,2,138,223,0,1,169,0,0,2,4,2,1 95 | 54,0,1,132,288,1,0,159,1,0,2,1,2,1 96 | 45,0,1,112,160,0,1,138,0,0,1,0,2,1 97 | 53,1,0,142,226,0,0,111,1,0,2,0,3,1 98 | 62,0,0,140,394,0,0,157,0,1.2,1,0,2,1 99 | 52,1,0,108,233,1,1,147,0,0.1,2,3,3,1 100 | 43,1,2,130,315,0,1,162,0,1.9,2,1,2,1 101 | 53,1,2,130,246,1,0,173,0,0,2,3,2,1 102 | 42,1,3,148,244,0,0,178,0,0.8,2,2,2,1 103 | 59,1,3,178,270,0,0,145,0,4.2,0,0,3,1 104 | 63,0,1,140,195,0,1,179,0,0,2,2,2,1 105 | 42,1,2,120,240,1,1,194,0,0.8,0,0,3,1 106 | 50,1,2,129,196,0,1,163,0,0,2,0,2,1 107 | 68,0,2,120,211,0,0,115,0,1.5,1,0,2,1 108 | 69,1,3,160,234,1,0,131,0,0.1,1,1,2,1 109 | 45,0,0,138,236,0,0,152,1,0.2,1,0,2,1 110 | 50,0,1,120,244,0,1,162,0,1.1,2,0,2,1 111 | 50,0,0,110,254,0,0,159,0,0,2,0,2,1 112 | 64,0,0,180,325,0,1,154,1,0,2,0,2,1 113 | 57,1,2,150,126,1,1,173,0,0.2,2,1,3,1 114 | 64,0,2,140,313,0,1,133,0,0.2,2,0,3,1 115 | 43,1,0,110,211,0,1,161,0,0,2,0,3,1 116 | 55,1,1,130,262,0,1,155,0,0,2,0,2,1 117 | 37,0,2,120,215,0,1,170,0,0,2,0,2,1 118 | 41,1,2,130,214,0,0,168,0,2,1,0,2,1 119 | 56,1,3,120,193,0,0,162,0,1.9,1,0,3,1 120 | 46,0,1,105,204,0,1,172,0,0,2,0,2,1 121 | 46,0,0,138,243,0,0,152,1,0,1,0,2,1 122 | 64,0,0,130,303,0,1,122,0,2,1,2,2,1 123 | 59,1,0,138,271,0,0,182,0,0,2,0,2,1 124 | 41,0,2,112,268,0,0,172,1,0,2,0,2,1 125 | 54,0,2,108,267,0,0,167,0,0,2,0,2,1 126 | 39,0,2,94,199,0,1,179,0,0,2,0,2,1 127 | 34,0,1,118,210,0,1,192,0,0.7,2,0,2,1 128 | 47,1,0,112,204,0,1,143,0,0.1,2,0,2,1 129 | 67,0,2,152,277,0,1,172,0,0,2,1,2,1 130 | 52,0,2,136,196,0,0,169,0,0.1,1,0,2,1 131 | 74,0,1,120,269,0,0,121,1,0.2,2,1,2,1 132 | 54,0,2,160,201,0,1,163,0,0,2,1,2,1 133 | 49,0,1,134,271,0,1,162,0,0,1,0,2,1 134 | 42,1,1,120,295,0,1,162,0,0,2,0,2,1 135 | 41,1,1,110,235,0,1,153,0,0,2,0,2,1 136 | 41,0,1,126,306,0,1,163,0,0,2,0,2,1 137 | 49,0,0,130,269,0,1,163,0,0,2,0,2,1 138 | 60,0,2,120,178,1,1,96,0,0,2,0,2,1 139 | 62,1,1,128,208,1,0,140,0,0,2,0,2,1 140 | 57,1,0,110,201,0,1,126,1,1.5,1,0,1,1 141 | 64,1,0,128,263,0,1,105,1,0.2,1,1,3,1 142 | 51,0,2,120,295,0,0,157,0,0.6,2,0,2,1 143 | 43,1,0,115,303,0,1,181,0,1.2,1,0,2,1 144 | 42,0,2,120,209,0,1,173,0,0,1,0,2,1 145 | 67,0,0,106,223,0,1,142,0,0.3,2,2,2,1 146 | 76,0,2,140,197,0,2,116,0,1.1,1,0,2,1 147 | 70,1,1,156,245,0,0,143,0,0,2,0,2,1 148 | 44,0,2,118,242,0,1,149,0,0.3,1,1,2,1 149 | 60,0,3,150,240,0,1,171,0,0.9,2,0,2,1 150 | 44,1,2,120,226,0,1,169,0,0,2,0,2,1 151 | 42,1,2,130,180,0,1,150,0,0,2,0,2,1 152 | 66,1,0,160,228,0,0,138,0,2.3,2,0,1,1 153 | 71,0,0,112,149,0,1,125,0,1.6,1,0,2,1 154 | 64,1,3,170,227,0,0,155,0,0.6,1,0,3,1 155 | 66,0,2,146,278,0,0,152,0,0,1,1,2,1 156 | 39,0,2,138,220,0,1,152,0,0,1,0,2,1 157 | 58,0,0,130,197,0,1,131,0,0.6,1,0,2,1 158 | 47,1,2,130,253,0,1,179,0,0,2,0,2,1 159 | 35,1,1,122,192,0,1,174,0,0,2,0,2,1 160 | 58,1,1,125,220,0,1,144,0,0.4,1,4,3,1 161 | 56,1,1,130,221,0,0,163,0,0,2,0,3,1 162 | 
56,1,1,120,240,0,1,169,0,0,0,0,2,1 163 | 55,0,1,132,342,0,1,166,0,1.2,2,0,2,1 164 | 41,1,1,120,157,0,1,182,0,0,2,0,2,1 165 | 38,1,2,138,175,0,1,173,0,0,2,4,2,1 166 | 38,1,2,138,175,0,1,173,0,0,2,4,2,1 167 | 67,1,0,160,286,0,0,108,1,1.5,1,3,2,0 168 | 67,1,0,120,229,0,0,129,1,2.6,1,2,3,0 169 | 62,0,0,140,268,0,0,160,0,3.6,0,2,2,0 170 | 63,1,0,130,254,0,0,147,0,1.4,1,1,3,0 171 | 53,1,0,140,203,1,0,155,1,3.1,0,0,3,0 172 | 56,1,2,130,256,1,0,142,1,0.6,1,1,1,0 173 | 48,1,1,110,229,0,1,168,0,1,0,0,3,0 174 | 58,1,1,120,284,0,0,160,0,1.8,1,0,2,0 175 | 58,1,2,132,224,0,0,173,0,3.2,2,2,3,0 176 | 60,1,0,130,206,0,0,132,1,2.4,1,2,3,0 177 | 40,1,0,110,167,0,0,114,1,2,1,0,3,0 178 | 60,1,0,117,230,1,1,160,1,1.4,2,2,3,0 179 | 64,1,2,140,335,0,1,158,0,0,2,0,2,0 180 | 43,1,0,120,177,0,0,120,1,2.5,1,0,3,0 181 | 57,1,0,150,276,0,0,112,1,0.6,1,1,1,0 182 | 55,1,0,132,353,0,1,132,1,1.2,1,1,3,0 183 | 65,0,0,150,225,0,0,114,0,1,1,3,3,0 184 | 61,0,0,130,330,0,0,169,0,0,2,0,2,0 185 | 58,1,2,112,230,0,0,165,0,2.5,1,1,3,0 186 | 50,1,0,150,243,0,0,128,0,2.6,1,0,3,0 187 | 44,1,0,112,290,0,0,153,0,0,2,1,2,0 188 | 60,1,0,130,253,0,1,144,1,1.4,2,1,3,0 189 | 54,1,0,124,266,0,0,109,1,2.2,1,1,3,0 190 | 50,1,2,140,233,0,1,163,0,0.6,1,1,3,0 191 | 41,1,0,110,172,0,0,158,0,0,2,0,3,0 192 | 51,0,0,130,305,0,1,142,1,1.2,1,0,3,0 193 | 58,1,0,128,216,0,0,131,1,2.2,1,3,3,0 194 | 54,1,0,120,188,0,1,113,0,1.4,1,1,3,0 195 | 60,1,0,145,282,0,0,142,1,2.8,1,2,3,0 196 | 60,1,2,140,185,0,0,155,0,3,1,0,2,0 197 | 59,1,0,170,326,0,0,140,1,3.4,0,0,3,0 198 | 46,1,2,150,231,0,1,147,0,3.6,1,0,2,0 199 | 67,1,0,125,254,1,1,163,0,0.2,1,2,3,0 200 | 62,1,0,120,267,0,1,99,1,1.8,1,2,3,0 201 | 65,1,0,110,248,0,0,158,0,0.6,2,2,1,0 202 | 44,1,0,110,197,0,0,177,0,0,2,1,2,0 203 | 60,1,0,125,258,0,0,141,1,2.8,1,1,3,0 204 | 58,1,0,150,270,0,0,111,1,0.8,2,0,3,0 205 | 68,1,2,180,274,1,0,150,1,1.6,1,0,3,0 206 | 62,0,0,160,164,0,0,145,0,6.2,0,3,3,0 207 | 52,1,0,128,255,0,1,161,1,0,2,1,3,0 208 | 59,1,0,110,239,0,0,142,1,1.2,1,1,3,0 209 | 60,0,0,150,258,0,0,157,0,2.6,1,2,3,0 210 | 49,1,2,120,188,0,1,139,0,2,1,3,3,0 211 | 59,1,0,140,177,0,1,162,1,0,2,1,3,0 212 | 57,1,2,128,229,0,0,150,0,0.4,1,1,3,0 213 | 61,1,0,120,260,0,1,140,1,3.6,1,1,3,0 214 | 39,1,0,118,219,0,1,140,0,1.2,1,0,3,0 215 | 61,0,0,145,307,0,0,146,1,1,1,0,3,0 216 | 56,1,0,125,249,1,0,144,1,1.2,1,1,2,0 217 | 43,0,0,132,341,1,0,136,1,3,1,0,3,0 218 | 62,0,2,130,263,0,1,97,0,1.2,1,1,3,0 219 | 63,1,0,130,330,1,0,132,1,1.8,2,3,3,0 220 | 65,1,0,135,254,0,0,127,0,2.8,1,1,3,0 221 | 48,1,0,130,256,1,0,150,1,0,2,2,3,0 222 | 63,0,0,150,407,0,0,154,0,4,1,3,3,0 223 | 55,1,0,140,217,0,1,111,1,5.6,0,0,3,0 224 | 65,1,3,138,282,1,0,174,0,1.4,1,1,2,0 225 | 56,0,0,200,288,1,0,133,1,4,0,2,3,0 226 | 54,1,0,110,239,0,1,126,1,2.8,1,1,3,0 227 | 70,1,0,145,174,0,1,125,1,2.6,0,0,3,0 228 | 62,1,1,120,281,0,0,103,0,1.4,1,1,3,0 229 | 35,1,0,120,198,0,1,130,1,1.6,1,0,3,0 230 | 59,1,3,170,288,0,0,159,0,0.2,1,0,3,0 231 | 64,1,2,125,309,0,1,131,1,1.8,1,0,3,0 232 | 47,1,2,108,243,0,1,152,0,0,2,0,2,0 233 | 57,1,0,165,289,1,0,124,0,1,1,3,3,0 234 | 55,1,0,160,289,0,0,145,1,0.8,1,1,3,0 235 | 64,1,0,120,246,0,0,96,1,2.2,0,1,2,0 236 | 70,1,0,130,322,0,0,109,0,2.4,1,3,2,0 237 | 51,1,0,140,299,0,1,173,1,1.6,2,0,3,0 238 | 58,1,0,125,300,0,0,171,0,0,2,2,3,0 239 | 60,1,0,140,293,0,0,170,0,1.2,1,2,3,0 240 | 77,1,0,125,304,0,0,162,1,0,2,3,2,0 241 | 35,1,0,126,282,0,0,156,1,0,2,0,3,0 242 | 70,1,2,160,269,0,1,112,1,2.9,1,1,3,0 243 | 59,0,0,174,249,0,1,143,1,0,1,0,2,0 244 | 64,1,0,145,212,0,0,132,0,2,1,2,1,0 245 | 57,1,0,152,274,0,1,88,1,1.2,1,1,3,0 246 | 
56,1,0,132,184,0,0,105,1,2.1,1,1,1,0 247 | 48,1,0,124,274,0,0,166,0,0.5,1,0,3,0 248 | 56,0,0,134,409,0,0,150,1,1.9,1,2,3,0 249 | 66,1,1,160,246,0,1,120,1,0,1,3,1,0 250 | 54,1,1,192,283,0,0,195,0,0,2,1,3,0 251 | 69,1,2,140,254,0,0,146,0,2,1,3,3,0 252 | 51,1,0,140,298,0,1,122,1,4.2,1,3,3,0 253 | 43,1,0,132,247,1,0,143,1,0.1,1,4,3,0 254 | 62,0,0,138,294,1,1,106,0,1.9,1,3,2,0 255 | 67,1,0,100,299,0,0,125,1,0.9,1,2,2,0 256 | 59,1,3,160,273,0,0,125,0,0,2,0,2,0 257 | 45,1,0,142,309,0,0,147,1,0,1,3,3,0 258 | 58,1,0,128,259,0,0,130,1,3,1,2,3,0 259 | 50,1,0,144,200,0,0,126,1,0.9,1,0,3,0 260 | 62,0,0,150,244,0,1,154,1,1.4,1,0,2,0 261 | 38,1,3,120,231,0,1,182,1,3.8,1,0,3,0 262 | 66,0,0,178,228,1,1,165,1,1,1,2,3,0 263 | 52,1,0,112,230,0,1,160,0,0,2,1,2,0 264 | 53,1,0,123,282,0,1,95,1,2,1,2,3,0 265 | 63,0,0,108,269,0,1,169,1,1.8,1,2,2,0 266 | 54,1,0,110,206,0,0,108,1,0,1,1,2,0 267 | 66,1,0,112,212,0,0,132,1,0.1,2,1,2,0 268 | 55,0,0,180,327,0,2,117,1,3.4,1,0,2,0 269 | 49,1,2,118,149,0,0,126,0,0.8,2,3,2,0 270 | 54,1,0,122,286,0,0,116,1,3.2,1,2,2,0 271 | 56,1,0,130,283,1,0,103,1,1.6,0,0,3,0 272 | 46,1,0,120,249,0,0,144,0,0.8,2,0,3,0 273 | 61,1,3,134,234,0,1,145,0,2.6,1,2,2,0 274 | 67,1,0,120,237,0,1,71,0,1,1,0,2,0 275 | 58,1,0,100,234,0,1,156,0,0.1,2,1,3,0 276 | 47,1,0,110,275,0,0,118,1,1,1,1,2,0 277 | 52,1,0,125,212,0,1,168,0,1,2,2,3,0 278 | 58,1,0,146,218,0,1,105,0,2,1,1,3,0 279 | 57,1,1,124,261,0,1,141,0,0.3,2,0,3,0 280 | 58,0,1,136,319,1,0,152,0,0,2,2,2,0 281 | 61,1,0,138,166,0,0,125,1,3.6,1,1,2,0 282 | 42,1,0,136,315,0,1,125,1,1.8,1,0,1,0 283 | 52,1,0,128,204,1,1,156,1,1,1,0,0,0 284 | 59,1,2,126,218,1,1,134,0,2.2,1,1,1,0 285 | 40,1,0,152,223,0,1,181,0,0,2,0,3,0 286 | 61,1,0,140,207,0,0,138,1,1.9,2,1,3,0 287 | 46,1,0,140,311,0,1,120,1,1.8,1,2,3,0 288 | 59,1,3,134,204,0,1,162,0,0.8,2,2,2,0 289 | 57,1,1,154,232,0,0,164,0,0,2,1,2,0 290 | 57,1,0,110,335,0,1,143,1,3,1,1,3,0 291 | 55,0,0,128,205,0,2,130,1,2,1,1,3,0 292 | 61,1,0,148,203,0,1,161,0,0,2,1,3,0 293 | 58,1,0,114,318,0,2,140,0,4.4,0,3,1,0 294 | 58,0,0,170,225,1,0,146,1,2.8,1,2,1,0 295 | 67,1,2,152,212,0,0,150,0,0.8,1,0,3,0 296 | 44,1,0,120,169,0,1,144,1,2.8,0,0,1,0 297 | 63,1,0,140,187,0,0,144,1,4,2,2,3,0 298 | 63,0,0,124,197,0,1,136,1,0,1,0,2,0 299 | 59,1,0,164,176,1,0,90,0,1,1,2,1,0 300 | 57,0,0,140,241,0,1,123,1,0.2,1,0,3,0 301 | 45,1,3,110,264,0,1,132,0,1.2,1,0,3,0 302 | 68,1,0,144,193,1,1,141,0,3.4,1,2,3,0 303 | 57,1,0,130,131,0,1,115,1,1.2,1,1,3,0 304 | 57,0,1,130,236,0,0,174,0,0,1,1,2,0 305 | -------------------------------------------------------------------------------- /featurewiz/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | ################################################################################ 3 | # featurewiz - advanced feature engineering and best features selection in single line of code 4 | # Python v3.6+ 5 | # Created by Ram Seshadri 6 | # Licensed under Apache License v2 7 | ################################################################################ 8 | # Version 9 | from .__version__ import __version__ 10 | from .featurewiz import featurewiz 11 | from .featurewiz import FE_split_one_field_into_many, FE_add_groupby_features_aggregated_to_dataframe 12 | from .featurewiz import FE_start_end_date_time_features 13 | from .featurewiz import classify_features 14 | from .featurewiz import classify_columns,FE_combine_rare_categories 15 | from .featurewiz import FE_count_rows_for_all_columns_by_group 16 | from .featurewiz import 
FE_add_age_by_date_col, FE_split_add_column, FE_get_latest_values_based_on_date_column 17 | from .featurewiz import FE_capping_outliers_beyond_IQR_Range 18 | from .featurewiz import EDA_classify_and_return_cols_by_type, EDA_classify_features_for_deep_learning 19 | from .featurewiz import FE_create_categorical_feature_crosses, EDA_find_skewed_variables 20 | from .featurewiz import FE_find_and_cap_outliers, EDA_find_outliers 21 | from .featurewiz import split_data_n_ways, FE_concatenate_multiple_columns 22 | from .featurewiz import FE_discretize_numeric_variables, reduce_mem_usage 23 | from .ml_models import simple_XGBoost_model, simple_LightGBM_model, complex_XGBoost_model 24 | from .ml_models import complex_LightGBM_model,data_transform, MultiClassSVM 25 | from .ml_models import IterativeBestClassifier, IterativeDoubleClassifier, IterativeSearchClassifier 26 | from .my_encoders import My_LabelEncoder, Groupby_Aggregator, My_LabelEncoder_Pipe, Ranking_Aggregator, DateTime_Transformer 27 | from .my_encoders import Rare_Class_Combiner, Rare_Class_Combiner_Pipe, FE_create_time_series_features, Binning_Transformer 28 | from .my_encoders import Column_Names_Transformer, FE_convert_all_object_columns_to_numeric, Numeric_Transformer 29 | from .my_encoders import TS_Lagging_Transformer, TS_Fourier_Transformer, TS_Trend_Seasonality_Transformer 30 | from .my_encoders import TS_Lagging_Transformer_Pipe, TS_Fourier_Transformer_Pipe 31 | from lazytransform import LazyTransformer, SuloRegressor, SuloClassifier, print_regression_metrics, print_classification_metrics 32 | from lazytransform import print_regression_model_stats, YTransformer, print_sulo_accuracy 33 | from .sulov_method import FE_remove_variables_using_SULOV_method 34 | from .featurewiz import FE_transform_numeric_columns_to_bins, FE_create_interaction_vars 35 | from .stacking_models import Stacking_Classifier, Blending_Regressor, Stacking_Regressor, stacking_models_list 36 | from .stacking_models import StackingClassifier_Multi, analyze_problem_type_array, get_class_distribution 37 | from .auto_encoders import DenoisingAutoEncoder, VariationalAutoEncoder, CNNAutoEncoder 38 | from .auto_encoders import GAN, GANAugmenter 39 | from .featurewiz import EDA_binning_numeric_column_displaying_bins, FE_calculate_duration_from_timestamp 40 | from .featurewiz import FE_convert_mixed_datatypes_to_string, FE_drop_rows_with_infinity 41 | from .featurewiz import EDA_find_remove_columns_with_infinity, FE_split_list_into_columns 42 | from .featurewiz import EDA_remove_special_chars, FE_remove_commas_in_numerics 43 | from .featurewiz import EDA_randomly_select_rows_from_dataframe, remove_duplicate_cols_in_dataset 44 | from .featurewiz import cross_val_model_predictions 45 | from .blagging import BlaggingClassifier 46 | from .featurewiz import FeatureWiz 47 | ################################################################################ 48 | if __name__ == "__main__": 49 | module_type = 'Running' 50 | else: 51 | module_type = 'Imported' 52 | version_number = __version__ 53 | print("""%s featurewiz %s. 
Use the following syntax: 54 | >>> wiz = FeatureWiz(feature_engg = '', nrows=None, transform_target=True, 55 | category_encoders="auto", auto_encoders='VAE', ae_options={}, 56 | add_missing=False, imbalanced=False, verbose=0) 57 | >>> X_train_selected, y_train = wiz.fit_transform(X_train, y_train) 58 | >>> X_test_selected = wiz.transform(X_test) 59 | >>> selected_features = wiz.features 60 | """ %(module_type, version_number)) 61 | ################################################################################ 62 | -------------------------------------------------------------------------------- /featurewiz/__version__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """Specifies the version of the featurewiz package.""" 3 | 4 | __title__ = "featurewiz" 5 | __author__ = "Ram Seshadri" 6 | __description__ = "Advanced Feature Engineering and Feature Selection for any data set, any size" 7 | __url__ = "https://github.com/Auto_ViML/featurewiz.git" 8 | __version__ = "0.6.1" 9 | __license__ = "Apache License 2.0" 10 | __copyright__ = "2020-23 Google" 11 | -------------------------------------------------------------------------------- /featurewiz/classify_method.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import random 4 | np.random.seed(99) 5 | random.seed(42) 6 | ################################################################################ 7 | #### The warnings from Sklearn are so annoying that I have to shut it off ####### 8 | import warnings 9 | warnings.filterwarnings("ignore") 10 | from sklearn.exceptions import DataConversionWarning 11 | warnings.filterwarnings(action='ignore', category=DataConversionWarning) 12 | def warn(*args, **kwargs): 13 | pass 14 | warnings.warn = warn 15 | import logging 16 | #################################################################################### 17 | import pdb 18 | from functools import reduce 19 | import copy 20 | import time 21 | ################################################################################# 22 | def left_subtract(l1,l2): 23 | lst = [] 24 | for i in l1: 25 | if i not in l2: 26 | lst.append(i) 27 | return lst 28 | ################################################################################# 29 | import copy 30 | def EDA_find_remove_columns_with_infinity(df, remove=False): 31 | """ 32 | This function finds all columns in a dataframe that have inifinite values (np.inf or -np.inf) 33 | It returns a list of column names. If the list is empty, it means no columns were found. 34 | If remove flag is set, then it returns a smaller dataframe with inf columns removed. 35 | """ 36 | nums = df.select_dtypes(include='number').columns.tolist() 37 | dfx = df[nums] 38 | sum_rows = np.isinf(dfx).values.sum() 39 | add_cols = list(dfx.columns.to_series()[np.isinf(dfx).any()]) 40 | if sum_rows > 0: 41 | print(' there are %d rows and %d columns with infinity in them...' 
%(sum_rows,len(add_cols))) 42 | if remove: 43 | ### here you need to use df since the whole dataset is involved ### 44 | nocols = [x for x in df.columns if x not in add_cols] 45 | print(" Shape of dataset before %s and after %s removing columns with infinity" %(df.shape,(df[nocols].shape,))) 46 | return df[nocols] 47 | else: 48 | ## this will be a list of columns with infinity #### 49 | return add_cols 50 | else: 51 | ## this will be an empty list if there are no columns with infinity 52 | return add_cols 53 | #################################################################################### 54 | def classify_columns(df_preds, verbose=0): 55 | """ 56 | This actually does Exploratory data analysis - it means this function performs EDA 57 | ###################################################################################### 58 | Takes a dataframe containing only predictors to be classified into various types. 59 | DO NOT SEND IN A TARGET COLUMN since it will try to include that into various columns. 60 | Returns a data frame containing columns and the class it belongs to such as numeric, 61 | categorical, date or id column, boolean, nlp, discrete_string and cols to delete... 62 | ####### Returns a dictionary with 10 kinds of vars like the following: # continuous_vars,int_vars 63 | # cat_vars,factor_vars, bool_vars,discrete_string_vars,nlp_vars,date_vars,id_vars,cols_delete 64 | """ 65 | train = copy.deepcopy(df_preds) 66 | #### If there are 30 chars are more in a discrete_string_var, it is then considered an NLP variable 67 | max_nlp_char_size = 30 68 | max_cols_to_print = 30 69 | print('#######################################################################################') 70 | print('######################## C L A S S I F Y I N G V A R I A B L E S ####################') 71 | print('#######################################################################################') 72 | if verbose: 73 | print('Classifying variables in data set...') 74 | #### Cat_Limit defines the max number of categories a column can have to be called a categorical colum 75 | cat_limit = 35 76 | float_limit = 15 #### Make this limit low so that float variables below this limit become cat vars ### 77 | def add(a,b): 78 | return a+b 79 | sum_all_cols = dict() 80 | orig_cols_total = train.shape[1] 81 | #Types of columns 82 | cols_delete = [] 83 | cols_delete = [col for col in list(train) if (len(train[col].value_counts()) == 1 84 | ) | (train[col].isnull().sum()/len(train) >= 0.90)] 85 | inf_cols = EDA_find_remove_columns_with_infinity(train) 86 | mixed_cols = [x for x in list(train) if len(train[x].dropna().apply(type).value_counts()) > 1] 87 | if len(mixed_cols) > 0: 88 | print(' Removing %s column(s) due to mixed data type detected...' 
%mixed_cols) 89 | cols_delete += mixed_cols 90 | cols_delete += inf_cols 91 | train = train[left_subtract(list(train),cols_delete)] 92 | var_df = pd.Series(dict(train.dtypes)).reset_index(drop=False).rename( 93 | columns={0:'type_of_column'}) 94 | sum_all_cols['cols_delete'] = cols_delete 95 | 96 | var_df['bool'] = var_df.apply(lambda x: 1 if x['type_of_column'] in ['bool','object'] 97 | and len(train[x['index']].value_counts()) == 2 else 0, axis=1) 98 | string_bool_vars = list(var_df[(var_df['bool'] ==1)]['index']) 99 | sum_all_cols['string_bool_vars'] = string_bool_vars 100 | var_df['num_bool'] = var_df.apply(lambda x: 1 if x['type_of_column'] in [np.uint8, 101 | np.uint16, np.uint32, np.uint64, 102 | 'int8','int16','int32','int64', 103 | 'float16','float32','float64'] and len( 104 | train[x['index']].value_counts()) == 2 else 0, axis=1) 105 | num_bool_vars = list(var_df[(var_df['num_bool'] ==1)]['index']) 106 | sum_all_cols['num_bool_vars'] = num_bool_vars 107 | ###### This is where we take all Object vars and split them into diff kinds ### 108 | discrete_or_nlp = var_df.apply(lambda x: 1 if x['type_of_column'] in ['object'] and x[ 109 | 'index'] not in string_bool_vars+cols_delete else 0,axis=1) 110 | ######### This is where we figure out whether a string var is nlp or discrete_string var ### 111 | var_df['nlp_strings'] = 0 112 | var_df['discrete_strings'] = 0 113 | var_df['cat'] = 0 114 | var_df['id_col'] = 0 115 | discrete_or_nlp_vars = var_df.loc[discrete_or_nlp==1]['index'].values.tolist() 116 | copy_discrete_or_nlp_vars = copy.deepcopy(discrete_or_nlp_vars) 117 | if len(discrete_or_nlp_vars) > 0: 118 | for col in copy_discrete_or_nlp_vars: 119 | #### first fill empty or missing vals since it will blowup ### 120 | ### Remember that fillna only works at the dataframe level! 121 | train[[col]] = train[[col]].fillna(' ') 122 | if train[col].map(lambda x: len(x) if type(x)==str else 0).max( 123 | ) >= 50 and len(train[col].value_counts() 124 | ) >= int(0.9*len(train)) and col not in string_bool_vars: 125 | var_df.loc[var_df['index']==col,'nlp_strings'] = 1 126 | elif train[col].map(lambda x: len(x) if type(x)==str else 0).mean( 127 | ) >= max_nlp_char_size and train[col].map(lambda x: len(x) if type(x)==str else 0).max( 128 | ) < 50 and len(train[col].value_counts() 129 | ) <= int(0.9*len(train)) and col not in string_bool_vars: 130 | var_df.loc[var_df['index']==col,'discrete_strings'] = 1 131 | elif len(train[col].value_counts()) > cat_limit and len(train[col].value_counts() 132 | ) <= int(0.9*len(train)) and col not in string_bool_vars: 133 | var_df.loc[var_df['index']==col,'discrete_strings'] = 1 134 | elif len(train[col].value_counts()) > cat_limit and len(train[col].value_counts() 135 | ) == len(train) and col not in string_bool_vars: 136 | var_df.loc[var_df['index']==col,'id_col'] = 1 137 | else: 138 | var_df.loc[var_df['index']==col,'cat'] = 1 139 | nlp_vars = list(var_df[(var_df['nlp_strings'] ==1)]['index']) 140 | sum_all_cols['nlp_vars'] = nlp_vars 141 | discrete_string_vars = list(var_df[(var_df['discrete_strings'] ==1) ]['index']) 142 | sum_all_cols['discrete_string_vars'] = discrete_string_vars 143 | ###### This happens only if a string column happens to be an ID column ####### 144 | #### DO NOT Add this to ID_VARS yet. It will be done later.. Dont change it easily... 145 | #### Category DTYPE vars are very special = they can be left as is and not disturbed in Python. 
### 146 | var_df['dcat'] = var_df.apply(lambda x: 1 if str(x['type_of_column'])=='category' else 0, 147 | axis=1) 148 | factor_vars = list(var_df[(var_df['dcat'] ==1)]['index']) 149 | sum_all_cols['factor_vars'] = factor_vars 150 | ######################################################################## 151 | date_or_id = var_df.apply(lambda x: 1 if x['type_of_column'] in [np.uint8, 152 | np.uint16, np.uint32, np.uint64, 153 | 'int8','int16', 154 | 'int32','int64'] and x[ 155 | 'index'] not in string_bool_vars+num_bool_vars+discrete_string_vars+nlp_vars else 0, 156 | axis=1) 157 | ######### This is where we figure out whether a numeric col is date or id variable ### 158 | var_df['int'] = 0 159 | var_df['date_time'] = 0 160 | ### if a particular column is date-time type, now set it as a date time variable ## 161 | var_df['date_time'] = var_df.apply(lambda x: 1 if x['type_of_column'] in ['<M8[ns]','datetime64[ns]'] and x[ 162 | 'index'] not in string_bool_vars+num_bool_vars+discrete_string_vars+nlp_vars else 0, 163 | axis=1) 164 | ### this is where we save them as date time variables ### 165 | if len(var_df.loc[date_or_id==1]) != 0: 166 | for col in var_df.loc[date_or_id==1]['index'].values.tolist(): 167 | if len(train[col].value_counts()) == len(train): 168 | if train[col].min() < 1900 or train[col].max() > 2050: 169 | var_df.loc[var_df['index']==col,'id_col'] = 1 170 | else: 171 | try: 172 | pd.to_datetime(train[col],infer_datetime_format=True) 173 | var_df.loc[var_df['index']==col,'date_time'] = 1 174 | except: 175 | var_df.loc[var_df['index']==col,'id_col'] = 1 176 | else: 177 | if train[col].min() < 1900 or train[col].max() > 2050: 178 | if col not in num_bool_vars: 179 | var_df.loc[var_df['index']==col,'int'] = 1 180 | else: 181 | try: 182 | pd.to_datetime(train[col],infer_datetime_format=True) 183 | var_df.loc[var_df['index']==col,'date_time'] = 1 184 | except: 185 | if col not in num_bool_vars: 186 | var_df.loc[var_df['index']==col,'int'] = 1 187 | else: 188 | pass 189 | int_vars = list(var_df[(var_df['int'] ==1)]['index']) 190 | date_vars = list(var_df[(var_df['date_time'] == 1)]['index']) 191 | id_vars = list(var_df[(var_df['id_col'] == 1)]['index']) 192 | sum_all_cols['int_vars'] = int_vars 193 | copy_date_vars = copy.deepcopy(date_vars) 194 | for date_var in copy_date_vars: 195 | #### This test is to make sure date vars are actually date vars 196 | try: 197 | pd.to_datetime(train[date_var],infer_datetime_format=True) 198 | except: 199 | ##### if not a date var, then just add it to delete it from processing 200 | cols_delete.append(date_var) 201 | date_vars.remove(date_var) 202 | sum_all_cols['date_vars'] = date_vars 203 | sum_all_cols['id_vars'] = id_vars 204 | sum_all_cols['cols_delete'] = cols_delete 205 | ## This is an EXTREMELY complicated logic for cat vars. Don't change it unless you test it many times!
206 | var_df['numeric'] = 0 207 | float_or_cat = var_df.apply(lambda x: 1 if x['type_of_column'] in ['float16', 208 | 'float32','float64'] else 0, 209 | axis=1) 210 | ####### We need to make sure there are no categorical vars in float ####### 211 | if len(var_df.loc[float_or_cat == 1]) > 0: 212 | for col in var_df.loc[float_or_cat == 1]['index'].values.tolist(): 213 | if len(train[col].value_counts()) > 2 and len(train[col].value_counts() 214 | ) <= float_limit and len(train[col].value_counts()) <= len(train): 215 | var_df.loc[var_df['index']==col,'cat'] = 1 216 | else: 217 | if col not in (num_bool_vars + factor_vars): 218 | var_df.loc[var_df['index']==col,'numeric'] = 1 219 | cat_vars = list(var_df[(var_df['cat'] ==1)]['index']) 220 | continuous_vars = list(var_df[(var_df['numeric'] ==1)]['index']) 221 | 222 | ######## V E R Y I M P O R T A N T ################################################### 223 | cat_vars_copy = copy.deepcopy(factor_vars) 224 | for cat in cat_vars_copy: 225 | if df_preds[cat].dtype==float: 226 | continuous_vars.append(cat) 227 | factor_vars.remove(cat) 228 | var_df.loc[var_df['index']==cat,'dcat'] = 0 229 | var_df.loc[var_df['index']==cat,'numeric'] = 1 230 | elif len(df_preds[cat].value_counts()) == df_preds.shape[0]: 231 | id_vars.append(cat) 232 | factor_vars.remove(cat) 233 | var_df.loc[var_df['index']==cat,'dcat'] = 0 234 | var_df.loc[var_df['index']==cat,'id_col'] = 1 235 | 236 | sum_all_cols['factor_vars'] = factor_vars 237 | ##### There are a couple of extra tests you need to do to remove abberations in cat_vars ### 238 | cat_vars_copy = copy.deepcopy(cat_vars) 239 | for cat in cat_vars_copy: 240 | if df_preds[cat].dtype==float: 241 | continuous_vars.append(cat) 242 | cat_vars.remove(cat) 243 | var_df.loc[var_df['index']==cat,'cat'] = 0 244 | var_df.loc[var_df['index']==cat,'numeric'] = 1 245 | elif len(df_preds[cat].value_counts()) == df_preds.shape[0]: 246 | id_vars.append(cat) 247 | cat_vars.remove(cat) 248 | var_df.loc[var_df['index']==cat,'cat'] = 0 249 | var_df.loc[var_df['index']==cat,'id_col'] = 1 250 | sum_all_cols['cat_vars'] = cat_vars 251 | sum_all_cols['continuous_vars'] = continuous_vars 252 | sum_all_cols['id_vars'] = id_vars 253 | ###### This is where you consoldate the numbers ########### 254 | var_dict_sum = dict(zip(var_df.values[:,0], var_df.values[:,2:].sum(1))) 255 | for col, sumval in var_dict_sum.items(): 256 | if sumval == 0: 257 | print('%s of type=%s is not classified' %(col,train[col].dtype)) 258 | elif sumval > 1: 259 | print('%s of type=%s is classified into more then one type' %(col,train[col].dtype)) 260 | else: 261 | pass 262 | ##### If there are more than 1000 unique values, then add it to NLP vars ### 263 | copy_discretes = copy.deepcopy(discrete_string_vars) 264 | for each_discrete in copy_discretes: 265 | if train[each_discrete].nunique() >= 1000: 266 | nlp_vars.append(each_discrete) 267 | discrete_string_vars.remove(each_discrete) 268 | elif train[each_discrete].nunique() > 100 and train[each_discrete].nunique() < 1000: 269 | pass 270 | else: 271 | ### If it is less than 100 unique values, then make it categorical var 272 | cat_vars.append(each_discrete) 273 | discrete_string_vars.remove(each_discrete) 274 | sum_all_cols['discrete_string_vars'] = discrete_string_vars 275 | sum_all_cols['cat_vars'] = cat_vars 276 | sum_all_cols['nlp_vars'] = nlp_vars 277 | ############### This is where you print all the types of variables ############## 278 | ####### Returns 8 vars in the following order: 
continuous_vars,int_vars,cat_vars, 279 | ### string_bool_vars,discrete_string_vars,nlp_vars,date_or_id_vars,cols_delete 280 | if verbose == 1: 281 | print(" Number of Numeric Columns = ", len(continuous_vars)) 282 | print(" Number of Integer-Categorical Columns = ", len(int_vars)) 283 | print(" Number of String-Categorical Columns = ", len(cat_vars)) 284 | print(" Number of Factor-Categorical Columns = ", len(factor_vars)) 285 | print(" Number of String-Boolean Columns = ", len(string_bool_vars)) 286 | print(" Number of Numeric-Boolean Columns = ", len(num_bool_vars)) 287 | print(" Number of Discrete String Columns = ", len(discrete_string_vars)) 288 | print(" Number of NLP String Columns = ", len(nlp_vars)) 289 | print(" Number of Date Time Columns = ", len(date_vars)) 290 | print(" Number of ID Columns = ", len(id_vars)) 291 | print(" Number of Columns to Delete = ", len(cols_delete)) 292 | if verbose == 2: 293 | print(' Printing upto %d columns max in each category:' %max_cols_to_print) 294 | print(" Numeric Columns : %s" %continuous_vars[:max_cols_to_print]) 295 | print(" Integer-Categorical Columns: %s" %int_vars[:max_cols_to_print]) 296 | print(" String-Categorical Columns: %s" %cat_vars[:max_cols_to_print]) 297 | print(" Factor-Categorical Columns: %s" %factor_vars[:max_cols_to_print]) 298 | print(" String-Boolean Columns: %s" %string_bool_vars[:max_cols_to_print]) 299 | print(" Numeric-Boolean Columns: %s" %num_bool_vars[:max_cols_to_print]) 300 | print(" Discrete String Columns: %s" %discrete_string_vars[:max_cols_to_print]) 301 | print(" NLP text Columns: %s" %nlp_vars[:max_cols_to_print]) 302 | print(" Date Time Columns: %s" %date_vars[:max_cols_to_print]) 303 | print(" ID Columns: %s" %id_vars[:max_cols_to_print]) 304 | print(" Columns that will not be considered in modeling: %s" %cols_delete[:max_cols_to_print]) 305 | ##### now collect all the column types and column names into a single dictionary to return! 306 | 307 | len_sum_all_cols = reduce(add,[len(v) for v in sum_all_cols.values()]) 308 | if len_sum_all_cols == orig_cols_total: 309 | if verbose: 310 | print(' %d Predictors classified...' %orig_cols_total) 311 | #print(' This does not include the Target column(s)') 312 | else: 313 | print('No of columns classified %d does not match %d total cols. Continuing...' 
%( 314 | len_sum_all_cols, orig_cols_total)) 315 | ls = sum_all_cols.values() 316 | flat_list = [item for sublist in ls for item in sublist] 317 | if len(left_subtract(list(train),flat_list)) > 0: 318 | print(' Error: some columns missing from classification are: %s' %left_subtract(list(train),flat_list)) 319 | return sum_all_cols 320 | #################################################################################### 321 | -------------------------------------------------------------------------------- /featurewiz/databunch.py: -------------------------------------------------------------------------------- 1 | ############################################################################### 2 | # MIT License 3 | # 4 | # Copyright (c) 2020 Alex Lekov 5 | # 6 | # Permission is hereby granted, free of charge, to any person obtaining a copy 7 | # of this software and associated documentation files (the "Software"), to deal 8 | # in the Software without restriction, including without limitation the rights 9 | # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 10 | # copies of the Software, and to permit persons to whom the Software is 11 | # furnished to do so, subject to the following conditions: 12 | # 13 | # The above copyright notice and this permission notice shall be included in all 14 | # copies or substantial portions of the Software. 15 | # 16 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 17 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 18 | # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 19 | # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 20 | # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 21 | # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 22 | # SOFTWARE. 23 | ############################################################################### 24 | ##### This amazing Library was created by Alex Lekov: Many Thanks to Alex! ### 25 | ##### https://github.com/Alex-Lekov/AutoML_Alex ### 26 | ############################################################################### 27 | import pandas as pd 28 | import numpy as np 29 | from itertools import combinations 30 | from sklearn.preprocessing import StandardScaler 31 | from category_encoders import HashingEncoder, SumEncoder, PolynomialEncoder, BackwardDifferenceEncoder 32 | from category_encoders import OneHotEncoder, HelmertEncoder, OrdinalEncoder, CountEncoder, BaseNEncoder 33 | from category_encoders import TargetEncoder, CatBoostEncoder, WOEEncoder, JamesSteinEncoder 34 | from category_encoders.glmm import GLMMEncoder 35 | from sklearn.preprocessing import LabelEncoder 36 | from category_encoders.wrapper import PolynomialWrapper 37 | from .encoders import FrequencyEncoder 38 | from . 
import settings 39 | 40 | import pdb 41 | # disable chained assignments 42 | pd.options.mode.chained_assignment = None 43 | import copy 44 | import dask 45 | import dask.dataframe as dd 46 | 47 | class DataBunch(object): 48 | """ 49 | Сlass for storing, cleaning and processing your dataset 50 | """ 51 | def __init__(self, 52 | X_train=None, 53 | y_train=None, 54 | X_test=None, 55 | y_test=None, 56 | cat_features=None, 57 | clean_and_encod_data=True, 58 | cat_encoder_names=None, 59 | clean_nan=True, 60 | num_generator_features=True, 61 | group_generator_features=True, 62 | target_enc_cat_features=True, 63 | normalization=True, 64 | random_state=42, 65 | verbose=1): 66 | """ 67 | Description of __init__ 68 | 69 | Args: 70 | X_train=None (undefined): dataset 71 | y_train=None (undefined): y 72 | X_test=None (undefined): dataset 73 | y_test=None (undefined): y 74 | cat_features=None (list or None): 75 | clean_and_encod_data=True (undefined): 76 | cat_encoder_names=None (list or None): 77 | clean_nan=True (undefined): 78 | num_generator_features=True (undefined): 79 | group_generator_features=True (undefined): 80 | target_enc_cat_features=True (undefined) 81 | random_state=42 (undefined): 82 | verbose = 1 (undefined) 83 | """ 84 | self.random_state = random_state 85 | 86 | self.X_train = None 87 | self.y_train = None 88 | self.X_test = None 89 | self.y_test = None 90 | self.X_train_predicts = None 91 | self.X_test_predicts = None 92 | self.cat_features = None 93 | 94 | # Encoders 95 | self.cat_encoders_names = settings.cat_encoders_names 96 | self.target_encoders_names = settings.target_encoders_names 97 | 98 | 99 | self.cat_encoder_names = cat_encoder_names 100 | self.cat_encoder_names_list = list(self.cat_encoders_names.keys()) + list(self.target_encoders_names.keys()) 101 | self.target_encoders_names_list = list(self.target_encoders_names.keys()) 102 | 103 | # check X_train, y_train, X_test 104 | if self.check_data_format(X_train): 105 | if type(X_train) == dask.dataframe.core.DataFrame: 106 | self.X_train_source = X_train.compute() 107 | else: 108 | self.X_train_source = pd.DataFrame(X_train) 109 | self.X_train_source = remove_duplicate_cols_in_dataset(self.X_train_source) 110 | if X_test is not None: 111 | if self.check_data_format(X_test): 112 | if type(X_test) == dask.dataframe.core.DataFrame: 113 | self.X_test_source = X_test.compute() 114 | else: 115 | self.X_test_source = pd.DataFrame(X_test) 116 | self.X_test_source = remove_duplicate_cols_in_dataset(self.X_test_source) 117 | 118 | 119 | ### There is a chance for an error in this - so worth watching! 120 | if y_train is not None: 121 | le = LabelEncoder() 122 | if self.check_data_format(y_train): 123 | if settings.multi_label: 124 | ### if the model is mult-Label, don't transform it since it won't work 125 | self.y_train_source = y_train 126 | else: 127 | if not isinstance(y_train, pd.DataFrame): 128 | if y_train.dtype == 'object' or str(y_train.dtype) == 'category': 129 | self.y_train_source = le.fit_transform(y_train) 130 | else: 131 | if settings.modeltype == 'Multi_Classification': 132 | rare_class = find_rare_class(y_train) 133 | if rare_class != 0: 134 | ### if the rare class is not zero, then transform it using Label Encoder 135 | y_train = le.fit_transform(y_train) 136 | self.y_train_source = copy.deepcopy(y_train) 137 | else: 138 | print('Error: y_train should be a series. 
Skipping target encoding for dataset...') 139 | target_enc_cat_features = False 140 | else: 141 | if settings.multi_label: 142 | self.y_train_source = pd.DataFrame(y_train) 143 | else: 144 | if y_train.dtype == 'object' or str(y_train.dtype) == 'category': 145 | self.y_train_source = le.fit_transform(pd.DataFrame(y_train)) 146 | else: 147 | self.y_train_source = copy.deepcopy(y_train) 148 | else: 149 | print("No target data found!") 150 | return 151 | 152 | if y_test is not None: 153 | self.y_test = y_test 154 | 155 | if verbose > 0: 156 | print('Source X_train shape: ', self.X_train_source.shape) 157 | if not X_test is None: 158 | print('| Source X_test shape: ', self.X_test_source.shape) 159 | print('#'*50) 160 | 161 | # add categorical features in DataBunch 162 | if cat_features is None: 163 | self.cat_features = self.auto_detect_cat_features(self.X_train_source) 164 | if verbose > 0: 165 | print('Auto detect cat features: ', len(self.cat_features)) 166 | 167 | else: 168 | self.cat_features = list(cat_features) 169 | 170 | # preproc_data in DataBunch 171 | if clean_and_encod_data: 172 | if verbose > 0: 173 | print('> Start preprocessing with %d variables' %self.X_train_source.shape[1]) 174 | self.X_train, self.X_test = self.preproc_data(self.X_train_source, 175 | self.X_test_source, 176 | self.y_train_source, 177 | cat_features=self.cat_features, 178 | cat_encoder_names=cat_encoder_names, 179 | clean_nan=clean_nan, 180 | num_generator_features=num_generator_features, 181 | group_generator_features=group_generator_features, 182 | target_enc_cat_features=target_enc_cat_features, 183 | normalization=normalization, 184 | verbose=verbose,) 185 | else: 186 | self.X_train, self.X_test = X_train, X_test 187 | 188 | 189 | def check_data_format(self, data): 190 | """ 191 | Description of check_data_format: 192 | Check that data is not pd.DataFrame or empty 193 | 194 | Args: 195 | data (undefined): dataset 196 | Return: 197 | True or Exception 198 | """ 199 | data_tmp = pd.DataFrame(data) 200 | if data_tmp is None or data_tmp.empty: 201 | raise Exception("data is not pd.DataFrame or empty") 202 | else: 203 | if isinstance(data, pd.Series) or isinstance(data, pd.DataFrame): 204 | return True 205 | elif isinstance(data, np.ndarray): 206 | return True 207 | elif type(data) == dask.dataframe.core.DataFrame: 208 | return True 209 | else: 210 | False 211 | 212 | def clean_nans(self, data, cols=None): 213 | """ 214 | Fill Nans and add column, that there were nans in this column 215 | 216 | Args: 217 | data (pd.DataFrame, shape (n_samples, n_features)): the input data 218 | cols list() features: the input data 219 | Return: 220 | Clean data (pd.DataFrame, shape (n_samples, n_features)) 221 | 222 | """ 223 | if cols is not None: 224 | nan_columns = list(data[cols].columns[data[cols].isnull().sum() > 0]) 225 | if nan_columns: 226 | for nan_column in nan_columns: 227 | data[nan_column+'_isNAN'] = pd.isna(data[nan_column]).astype('uint8') 228 | data.fillna(data.median(), inplace=True) 229 | return(data) 230 | 231 | 232 | def auto_detect_cat_features(self, data): 233 | """ 234 | Description of _auto_detect_cat_features: 235 | Auto-detection categorical_features by simple rule: 236 | categorical feature == if feature nunique low 1% of data 237 | 238 | Args: 239 | data (pd.DataFrame): dataset 240 | 241 | Returns: 242 | cat_features (list): columns names cat features 243 | 244 | """ 245 | #object_features = list(data.columns[data.dtypes == 'object']) 246 | cat_features = data.columns[(data.nunique(dropna=False) < 
len(data)//100) & \ 247 | (data.nunique(dropna=False) >2)] 248 | #cat_features = list(set([*object_features, *cat_features])) 249 | return (cat_features) 250 | 251 | 252 | def gen_cat_encodet_features(self, data, cat_encoder_name): 253 | """ 254 | Description of _encode_features: 255 | Encode car features 256 | 257 | Args: 258 | data (pd.DataFrame): 259 | cat_encoder_name (str): cat Encoder name 260 | 261 | Returns: 262 | pd.DataFrame 263 | 264 | """ 265 | 266 | if isinstance(cat_encoder_name, str): 267 | if cat_encoder_name in self.cat_encoder_names_list and cat_encoder_name not in self.target_encoders_names_list: 268 | if cat_encoder_name == 'HashingEncoder': 269 | encoder = self.cat_encoders_names[cat_encoder_name][0](cols=self.cat_features, n_components=int(np.log(len(data.columns))*1000), 270 | drop_invariant=True) 271 | else: 272 | encoder = self.cat_encoders_names[cat_encoder_name][0](cols=self.cat_features, drop_invariant=True) 273 | data_encodet = encoder.fit_transform(data) 274 | data_encodet = data_encodet.add_prefix(cat_encoder_name + '_') 275 | else: 276 | print(f"{cat_encoder_name} is not supported!") 277 | return ('', '') 278 | else: 279 | encoder = copy.deepcopy(cat_encoder_name) 280 | data_encodet = encoder.transform(data) 281 | data_encodet = data_encodet.add_prefix(str(cat_encoder_name).split("(")[0] + '_') 282 | 283 | 284 | return (data_encodet, encoder) 285 | 286 | 287 | def gen_target_encodet_features(self, x_data, y_data=None, cat_encoder_name=''): 288 | """ 289 | Description of _encode_features: 290 | Encode car features 291 | 292 | Args: 293 | data (pd.DataFrame): 294 | cat_encoder_name (str): cat Encoder name 295 | 296 | Returns: 297 | pd.DataFrame 298 | 299 | """ 300 | 301 | 302 | if isinstance(cat_encoder_name, str): 303 | ### If it is the first time, it will perform fit_transform ! 304 | if cat_encoder_name in self.target_encoders_names_list: 305 | encoder = self.target_encoders_names[cat_encoder_name][0](cols=self.cat_features, drop_invariant=True) 306 | if settings.modeltype == 'Multi_Classification': 307 | ### you must put a Polynomial Wrapper on the cat_encoder in case the model is multi-class 308 | if cat_encoder_name in ['WOEEncoder']: 309 | encoder = PolynomialWrapper(encoder) 310 | ### All other encoders TargetEncoder CatBoostEncoder GLMMEncoder don't need 311 | ### Polynomial Wrappers since they handle multi-class (label encoded) very well! 312 | cols = encoder.cols 313 | for each_col in cols: 314 | x_data[each_col] = encoder.fit_transform(x_data[each_col], y_data).values 315 | data_encodet = encoder.fit_transform(x_data, y_data) 316 | data_encodet = data_encodet.add_prefix(cat_encoder_name + '_') 317 | else: 318 | print(f"{cat_encoder_name} is not supported!") 319 | return ('', '') 320 | else: 321 | ### if it is already fit, then it will only do transform here ! 
322 | encoder = copy.deepcopy(cat_encoder_name) 323 | data_encodet = encoder.transform(x_data) 324 | data_encodet = data_encodet.add_prefix(str(cat_encoder_name).split("(")[0] + '_') 325 | 326 | 327 | return (data_encodet, encoder) 328 | 329 | def gen_numeric_interaction_features(self, 330 | df, 331 | columns, 332 | operations=['/','*','-','+'],): 333 | """ 334 | Description of numeric_interaction_terms: 335 | Numerical interaction generator features: A/B, A*B, A-B, 336 | 337 | Args: 338 | df (pd.DataFrame): 339 | columns (list): num columns names 340 | operations (list): operations type 341 | 342 | Returns: 343 | pd.DataFrame 344 | 345 | """ 346 | copy_columns = copy.deepcopy(columns) 347 | fe_df = pd.DataFrame() 348 | for combo_col in combinations(columns,2): 349 | if '/' in operations: 350 | fe_df['{}_div_by_{}'.format(combo_col[0], combo_col[1]) ] = (df[combo_col[0]]*1.) / df[combo_col[1]] 351 | if '*' in operations: 352 | fe_df['{}_mult_by_{}'.format(combo_col[0], combo_col[1]) ] = df[combo_col[0]] * df[combo_col[1]] 353 | if '-' in operations: 354 | fe_df['{}_minus_{}'.format(combo_col[0], combo_col[1]) ] = df[combo_col[0]] - df[combo_col[1]] 355 | if '+' in operations: 356 | fe_df['{}_plus_{}'.format(combo_col[0], combo_col[1]) ] = df[combo_col[0]] + df[combo_col[1]] 357 | 358 | for each_col in copy_columns: 359 | fe_df['{}_squared'.format(each_col) ] = df[each_col].pow(2) 360 | return (fe_df) 361 | 362 | 363 | def gen_groupby_cat_encode_features(self, data, cat_columns, num_column, 364 | cat_encoder_name='JamesSteinEncoder'): 365 | """ 366 | Description of group_encoder 367 | 368 | Args: 369 | data (pd.DataFrame): dataset 370 | cat_columns (list): cat columns names 371 | num_column (str): num column name 372 | 373 | Returns: 374 | pd.DataFrame 375 | 376 | """ 377 | 378 | if isinstance(cat_encoder_name, str): 379 | if cat_encoder_name in self.cat_encoder_names_list: 380 | encoder = JamesSteinEncoder(cols=self.cat_features, model='beta', return_df = True, drop_invariant=True) 381 | encoder.fit(X=data[cat_columns], y=data[num_column].values) 382 | else: 383 | print(f"{cat_encoder_name} is not supported!") 384 | return ('', '') 385 | else: 386 | encoder = copy.deepcopy(cat_encoder_name) 387 | 388 | data_encodet = encoder.transform(X=data[cat_columns], y=data[num_column].values) 389 | data_encodet = data_encodet.add_prefix('GroupEncoded_' + num_column + '_') 390 | 391 | return (data_encodet, encoder) 392 | 393 | def preproc_data(self, X_train=None, 394 | X_test=None, 395 | y_train=None, 396 | cat_features=None, 397 | cat_encoder_names=None, 398 | clean_nan=True, 399 | num_generator_features=True, 400 | group_generator_features=True, 401 | target_enc_cat_features=True, 402 | normalization=True, 403 | verbose=1,): 404 | """ 405 | Description of preproc_data: 406 | dataset preprocessing function 407 | 408 | Args: 409 | X_train=None (pd.DataFrame): 410 | X_test=None (pd.DataFrame): 411 | y_train=None (pd.DataFrame): 412 | cat_features=None (list): 413 | cat_encoder_names=None (list): 414 | clean_nan=True (Bool): 415 | num_generator_features=True (Bool): 416 | group_generator_features=True (Bool): 417 | 418 | Returns: 419 | X_train (pd.DataFrame) 420 | X_test (pd.DataFrame) 421 | 422 | """ 423 | 424 | #### Sometimes there are duplicates in column names. You must remove them here. ### 425 | cat_features = find_remove_duplicates(cat_features) 426 | 427 | # concat datasets for correct processing. 
428 | df_train = X_train.copy() 429 | 430 | if X_test is None: 431 | data = df_train 432 | test_data = None ### Set test_data to None if X_test is None 433 | else: 434 | test_data = X_test.copy() 435 | test_data = remove_duplicate_cols_in_dataset(test_data) 436 | data = copy.deepcopy(df_train) 437 | 438 | data = remove_duplicate_cols_in_dataset(data) 439 | 440 | # object & num features 441 | object_features = list(data.columns[(data.dtypes == 'object') | (data.dtypes == 'category')]) 442 | num_features = list(set(data.columns) - set(cat_features) - set(object_features) - {'test'}) 443 | encodet_features_names = list(set(object_features + list(cat_features))) 444 | 445 | original_number_features = len(encodet_features_names) 446 | count_number_features = df_train.shape[1] 447 | 448 | self.encodet_features_names = encodet_features_names 449 | self.num_features_names = num_features 450 | self.binary_features_names = [] 451 | 452 | # LabelEncode all Binary Features - leave the rest alone 453 | cols = data.columns.tolist() 454 | #### This sometimes errors because there are duplicate columns in a dataset ### 455 | print('LabelEncode all Boolean Features. Leave the rest alone') 456 | for feature in cols: 457 | if data[feature].dtype == bool : 458 | print(' boolean feature = ',feature) 459 | 460 | for feature in cols: 461 | if (data[feature].dtype == bool): 462 | data[feature] = data[feature].astype('category').cat.codes 463 | if test_data is not None: 464 | test_data[feature] = test_data[feature].astype('category').cat.codes 465 | self.binary_features_names.append(feature) 466 | 467 | # Convert all Category features "Category" type variables if no encoding is specified 468 | cat_only_encoders = [x for x in self.cat_encoder_names if x in self.cat_encoders_names] 469 | if len(cat_only_encoders) > 0: 470 | ### Just skip if this encoder is not in the list of category encoders ## 471 | if encodet_features_names: 472 | if cat_encoder_names is None: 473 | for feature in encodet_features_names: 474 | data[feature] = data[feature].fillna('missing') 475 | data[feature] = data[feature].astype('category').cat.codes 476 | if test_data is not None: 477 | test_data[feature] = test_data[feature].fillna('missing') 478 | test_data[feature] = test_data[feature].astype('category').cat.codes 479 | else: 480 | #### If an encoder is specified, then use that encoder to transform categorical variables 481 | if verbose > 0: 482 | print('> Generate Categorical Encoded features') 483 | 484 | copy_cat_encoder_names = copy.deepcopy(cat_encoder_names) 485 | for encoder_name in copy_cat_encoder_names: 486 | if verbose > 0: 487 | print(' + To know more, click: %s' %self.cat_encoders_names[encoder_name][1]) 488 | data_encodet, train_encoder = self.gen_cat_encodet_features(data[encodet_features_names], 489 | encoder_name) 490 | if not isinstance(data_encodet, str): 491 | data = pd.concat([data, data_encodet], axis=1) 492 | if test_data is not None: 493 | test_encodet, _ = self.gen_cat_encodet_features(test_data[encodet_features_names], 494 | train_encoder) 495 | if not isinstance(test_encodet, str): 496 | test_data = pd.concat([test_data, test_encodet], axis=1) 497 | 498 | if verbose > 0: 499 | if not isinstance(data_encodet, str): 500 | addl_features = data_encodet.shape[1] - original_number_features 501 | count_number_features += addl_features 502 | print(' + added ', addl_features, ' additional Features using',encoder_name) 503 | 504 | # Generate Target related Encoder features for cat variables: 505 | 506 | 507 | 
target_encoders = [x for x in self.cat_encoder_names if x in self.target_encoders_names_list] 508 | if len(target_encoders) > 0: 509 | target_enc_cat_features = True 510 | if target_enc_cat_features: 511 | if encodet_features_names: 512 | if verbose > 0: 513 | print('> Generate Target Encoded categorical features') 514 | 515 | if len(target_encoders) == 0: 516 | target_encoders = ['TargetEncoder'] ### set the default as TargetEncoder if nothing is specified 517 | copy_target_encoders = copy.deepcopy(target_encoders) 518 | for encoder_name in copy_target_encoders: 519 | if verbose > 0: 520 | print(' + To know more, click: %s' %self.target_encoders_names[encoder_name][1]) 521 | data_encodet, train_encoder = self.gen_target_encodet_features(data[encodet_features_names], 522 | self.y_train_source, encoder_name) 523 | if not isinstance(data_encodet, str): 524 | data = pd.concat([data, data_encodet], axis=1) 525 | 526 | if test_data is not None: 527 | test_encodet, _ = self.gen_target_encodet_features(test_data[encodet_features_names],'', 528 | train_encoder) 529 | if not isinstance(test_encodet, str): 530 | test_data = pd.concat([test_data, test_encodet], axis=1) 531 | 532 | 533 | if verbose > 0: 534 | if not isinstance(data_encodet, str): 535 | addl_features = data_encodet.shape[1] - original_number_features 536 | count_number_features += addl_features 537 | print(' + added ', len(encodet_features_names) , ' additional Features using ', encoder_name) 538 | 539 | # Clean NaNs in Numeric variables only 540 | if clean_nan: 541 | if verbose > 0: 542 | print('> Cleaned NaNs in numeric features') 543 | data = self.clean_nans(data, cols=num_features) 544 | if test_data is not None: 545 | test_data = self.clean_nans(test_data, cols=num_features) 546 | ### Sometimes, train has nulls while test doesn't and vice versa 547 | if test_data is not None: 548 | rem_cols = left_subtract(list(data),list(test_data)) 549 | if len(rem_cols) > 0: 550 | for rem_col in rem_cols: 551 | test_data[rem_col] = 0 552 | elif len(left_subtract(list(test_data),list(data))) > 0: 553 | rem_cols = left_subtract(list(test_data),list(data)) 554 | for rem_col in rem_cols: 555 | data[rem_col] = 0 556 | else: 557 | print(' + test and train have similar NaN columns') 558 | 559 | # Generate interaction features for Numeric variables 560 | if num_generator_features: 561 | if len(num_features) > 1: 562 | if verbose > 0: 563 | print('> Generate Interactions features among Numeric variables') 564 | fe_df = self.gen_numeric_interaction_features(data[num_features], 565 | num_features, 566 | operations=['/','*','-','+'],) 567 | 568 | if not isinstance(fe_df, str): 569 | data = pd.concat([data,fe_df],axis=1) 570 | if test_data is not None: 571 | fe_test = self.gen_numeric_interaction_features(test_data[num_features], 572 | num_features, 573 | operations=['/','*','-','+'],) 574 | if not isinstance(fe_test, str): 575 | test_data = pd.concat([test_data, fe_test], axis=1) 576 | 577 | if verbose > 0: 578 | if not isinstance(fe_df, str): 579 | addl_features = fe_df.shape[1] 580 | count_number_features += addl_features 581 | print(' + added ', addl_features, ' Interaction Features ',) 582 | 583 | # Generate Group Encoded Features for Numeric variables only using all Categorical variables 584 | if group_generator_features: 585 | if encodet_features_names and num_features: 586 | if verbose > 0: 587 | print('> Generate Group-by Encoded Features') 588 | print(' + To know more, click: %s' %self.target_encoders_names['JamesSteinEncoder'][1]) 589 | 590 | for 
num_col in num_features: 591 | data_encodet, train_group_encoder = self.gen_groupby_cat_encode_features( 592 | data, 593 | encodet_features_names, 594 | num_col,) 595 | if not isinstance(data_encodet, str): 596 | data = pd.concat([data, data_encodet],axis=1) 597 | if test_data is not None: 598 | test_encodet, _ = self.gen_groupby_cat_encode_features( 599 | data, 600 | encodet_features_names, 601 | num_col,train_group_encoder) 602 | if not isinstance(test_encodet, str): 603 | test_data = pd.concat([test_data, test_encodet], axis=1) 604 | 605 | if verbose > 0: 606 | addl_features = data_encodet.shape[1]*len(num_features) 607 | count_number_features += addl_features 608 | print(' + added ', addl_features, ' Group-by Encoded Features using JamesSteinEncoder') 609 | 610 | 611 | # Drop source cat features 612 | if not len(cat_encoder_names) == 0: 613 | ### if there is no categorical encoding, then let the categorical_vars pass through. 614 | ### If they have been transformed into Cat Encoded variables, then you can drop them! 615 | data.drop(columns=encodet_features_names, inplace=True) 616 | # In this case, there may be some inf values, replace them ###### 617 | data.replace([np.inf, -np.inf], np.nan, inplace=True) 618 | #data.fillna(0, inplace=True) 619 | if test_data is not None: 620 | if not len(cat_encoder_names) == 0: 621 | ### if there is no categorical encoding, then let the categorical_vars pass through. 622 | test_data.drop(columns=encodet_features_names, inplace=True) 623 | test_data.replace([np.inf, -np.inf], np.nan, inplace=True) 624 | #test_data.fillna(0, inplace=True) 625 | 626 | X_train = copy.deepcopy(data) 627 | X_test = copy.deepcopy(test_data) 628 | 629 | # Normalization Data 630 | if normalization: 631 | if verbose > 0: 632 | print('> Normalization Features') 633 | columns_name = X_train.columns.values 634 | scaler = StandardScaler().fit(X_train) 635 | X_train = scaler.transform(X_train) 636 | X_test = scaler.transform(X_test) 637 | X_train = pd.DataFrame(X_train, columns=columns_name) 638 | X_test = pd.DataFrame(X_test, columns=columns_name) 639 | 640 | if verbose > 0: 641 | print('#'*50) 642 | print('> Final Number of Features: ', (X_train.shape[1])) 643 | print('#'*50) 644 | print('New X_train rows: %s, X_test rows: %s' %(X_train.shape[0], X_test.shape[0])) 645 | print('New X_train columns: %s, X_test columns: %s' %(X_train.shape[1], X_test.shape[1])) 646 | if len(left_subtract(X_test.columns, X_train.columns)) > 0: 647 | print("""There are more columns in test than train 648 | due to missing columns being more in test than train. Continuing...""") 649 | 650 | return X_train, X_test 651 | ################################################################################ 652 | def find_rare_class(series, verbose=0): 653 | ######### Print the % count of each class in a Target variable ##### 654 | """ 655 | Works on Multi Class too. Prints class percentages count of target variable. 656 | It returns the name of the Rare class (the one with the minimum class member count). 657 | This can also be helpful in using it as pos_label in Binary and Multi Class problems. 
658 | """ 659 | return series.value_counts().index[-1] 660 | ################################################################################# 661 | def left_subtract(l1,l2): 662 | lst = [] 663 | for i in l1: 664 | if i not in l2: 665 | lst.append(i) 666 | return lst 667 | ################################################################################# 668 | def remove_duplicate_cols_in_dataset(df): 669 | df = copy.deepcopy(df) 670 | cols = df.columns.tolist() 671 | number_duplicates = df.columns.duplicated().astype(int).sum() 672 | if number_duplicates > 0: 673 | print('Detected %d duplicate columns in dataset. Removing duplicates...' %number_duplicates) 674 | df = df.loc[:,~df.columns.duplicated()] 675 | return df 676 | ########################################################################### 677 | # Removes duplicates from a list to return unique values - USED ONLYONCE 678 | def find_remove_duplicates(values): 679 | output = [] 680 | seen = set() 681 | for value in values: 682 | if value not in seen: 683 | output.append(value) 684 | seen.add(value) 685 | return output 686 | ################################################################################# 687 | -------------------------------------------------------------------------------- /featurewiz/encoders.py: -------------------------------------------------------------------------------- 1 | ############################################################################### 2 | # MIT License 3 | # 4 | # Copyright (c) 2020 Alex Lekov 5 | # 6 | # Permission is hereby granted, free of charge, to any person obtaining a copy 7 | # of this software and associated documentation files (the "Software"), to deal 8 | # in the Software without restriction, including without limitation the rights 9 | # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 10 | # copies of the Software, and to permit persons to whom the Software is 11 | # furnished to do so, subject to the following conditions: 12 | # 13 | # The above copyright notice and this permission notice shall be included in all 14 | # copies or substantial portions of the Software. 15 | # 16 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 17 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 18 | # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 19 | # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 20 | # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 21 | # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 22 | # SOFTWARE. 23 | ############################################################################### 24 | ##### This amazing Library was created by Alex Lekov: Many Thanks to Alex! ### 25 | ##### https://github.com/Alex-Lekov/AutoML_Alex ### 26 | ############################################################################### 27 | import pandas as pd 28 | import numpy as np 29 | 30 | ################################################################ 31 | # Simple Encoders 32 | # (do not use information about target) 33 | ################################################################ 34 | 35 | class FrequencyEncoder(): 36 | """ 37 | FrequencyEncoder 38 | Conversion of category into frequencies. 39 | Parameters 40 | ---------- 41 | cols : list of categorical features. 
42 | drop_invariant : not used 43 | """ 44 | def __init__(self, cols=None, drop_invariant=None): 45 | """ 46 | Description of __init__ 47 | 48 | Args: 49 | cols=None (undefined): columns in dataset 50 | drop_invariant=None (undefined): not used 51 | 52 | """ 53 | self.cols = cols 54 | self.counts_dict = None 55 | 56 | def fit(self, X: pd.DataFrame, y=None) -> pd.DataFrame: 57 | """ 58 | Description of fit 59 | 60 | Args: 61 | X (pd.DataFrame): dataset 62 | y=None (not used): not used 63 | 64 | Returns: 65 | pd.DataFrame 66 | 67 | """ 68 | counts_dict = {} 69 | if self.cols is None: 70 | self.cols = X.columns 71 | for col in self.cols: 72 | values = X[col].value_counts(dropna=False).index 73 | n_obs = np.float(len(X)) 74 | counts = list(X[col].value_counts(dropna=False) / n_obs) 75 | counts_dict[col] = dict(zip(values, counts)) 76 | self.counts_dict = counts_dict 77 | 78 | def transform(self, X: pd.DataFrame) -> pd.DataFrame: 79 | """ 80 | Description of transform 81 | 82 | Args: 83 | X (pd.DataFrame): dataset 84 | 85 | Returns: 86 | pd.DataFrame 87 | 88 | """ 89 | counts_dict_test = {} 90 | res = [] 91 | for col in self.cols: 92 | values = X[col].value_counts(1,dropna=False).index.tolist() 93 | counts = X[col].value_counts(1,dropna=False).values.tolist() 94 | counts_dict_test[col] = dict(zip(values, counts)) 95 | 96 | # if value is in "train" keys - replace "test" counts with "train" counts 97 | for k in [ 98 | key 99 | for key in counts_dict_test[col].keys() 100 | if key in self.counts_dict[col].keys() 101 | ]: 102 | counts_dict_test[col][k] = self.counts_dict[col][k] 103 | res.append(X[col].map(counts_dict_test[col]).values.reshape(-1, 1)) 104 | try: 105 | res = np.hstack(res) 106 | except: 107 | pdb.set_trace() 108 | X[self.cols] = res 109 | return X 110 | 111 | def fit_transform(self, X: pd.DataFrame, y=None) -> pd.DataFrame: 112 | """ 113 | Description of fit_transform 114 | 115 | Args: 116 | X (pd.DataFrame): dataset 117 | y=None (undefined): not used 118 | 119 | Returns: 120 | pd.DataFrame 121 | 122 | """ 123 | self.fit(X, y) 124 | X = self.transform(X) 125 | return X 126 | -------------------------------------------------------------------------------- /featurewiz/settings.py: -------------------------------------------------------------------------------- 1 | ### this defines some of the global settings for encoder names in one place #### 2 | from category_encoders import HashingEncoder, SumEncoder, PolynomialEncoder, BackwardDifferenceEncoder 3 | from category_encoders import OneHotEncoder, HelmertEncoder, OrdinalEncoder, CountEncoder, BaseNEncoder 4 | from category_encoders import TargetEncoder, CatBoostEncoder, WOEEncoder, JamesSteinEncoder 5 | from category_encoders.glmm import GLMMEncoder 6 | from sklearn.preprocessing import LabelEncoder 7 | from category_encoders.wrapper import PolynomialWrapper 8 | from .encoders import FrequencyEncoder 9 | ################################################################################# 10 | def init(): 11 | global cat_encoders_names 12 | cat_encoders_names = { 13 | 'HashingEncoder': [HashingEncoder,'https://contrib.scikit-learn.org/category_encoders/hashing.html'], 14 | 'SumEncoder': [SumEncoder,'https://contrib.scikit-learn.org/category_encoders/sum.html'], 15 | 'PolynomialEncoder': [PolynomialEncoder,'https://contrib.scikit-learn.org/category_encoders/polynomial.html'], 16 | 'BackwardDifferenceEncoder': [BackwardDifferenceEncoder,'https://contrib.scikit-learn.org/category_encoders/backward_difference.html'], 17 | 'OneHotEncoder': 
[OneHotEncoder,'https://contrib.scikit-learn.org/category_encoders/onehot.html'], 18 | 'HelmertEncoder': [HelmertEncoder,'https://contrib.scikit-learn.org/category_encoders/helmert.html'], 19 | 'OrdinalEncoder': [OrdinalEncoder,'https://contrib.scikit-learn.org/category_encoders/ordinal.html'], 20 | 'BaseNEncoder': [BaseNEncoder,'https://contrib.scikit-learn.org/category_encoders/basen.html'], 21 | 'FrequencyEncoder': [FrequencyEncoder,'https://github.com/Alex-Lekov/AutoML_Alex/blob/master/automl_alex/encoders.py'], 22 | } 23 | 24 | global target_encoders_names 25 | target_encoders_names = { 26 | 'TargetEncoder': [TargetEncoder,'https://contrib.scikit-learn.org/category_encoders/targetencoder.html'], 27 | 'CatBoostEncoder': [CatBoostEncoder,'https://contrib.scikit-learn.org/category_encoders/catboost.html'], 28 | 'WOEEncoder': [WOEEncoder,'https://contrib.scikit-learn.org/category_encoders/woe.html'], 29 | 'JamesSteinEncoder': [JamesSteinEncoder,'https://contrib.scikit-learn.org/category_encoders/jamesstein.html'], 30 | 'GLMMEncoder': [GLMMEncoder,'https://contrib.scikit-learn.org/category_encoders/glmm.html'], 31 | } 32 | 33 | global modeltpe 34 | modeltpe = '' 35 | global multi_label 36 | multi_label = False 37 | ################################################################################# 38 | -------------------------------------------------------------------------------- /featurewiz/sulov_method.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import random 4 | np.random.seed(99) 5 | random.seed(42) 6 | from . import settings 7 | settings.init() 8 | ################################################################################ 9 | #### The warnings from Sklearn are so annoying that I have to shut it off ####### 10 | import warnings 11 | warnings.filterwarnings("ignore") 12 | from sklearn.exceptions import DataConversionWarning 13 | warnings.filterwarnings(action='ignore', category=DataConversionWarning) 14 | def warn(*args, **kwargs): 15 | pass 16 | warnings.warn = warn 17 | import logging 18 | #################################################################################### 19 | import pdb 20 | import copy 21 | import time 22 | from sklearn.feature_selection import chi2, mutual_info_regression, mutual_info_classif 23 | from sklearn.feature_selection import SelectKBest 24 | from itertools import combinations 25 | import matplotlib.patches as mpatches 26 | import matplotlib.pyplot as plt 27 | ################################################################################################# 28 | from collections import defaultdict 29 | from collections import OrderedDict 30 | import time 31 | import networkx as nx # Import networkx for groupwise method 32 | ################################################################################# 33 | def left_subtract(l1,l2): 34 | lst = [] 35 | for i in l1: 36 | if i not in l2: 37 | lst.append(i) 38 | return lst 39 | ################################################################################# 40 | def return_dictionary_list(lst_of_tuples): 41 | """ Returns a dictionary of lists if you send in a list of Tuples""" 42 | orDict = defaultdict(list) 43 | # iterating over list of tuples 44 | for key, val in lst_of_tuples: 45 | orDict[key].append(val) 46 | return orDict 47 | ################################################################################ 48 | def find_remove_duplicates(list_of_values): 49 | """ 50 | # Removes duplicates from a list to 
return unique values - USED ONLY ONCE 51 | """ 52 | output = [] 53 | seen = set() 54 | for value in list_of_values: 55 | if value not in seen: 56 | output.append(value) 57 | seen.add(value) 58 | return output 59 | ################################################################################## 60 | def remove_highly_correlated_vars_fast(df, corr_limit): # Keeping original function for fallback 61 | """Fast method to remove highly correlated vars using just linear correlation.""" 62 | corr_matrix = df.corr().abs() 63 | upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)) 64 | to_drop = [column for column in upper.columns if any(upper[column] > corr_limit)] 65 | return to_drop 66 | 67 | def FE_remove_variables_using_SULOV_method(df, numvars, modeltype, target, 68 | corr_limit = 0.70, verbose=0, dask_xgboost_flag=False, 69 | correlation_types = ['pearson'], # New parameter for correlation types 70 | adaptive_threshold = False, # New parameter for adaptive threshold 71 | sulov_mode = 'pairwise'): # New parameter for SULOV mode (pairwise/groupwise) 72 | """ 73 | FE stands for Feature Engineering - it means this function performs feature engineering 74 | ########################################################################################### 75 | ##### SULOV stands for Searching Uncorrelated List Of Variables ############# 76 | ########################################################################################### 77 | SULOV method was created by Ram Seshadri in 2018. This highly efficient method removes 78 | variables that are highly correlated using a series of pair-wise correlation knockout 79 | rounds. It is extremely fast and hence can work on thousands of variables in less than 80 | a minute, even on a laptop. You need to send in a list of numeric variables and that's 81 | all! The method defines high Correlation as anything over 0.70 (absolute) but this can 82 | be changed. If two variables have absolute correlation higher than this, they will be 83 | marked, and using a process of elimination, one of them will get knocked out: 84 | To decide order of variables to keep, we use mutuail information score to select. MIS returns 85 | a ranked list of these correlated variables: when we select one, we knock out others that 86 | are highly correlated to it. Then we select next variable to inspect. This continues until 87 | we knock out all highly correlated variables in each set of variables. Finally we are 88 | left with only uncorrelated variables that are also highly important in mutual score. 89 | ########################################################################################### 90 | ######## YOU MUST INCLUDE THE ABOVE MESSAGE IF YOU COPY SULOV method IN YOUR LIBRARY ########## 91 | ########################################################################################### 92 | """ 93 | df = copy.deepcopy(df) 94 | df_target = df[target] 95 | df = df[numvars] 96 | ### for some reason, doing a mass fillna of vars doesn't work! Hence doing it individually! 
97 | null_vars = np.array(numvars)[df.isnull().sum()>0] 98 | for each_num in null_vars: 99 | df[each_num] = df[each_num].fillna(0) 100 | target = copy.deepcopy(target) 101 | if verbose: 102 | print('#######################################################################################') 103 | print('##### Searching for Uncorrelated List Of Variables (SULOV) in %s features ############' %len(numvars)) 104 | print('#######################################################################################') 105 | print('Starting SULOV with %d features...' %len(numvars)) 106 | 107 | # 1. Calculate Correlation Matrices based on correlation_types parameter 108 | correlation_matrices = {} 109 | for corr_type in correlation_types: 110 | correlation_matrices[corr_type] = df.corr(method=corr_type).abs() 111 | 112 | # 2. Adaptive Threshold (if enabled) 113 | current_corr_threshold = corr_limit 114 | if adaptive_threshold: 115 | combined_corr_matrix = pd.concat(correlation_matrices.values()).max(level=0) # Max across all corr types 116 | upper_triangle_corrs = combined_corr_matrix.where(np.triu(np.ones(combined_corr_matrix.shape),k=1).astype(bool)).stack().sort_values(ascending=False) 117 | correlation_values = upper_triangle_corrs.values 118 | current_corr_threshold = np.percentile(correlation_values, 75) # Example: 75th percentile 119 | print(f"Adaptive Correlation Threshold: {current_corr_threshold:.3f}") 120 | 121 | # 3. Find Correlated Pairs based on all selected correlation types 122 | correlated_pairs = [] 123 | for i in range(len(df.columns)): 124 | for j in range(i + 1, len(df.columns)): 125 | col1 = df.columns[i] 126 | col2 = df.columns[j] 127 | is_correlated = False 128 | for corr_type, corr_matrix in correlation_matrices.items(): 129 | if corr_matrix.loc[col1, col2] >= current_corr_threshold: 130 | is_correlated = True 131 | break # If correlated by any type, consider them correlated 132 | if is_correlated: 133 | correlated_pairs.append((col1, col2)) 134 | 135 | # Deterministic sorting of correlated pairs (always applied) 136 | correlated_pairs.sort() 137 | 138 | if modeltype == 'Regression': 139 | sel_function = mutual_info_regression 140 | else: 141 | sel_function = mutual_info_classif 142 | 143 | if correlated_pairs: # Proceed only if correlated pairs are found 144 | if isinstance(target, list): 145 | target = target[0] 146 | max_feats = len(numvars) # Changed from len(corr_list) to numvars to be more robust 147 | 148 | ##### you must ensure there are no infinite nor null values in corr_list df ## 149 | df_fit_cols = find_remove_duplicates(sum(correlated_pairs,())) # Unique cols from correlated pairs 150 | df_fit = df[df_fit_cols] 151 | 152 | ### Now check if there are any NaN values in the dataset ##### 153 | if df_fit.isnull().sum().sum() > 0: 154 | df_fit = df_fit.dropna() 155 | else: 156 | print(' there are no null values in dataset...') 157 | 158 | if df_target.isnull().sum().sum() > 0: 159 | print(' there are null values in target. 
Returning with all vars...') 160 | return numvars 161 | else: 162 | print(' there are no null values in target column...') 163 | 164 | ##### Ready to perform fit and find mutual information score #### 165 | 166 | try: 167 | if modeltype == 'Regression': 168 | fs = mutual_info_regression(df_fit, df_target, n_neighbors=5, discrete_features=False, random_state=42) 169 | else: 170 | fs = mutual_info_classif(df_fit, df_target, n_neighbors=5, discrete_features=False, random_state=42) 171 | except Exception as e: 172 | print(f' SelectKBest() function is erroring with: {e}. Returning with all {len(numvars)} variables...') 173 | return numvars 174 | 175 | try: 176 | ################################################################################# 177 | ####### This is the main section where we use mutual info score to select vars 178 | ################################################################################# 179 | mutual_info = dict(zip(df_fit_cols,fs)) # Use df_fit_cols as keys 180 | #### The first variable in list has the highest correlation to the target variable ### 181 | sorted_by_mutual_info =[key for (key,val) in sorted(mutual_info.items(), key=lambda kv: kv[1],reverse=True)] 182 | 183 | if sulov_mode == 'pairwise': 184 | ##### Now we select the final list of correlated variables (Pairwise SULOV) ########### 185 | selected_corr_list = [] 186 | copy_sorted = copy.deepcopy(sorted_by_mutual_info) 187 | analyzed_pairs = set() # Track analyzed pairs to avoid redundancy 188 | 189 | for col1_sorted in copy_sorted: 190 | selected_corr_list.append(col1_sorted) 191 | for col2_tup in correlated_pairs: 192 | col1_corr, col2_corr = col2_tup 193 | pair = tuple(sorted(col2_tup)) # Ensure consistent pair order 194 | if col1_sorted == col1_corr and pair not in analyzed_pairs: # Check if current sorted col is part of a correlated pair 195 | analyzed_pairs.add(pair) 196 | if col2_corr in copy_sorted: 197 | copy_sorted.remove(col2_corr) 198 | elif col1_sorted == col2_corr and pair not in analyzed_pairs: # Check if current sorted col is part of a correlated pair 199 | analyzed_pairs.add(pair) 200 | if col1_corr in copy_sorted: 201 | copy_sorted.remove(col1_corr) 202 | 203 | elif sulov_mode == 'groupwise': 204 | ##### Groupwise SULOV ########### 205 | G = nx.Graph() 206 | for col in df_fit_cols: # Use df_fit_cols for graph nodes 207 | G.add_node(col) 208 | for col1_g, col2_g in correlated_pairs: 209 | G.add_edge(col1_g, col2_g) 210 | correlated_feature_groups = list(nx.connected_components(G)) 211 | 212 | selected_corr_list = [] 213 | features_to_drop_in_group = set() 214 | for group in correlated_feature_groups: 215 | if len(group) > 1: 216 | group_mis_scores = {feature: mutual_info.get(feature, 0) for feature in group} # Get MIS for group, default 0 if not found 217 | best_feature = max(group_mis_scores, key=group_mis_scores.get) # Feature with max MIS 218 | selected_corr_list.append(best_feature) # Keep best feature 219 | for feature in group: 220 | if feature != best_feature: 221 | features_to_drop_in_group.add(feature) 222 | selected_corr_list = list(set(selected_corr_list)) # Ensure unique selected features 223 | removed_cols_sulov = list(features_to_drop_in_group) # Features removed in groupwise mode 224 | final_list_corr_part = selected_corr_list # Renamed for clarity 225 | 226 | else: # Default to original pairwise logic if mode is not recognized 227 | print(f"Warning: Unknown SULOV mode '{sulov_mode}'. 
Defaulting to pairwise mode.") 228 | ##### Original Pairwise SULOV logic (as fallback) ###### 229 | selected_corr_list = [] 230 | copy_sorted = copy.deepcopy(sorted_by_mutual_info) 231 | copy_pair = dict(return_dictionary_list(correlated_pairs)) # Recreate pair dict if needed 232 | for each_corr_name in copy_sorted: 233 | selected_corr_list.append(each_corr_name) 234 | if each_corr_name in copy_pair: # Check if key exists before accessing 235 | for each_remove in copy_pair[each_corr_name]: 236 | if each_remove in copy_sorted: 237 | copy_sorted.remove(each_remove) 238 | final_list_corr_part = selected_corr_list # Renamed for clarity 239 | 240 | 241 | if sulov_mode != 'groupwise': # For pairwise and default modes 242 | final_list_corr_part = selected_corr_list # Renamed for consistency 243 | removed_cols_sulov = left_subtract(df_fit_cols, final_list_corr_part) # Calculate removed cols 244 | 245 | ##### Now we combine the uncorrelated list to the selected correlated list above 246 | rem_col_list = left_subtract(numvars, df_fit_cols) # Uncorrelated columns are those not in df_fit_cols 247 | final_list = rem_col_list + final_list_corr_part 248 | removed_cols = left_subtract(numvars, final_list) + removed_cols_sulov # Combine all removed cols 249 | 250 | except Exception as e: 251 | print(f' SULOV Method crashing due to: {e}') 252 | #### Dropping highly correlated Features fast using simple linear correlation ### 253 | removed_cols = remove_highly_correlated_vars_fast(df,corr_limit) 254 | final_list = left_subtract(numvars, removed_cols) 255 | 256 | if len(removed_cols) > 0: 257 | if verbose: 258 | print(f' Removing ({len(removed_cols)}) highly correlated variables:') 259 | if len(removed_cols) <= 30: 260 | print(f' {removed_cols}') 261 | if len(final_list) <= 30: 262 | print(f' Following ({len(final_list)}) vars selected: {final_list}') 263 | 264 | ############## D R A W C O R R E L A T I O N N E T W O R K ################## 265 | selected = copy.deepcopy(final_list) 266 | if verbose and len(selected) <= 1000 and correlated_pairs: # Draw only if correlated pairs exist 267 | try: 268 | #### Now start building the graph ################### 269 | gf = nx.Graph() 270 | ### the mutual info score gives the size of the bubble ### 271 | multiplier = 2100 272 | for each in sorted_by_mutual_info: # Use sorted_by_mutual_info for node order 273 | if each in mutual_info: # Check if mutual_info exists for the node 274 | gf.add_node(each, size=int(max(1,mutual_info[each]*multiplier))) 275 | 276 | ######### This is where you calculate the size of each node to draw 277 | sizes = [mutual_info.get(x,0)*multiplier for x in list(gf.nodes())] # Use .get with default 0 for robustness 278 | 279 | corr = df[df_fit_cols].corr() # Use df_fit_cols for correlation calculation 280 | high_corr = corr[abs(corr)>current_corr_threshold] # Use adaptive/original threshold 281 | combos = combinations(df_fit_cols,2) # Use df_fit_cols for combinations 282 | 283 | ### this gives the strength of correlation between 2 nodes ## 284 | multiplier_edge = 20 # Renamed to avoid confusion 285 | for (var1, var2) in combos: 286 | if np.isnan(high_corr.loc[var1,var2]): 287 | pass 288 | else: 289 | gf.add_edge(var1, var2,weight=multiplier_edge*high_corr.loc[var1,var2]) 290 | 291 | ######## Now start building the networkx graph ########################## 292 | widths = nx.get_edge_attributes(gf, 'weight') 293 | nodelist = gf.nodes() 294 | cols = 5 295 | height_size = 5 296 | width_size = 15 297 | rows = int(len(df_fit_cols)/cols) # Use df_fit_cols 
length 298 | if rows < 1: 299 | rows = 1 300 | plt.figure(figsize=(width_size,min(20,height_size*rows))) 301 | pos = nx.shell_layout(gf) 302 | nx.draw_networkx_nodes(gf,pos, 303 | nodelist=nodelist, 304 | node_size=sizes, 305 | node_color='blue', 306 | alpha=0.5) 307 | nx.draw_networkx_edges(gf,pos, 308 | edgelist = widths.keys(), 309 | width=list(widths.values()), 310 | edge_color='lightblue', 311 | alpha=0.6) 312 | pos_higher = {} 313 | x_off = 0.04 # offset on the x axis 314 | y_off = 0.04 # offset on the y axis 315 | for k, v in pos.items(): 316 | pos_higher[k] = (v[0]+x_off, v[1]+y_off) 317 | 318 | labels_dict = {} # Create labels dictionary 319 | for x in nodelist: 320 | if x in selected: 321 | labels_dict[x] = x+' (selected)' 322 | else: 323 | labels_dict[x] = x+' (removed)' 324 | 325 | nx.draw_networkx_labels(gf, pos=pos_higher, 326 | labels = labels_dict, 327 | font_color='black') 328 | 329 | plt.box(True) 330 | plt.title("""In SULOV, we repeatedly remove features with lower mutual info scores among highly correlated pairs (see figure), 331 | SULOV selects the feature with higher mutual info score related to target when choosing between a pair. """, fontsize=10) 332 | plt.suptitle('How SULOV Method Works by Removing Highly Correlated Features', fontsize=20,y=1.03) 333 | red_patch = mpatches.Patch(color='blue', label='Bigger circle denotes higher mutual info score with target') 334 | blue_patch = mpatches.Patch(color='lightblue', label='Thicker line denotes higher correlation between two variables') 335 | plt.legend(handles=[red_patch, blue_patch],loc='best') 336 | plt.show(); 337 | ##### N E T W O R K D I A G R A M C O M P L E T E ################# 338 | return final_list 339 | except Exception as e: 340 | print(f' Networkx library visualization crashing due to {e}') 341 | print(f'Completed SULOV. {len(final_list)} features selected') 342 | return final_list 343 | else: 344 | print(f'Completed SULOV. {len(final_list)} features selected') 345 | return final_list 346 | print(f'Completed SULOV. 
All {len(numvars)} features selected') 347 | return numvars 348 | ################################################################################### -------------------------------------------------------------------------------- /images/MRMR.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/MRMR.png -------------------------------------------------------------------------------- /images/SULOV.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/SULOV.jpg -------------------------------------------------------------------------------- /images/feather_example.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/feather_example.jpg -------------------------------------------------------------------------------- /images/feature_engg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/feature_engg.png -------------------------------------------------------------------------------- /images/feature_engg_old.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/feature_engg_old.jpg -------------------------------------------------------------------------------- /images/featurewiz_background.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/featurewiz_background.jpg -------------------------------------------------------------------------------- /images/featurewiz_logo.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/featurewiz_logo.jpg -------------------------------------------------------------------------------- /images/featurewiz_logos.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/featurewiz_logos.png -------------------------------------------------------------------------------- /images/featurewiz_logos_old.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/featurewiz_logos_old.png -------------------------------------------------------------------------------- /images/featurewiz_mrmr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/featurewiz_mrmr.png -------------------------------------------------------------------------------- /images/xgboost.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/xgboost.jpg 
-------------------------------------------------------------------------------- /old_README.md: -------------------------------------------------------------------------------- 1 | # featurewiz 2 | `featurewiz` is the best feature selection library for boosting your machine learning performance with minimal effort and maximum relevance using the famous MRMR algorithm. 3 | 4 | ![banner](images/featurewiz_logos.png) 5 | 6 | # Table of Contents 7 | 23 | 24 | ## Latest 25 | `featurewiz` 5.0 version is out! It contains brand new Deep Learning Auto Encoders to enrich your data for the toughest imbalanced and multi-class datasets. If you are looking for the latest and greatest updates about our library, check out our updates page. 26 |
      27 | 28 | ## Citation 29 | If you use featurewiz in your research project or paper, please use the following format for citations: 30 |

      31 | "Seshadri, Ram (2020). GitHub - AutoViML/featurewiz: Use advanced feature engineering strategies and select the best features from your data set fast with a single line of code. source code: https://github.com/AutoViML/featurewiz " 32 |

      33 | Current citations for featurewiz 34 | 35 | [Google Scholar](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C31&q=featurewiz&btnG=) 36 | 37 | ## Highlights 38 | `featurewiz` stands out as a versatile and powerful tool for feature selection and engineering, capable of significantly enhancing model performance through intelligent feature transformation and selection techniques. Its unique methods like SULOV and recursive XGBoost, combined with advanced feature engineering options, make it a valuable addition to any data scientist's toolkit: 39 | ### Best Feature Selection Algorithm 40 |

    It provides one of the best automatic feature selection algorithms, the Minimum Redundancy Maximum Relevance (MRMR) algorithm, which Wikipedia describes as follows: "The mRMR selection has been found to be more powerful than the maximum relevance feature selection", and more powerful than alternatives such as Boruta.
    41 | 42 | ### Advanced Feature Engineering Options 43 | featurewiz extends beyond traditional feature selection by including powerful feature engineering capabilities such as: 44 |
    1. Auto Encoders, including Denoising Auto Encoders (DAEs) and Variational Auto Encoders (VAEs), for improved model performance, especially on imbalanced datasets.
    45 |
    2. A variety of category encoders like HashingEncoder, SumEncoder, PolynomialEncoder, BackwardDifferenceEncoder, OneHotEncoder, HelmertEncoder, OrdinalEncoder, and BaseNEncoder.
    46 |
    3. The ability to add interaction features (e.g., x1*x2, x2*x3, x1^2), group-by features, and target encoding. A minimal sketch of hand-rolled interaction features appears right after this list.
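To make the interaction idea concrete, here is a small, self-contained pandas sketch of the kinds of columns featurewiz generates. The `x1`/`x2` column names and the helper function are purely illustrative and not part of featurewiz's API; the suffixes simply mirror the `_div_by_`, `_mult_by_`, `_minus_`, `_plus_` and `_squared` names that featurewiz's DataBunch uses internally.

```python
import pandas as pd
from itertools import combinations

def make_interactions(df, cols):
    """Illustrative only: build pairwise interaction features plus squared terms."""
    out = pd.DataFrame(index=df.index)
    for a, b in combinations(cols, 2):
        out[f'{a}_div_by_{b}'] = df[a] / df[b]
        out[f'{a}_mult_by_{b}'] = df[a] * df[b]
        out[f'{a}_minus_{b}'] = df[a] - df[b]
        out[f'{a}_plus_{b}'] = df[a] + df[b]
    for c in cols:
        out[f'{c}_squared'] = df[c].pow(2)
    return out

df = pd.DataFrame({'x1': [1.0, 2.0, 3.0], 'x2': [4.0, 5.0, 6.0]})
print(make_interactions(df, ['x1', 'x2']))
```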
    47 | 48 | ### SULOV Method for Feature Selection 49 |
    SULOV stands for "Searching for Uncorrelated List Of Variables". It selects features that are uncorrelated with each other but have high correlation with the target variable, based on the Minimum Redundancy Maximum Relevance (mRMR) principle. This method effectively reduces redundancy among features while retaining those with high relevance to the target (see the call sketch below).
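Inside the library this step is implemented by `FE_remove_variables_using_SULOV_method` in `featurewiz/sulov_method.py`. A hedged call sketch follows; the file name, column names and target below are placeholders, and most users never call this directly because the main featurewiz entry point runs it for them.

```python
import pandas as pd
from featurewiz.sulov_method import FE_remove_variables_using_SULOV_method

df = pd.read_csv('train.csv')          # placeholder: must contain the target column
numvars = ['x1', 'x2', 'x3']           # placeholder numeric candidate columns
selected = FE_remove_variables_using_SULOV_method(
    df, numvars=numvars, modeltype='Regression',
    target='target', corr_limit=0.70, verbose=1)
print(selected)                        # uncorrelated variables with high mutual info
```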
    50 | 51 | ### Recursive XGBoost Method 52 |
    After applying the SULOV method, featurewiz employs a recursive approach using XGBoost's feature importance. This process is repeated multiple times on subsets of data, combining and deduplicating selected features to identify the most impactful ones.
    53 | 54 | ### Comprehensive Encoding and Transformation 55 |
    1. featurewiz allows for extensive customization in how features are encoded and transformed, making it highly adaptable to various types of data.
    56 |
    2. The ability to combine multiple encoding and transformation methods enhances its flexibility and effectiveness in feature engineering.
    57 | 58 | ### Used by PhDs and Researchers and actively maintained 59 |
    1. featurewiz is used by researchers and PhD data scientists around the world; it has been cited 64 times since its release: 60 | 61 | [Google Scholar](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C31&q=featurewiz&btnG=)
    62 |
    2. It is efficient at handling large datasets, making it suitable for a wide range of applications, from small-data to big-data scenarios.
    63 |
    3. It is well documented and comes with a number of examples.
    64 |
    4. It is actively maintained and regularly updated with new features and bug fixes.
    65 | 66 | ## Internals 67 | `featurewiz` has two major internal modules. They are explained below. 68 | ### 1. Feature Engineering module 69 |

      The first step is not absolutely necessary but it can be used to create new features that may or may not be helpful (be careful with automated feature engineering tools!).

      70 | One of the gaps in open-source AutoML tools, and especially in Auto_ViML, has been the lack of the feature engineering capabilities that high-powered competitions such as Kaggle demand. Creating "interaction" variables, adding "group-by" features, or "target-encoding" categorical variables was difficult, and sifting through the resulting hundreds of new features to find the best ones was harder still, leaving the task only to "experts" or "professionals". featurewiz was created to help you in this endeavor.
      71 |

      featurewiz now enables you to add hundreds of such features with a single line of code. Set the "feature_engg" flag to "interactions", "groupby" or "target" and featurewiz will select the best encoders for each of those options and create hundreds (perhaps thousands) of features in one go. Not only that: in the next step, featurewiz will sift through those variables and keep only the least correlated and most relevant features for your model. All in one step!
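For example, a minimal sketch of a featurewiz call with feature engineering turned on might look like the following. The file name and target column are placeholders, and the keyword names follow the commonly documented `featurewiz()` signature; treat them as assumptions and check your installed version.

```python
import pandas as pd
from featurewiz import featurewiz

train = pd.read_csv('train.csv')        # placeholder dataset
out1, out2 = featurewiz(
    train, target='target',             # placeholder target column name
    corr_limit=0.70,                     # SULOV correlation knockout threshold
    feature_engg='interactions',         # or 'groupby' or 'target'
    verbose=1)
# out1/out2 are typically the selected feature names and the transformed
# train dataframe; check your version's documentation to confirm.
```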
      72 | 73 | ![feature_engg](images/feature_engg.jpg) 74 | 75 | ### 2. Feature Selection module 76 |

      The second step is Feature Selection. `featurewiz` uses the MRMR (Minimum Redundancy Maximum Relevance) algorithm as the basis for its feature selection.
      77 | Why perform Feature Selection? Once you have created hundreds of new features, you still have three questions left to answer: 78 | 1. How do we interpret those newly created features? 79 | 2. Which of these features are important and which are useless? How many of them are highly correlated to each other, causing redundancy? 80 | 3. Does the model now overfit on these new features, and does it perform better or worse than before? 81 |
      82 | All are very important questions and featurewiz answers them by using the SULOV method and Recursive XGBoost to reduce features in your dataset to the best "minimum optimal" features for the model.
      83 |

      SULOV: SULOV stands for `Searching for Uncorrelated List of Variables`. The SULOV algorithm is based on the Minimum-Redundancy-Maximum-Relevance (MRMR) algorithm, which is widely regarded as one of the best feature selection methods. To understand how MRMR works and how it differs from `Boruta` and other feature selection methods, see the chart below. Here "Minimal Optimal" refers to MRMR (featurewiz) while "all-relevant" refers to Boruta.
      84 | 85 | ![MRMR_chart](images/MRMR.png) 86 | 87 | ## Working 88 | `featurewiz` performs feature selection in 2 steps. Each step is explained below. 89 | The working of the `SULOV` algorithm is as follows: 90 |

1. Find all the pairs of highly correlated variables exceeding a correlation threshold (say 0.7 in absolute value).
2. Then find their MIS score (Mutual Information Score) to the target variable. MIS is a non-parametric scoring method, so it is suitable for all kinds of variables and targets.
3. Now take each pair of correlated variables and knock off the one with the lower MIS score.
4. What's left are the variables with the highest information scores and the least correlation with each other.
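A minimal, illustrative sketch of the SULOV idea (not featurewiz's internal code), assuming a numeric pandas DataFrame `X` and a classification target `y`:

```
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def sulov_sketch(X: pd.DataFrame, y, corr_limit: float = 0.7):
    """Illustrative SULOV-style selection: for every highly correlated pair,
    drop the feature with the lower mutual information score (MIS)."""
    # Mutual information of each feature with the target
    mis = pd.Series(mutual_info_classif(X, y), index=X.columns)
    corr = X.corr().abs()
    removed = set()
    cols = list(X.columns)
    for i, f1 in enumerate(cols):
        for f2 in cols[i + 1:]:
            if corr.loc[f1, f2] > corr_limit:
                # knock off the feature with the lower MIS score
                removed.add(f1 if mis[f1] < mis[f2] else f2)
    return [c for c in cols if c not in removed]
```

For a regression target, scikit-learn's `mutual_info_regression` would play the same role as `mutual_info_classif` in this sketch.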
![sulov](images/SULOV.jpg)

The working of Recursive XGBoost is as follows: once SULOV has selected variables that have high mutual information scores and the least correlation among them, featurewiz uses XGBoost to repeatedly find the best features among the remaining variables.
1. Select all the variables in the data set and split the full data into train and valid sets.
2. Find the top X features (X could be 10) on train, using valid for early stopping (to prevent over-fitting).
3. Then take the next set of variables and find the top X among them.
4. Do this 5 times. Combine all selected features and de-duplicate them.
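A simplified sketch of this recursive scheme (illustrative only, not the library's exact implementation), assuming a classification problem with a pandas DataFrame `X` and target `y`:

```
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def recursive_xgboost_sketch(X: pd.DataFrame, y, top_x: int = 10, rounds: int = 5):
    """Illustrative recursive-XGBoost selection: fit XGBoost on successive
    chunks of columns and keep the highest-importance features from each."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    selected = []
    for chunk in np.array_split(np.array(X.columns), rounds):
        cols = list(chunk)
        # valid set is used for early stopping, to prevent over-fitting
        model = XGBClassifier(n_estimators=100, early_stopping_rounds=10, verbosity=0)
        model.fit(X_tr[cols], y_tr, eval_set=[(X_val[cols], y_val)], verbose=False)
        importances = pd.Series(model.feature_importances_, index=cols)
        selected += importances.nlargest(top_x).index.tolist()
    # combine all selected features and de-duplicate them
    return sorted(set(selected))
```

featurewiz itself handles regression targets, GPU settings and the details of each round automatically; this sketch only shows the column-chunking idea.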
![xgboost](images/xgboost.jpg)

## Tips

Here are some additional tips for ML engineers and data scientists when using featurewiz:
1. How to cross-validate your results: when you use featurewiz, it automatically performs multiple rounds of feature selection using permutations on the number of columns. However, you can also perform feature selection using permutations of rows, as in the `cross_validate` example that ships with featurewiz (see the sketch after this list).
2. Use multiple feature selection tools: it is a good idea to use multiple feature selection tools and compare the results. This will help you get a better understanding of which features are most important for your data.
3. Don't forget to engineer new features: feature selection is only one part of building a good machine learning model. You should also spend time engineering your features to make them as informative as possible. This can involve creating new features, transforming existing features, and removing irrelevant features.
4. Don't overfit your model: it is important to avoid overfitting your model to the training data. Overfitting occurs when your model learns the noise in the training data rather than the underlying signal. To avoid overfitting, you can use regularization techniques such as lasso or elasticnet.
5. Start with a small number of features: when you are first starting out, it is a good idea to begin with a small number of features. This will help you avoid overfitting your model. As you become more experienced, you can experiment with adding more features.
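A rough sketch of such row-wise cross-validation (a simplified stand-in for the bundled `cross_validate` example, assuming pandas `X_train`/`y_train`):

```
from collections import Counter
from sklearn.model_selection import KFold
from featurewiz import FeatureWiz

# Run feature selection on several row subsets and count how often each
# feature is selected; stable features show up in most folds.
counts = Counter()
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for rows, _ in kf.split(X_train):
    fwiz = FeatureWiz(verbose=0)
    fwiz.fit_transform(X_train.iloc[rows], y_train.iloc[rows])
    counts.update(fwiz.features)

stable_features = [feat for feat, n in counts.items() if n >= 4]
print(stable_features)
```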
## Install

**Prerequisites:**
1. featurewiz is built using xgboost, dask, numpy, pandas and matplotlib. It should run on most Python 3 Anaconda installations. You won't have to import any special libraries other than "dask", "XGBoost" and "networkx". Optionally, it uses LightGBM for fast modeling, which it installs automatically.
2. We use the "networkx" library for charts and interpretability. If you don't have these libraries, featurewiz will install them for you automatically.
To install from source:

```
cd <your_downloads_folder>
git clone git@github.com:AutoViML/featurewiz.git
# or download and unzip https://github.com/AutoViML/featurewiz/archive/master.zip
conda create -n <your_env_name> python=3.7 anaconda
conda activate <your_env_name> # ON WINDOWS: `source activate <your_env_name>`
cd featurewiz
pip install -r requirements.txt
```

## Good News: You can install featurewiz on Colab and Kaggle easily in 2 steps!

As of June 2022, thanks to [arturdaraujo](https://github.com/arturdaraujo), featurewiz is now available on conda-forge. You can try:
```
conda install -c conda-forge featurewiz
```

### If the above conda install fails, you can try installing featurewiz this way:
#### Install featurewiz using git+
```
!pip install git+https://github.com/AutoViML/featurewiz.git
```

## Usage

There are two ways to use featurewiz.
      1. The first way is the new way where you use scikit-learn's `fit and predict` syntax. It also includes the `lazytransformer` library that I created to transform datetime, NLP and categorical variables into numeric variables automatically. We recommend that you use it as the main syntax for all your future needs.
```
from featurewiz import FeatureWiz
fwiz = FeatureWiz(feature_engg='', nrows=None, transform_target=True, scalers="std",
                  category_encoders="auto", add_missing=False, verbose=0)
X_train_selected, y_train = fwiz.fit_transform(X_train, y_train)
X_test_selected = fwiz.transform(X_test)
### get list of selected features ###
fwiz.features
```
2. The second way is the old way and this was the original syntax of featurewiz. It is still used by thousands of researchers in the field, so it will continue to be maintained. However, it could be discontinued at any time without notice. You can use it if you like it.
```
import featurewiz as fwiz
outputs = fwiz.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose=2, sep=',',
                header=0, test_data='', feature_engg='', category_encoders='',
                dask_xgboost_flag=False, nrows=None, skip_sulov=False, skip_xgboost=False)
```

`outputs` is a tuple: there will always be two objects in the output. It can vary:
- In the first case, it can be `features` and `trainm`: `features` is a list (of selected features) and `trainm` is the transformed dataframe (if you sent in train only).
- In the second case, it can be `trainm` and `testm`: two transformed dataframes (with selected features) when you send in both train and test.

In both cases, the features and dataframes are ready for you to do further modeling.

featurewiz works on any multi-class, multi-label data set. So you can have as many target labels as you want. You don't have to tell featurewiz whether it is a Regression or Classification problem. It will decide that automatically.

## API

**Input Arguments for NEW syntax**

Parameters
----------
corr_limit : float, default=0.90
    The correlation limit to consider for feature selection. Features with correlations
    above this limit may be excluded.

verbose : int, default=0
    Level of verbosity in output messages.

feature_engg : str or list, default=''
    Specifies the feature engineering methods to apply, such as 'interactions', 'groupby',
    and 'target'.

category_encoders : str or list, default=''
    Encoders for handling categorical variables. Supported encoders include 'onehot',
    'ordinal', 'hashing', 'count', 'catboost', 'target', 'glm', 'sum', 'woe', 'bdc',
    'loo', 'base', 'james', 'helmert', 'label', 'auto', etc.

add_missing : bool, default=False
    If True, adds indicators for missing values in the dataset.

dask_xgboost_flag : bool, default=False
    If set to True, enables the use of Dask for parallel computing with XGBoost.

nrows : int or None, default=None
    Limits the number of rows to process.

skip_sulov : bool, default=False
    If True, skips the SULOV (Searching for Uncorrelated List of Variables) method in
    feature selection.

skip_xgboost : bool, default=False
    If True, bypasses the recursive XGBoost feature selection.

transform_target : bool, default=False
    When True, transforms the target variable(s) into numeric format if they are not
    already.

scalers : str or None, default=None
    Specifies the scaler to use for feature scaling. Available options include
    'std', 'standard', 'minmax', 'max', 'robust', 'maxabs'.

**Input Arguments for old syntax**

- `dataname`: could be a datapath+filename or a dataframe. It will detect whether your input is a filename or a dataframe and load it automatically.
- `target`: name of the target variable in the data set.
- `corr_limit`: if you want to set your own threshold for removing variables as highly correlated, then give it here. The default is 0.9, which means variables with a Pearson correlation below -0.9 or above 0.9 will be candidates for removal.
- `verbose`: This has 3 possible states:
  - `0` - limited output. Great for running this silently and getting fast results.
  - `1` - verbose. Great for knowing how results were and making changes to flags in input.
  - `2` - more charts such as SULOV and output. Great for finding out what happens under the hood for the SULOV method.
- `test_data`: This is only applicable to the old syntax if you want to transform both train and test data at the same time in the same way. `test_data` could be a datapath+filename or a dataframe. featurewiz will detect whether your input is a filename or a dataframe and load it automatically. Default is the empty string.
- `dask_xgboost_flag`: default False. If you want to use dask with your data, then set this to True.
- `feature_engg`: You can let featurewiz select its best encoders for your data set by setting this flag for adding feature engineering. There are three choices. You can choose one, two, or all three.
  - `interactions`: This will add interaction features to your data such as x1*x2, x2*x3, x1**2, x2**2, etc.
  - `groupby`: This will generate Group By features for your numeric vars by grouping all categorical vars.
  - `target`: This will encode and transform all your categorical features using certain target encoders.

  Default is the empty string (which means no additional features).
- `add_missing`: default is False. This is a new flag: the `add_missing` flag will add a new column of missing-value indicators for all the variables in your dataset. This will help you catch missing values as an added signal.
- `category_encoders`: default is "auto". Instead, you can choose your own category encoders from the list below. We recommend you do not use more than two of these. featurewiz will automatically select only two if you have more than two in your list. You can set "auto" for our own choice or the empty string "" (which means no encoding of your categorical features). These descriptions are derived from the excellent category_encoders python library. Please check it out!
  - `HashingEncoder`: HashingEncoder is a multivariate hashing implementation with configurable dimensionality/precision. The advantage of this encoder is that it does not maintain a dictionary of observed categories. Consequently, the encoder does not grow in size and accepts new values during data scoring by design.
  - `SumEncoder`: SumEncoder is a Sum contrast coding for the encoding of categorical features.
  - `PolynomialEncoder`: PolynomialEncoder is a Polynomial contrast coding for the encoding of categorical features.
  - `BackwardDifferenceEncoder`: BackwardDifferenceEncoder is a Backward difference contrast coding for encoding categorical variables.
  - `OneHotEncoder`: OneHotEncoder is the traditional Onehot (or dummy) coding for categorical features. It produces one feature per category, each being a binary.
  - `HelmertEncoder`: HelmertEncoder uses the Helmert contrast coding for encoding categorical features.
  - `OrdinalEncoder`: OrdinalEncoder uses Ordinal encoding to designate a single column of integers to represent the categories in your data. Integers, however, start in the same order in which the categories are found in your dataset. If you want to change the order, just sort the column and send it in for encoding.
  - `FrequencyEncoder`: FrequencyEncoder is a count encoding technique for categorical features. For a given categorical feature, it replaces the names of the categories with the group counts of each category.
  - `BaseNEncoder`: BaseNEncoder encodes the categories into arrays of their base-N representation. A base of 1 is equivalent to one-hot encoding (not really base-1, but useful), and a base of 2 is equivalent to binary encoding. N=number of actual categories is equivalent to vanilla ordinal encoding.
  - `TargetEncoder`: TargetEncoder performs Target encoding for categorical features. It supports the following kinds of targets: binary and continuous. For multi-class targets, it uses a PolynomialWrapper.
  - `CatBoostEncoder`: CatBoostEncoder performs CatBoost coding for categorical features. It supports the following kinds of targets: binary and continuous. For polynomial target support, it uses a PolynomialWrapper. This is very similar to leave-one-out encoding, but calculates the values "on-the-fly". Consequently, the values naturally vary during the training phase and it is not necessary to add random noise.
  - `WOEEncoder`: WOEEncoder uses the Weight of Evidence technique for categorical features. It supports only one kind of target: binary. For polynomial target support, it uses a PolynomialWrapper. It cannot be used for Regression.
  - `JamesSteinEncoder`: JamesSteinEncoder uses the James-Stein estimator. It supports 2 kinds of targets: binary and continuous. For polynomial target support, it uses PolynomialWrapper. For feature value i, the James-Stein estimator returns a weighted average of: (1) the mean target value for the observed feature value i, and (2) the mean target value (regardless of the feature value).
- `nrows`: default `None`. You can set the number of rows to read from your datafile if it is too large to fit into either dask or pandas. But you won't have to if you use dask.
- `skip_sulov`: default `False`. You can set this flag to skip the SULOV method if you want.
- `skip_xgboost`: default `False`. You can set this flag to skip the Recursive XGBoost method if you want.

**Output values for old syntax** This applies only to the old syntax.
- `outputs`: Output is always a tuple. We can call the two objects in that tuple `out1` and `out2` below.
- `out1` and `out2`: If you sent in just one dataframe or filename as input, you will get:
  1. `features`: a list of selected features, and
  2. `trainm`: a dataframe (if you sent in a file or dataname as input).
- `out1` and `out2`: If you sent in two files or dataframes (train and test), you will get:
  1. `trainm`: a modified train dataframe with engineered and selected features from dataname, and
  2. `testm`: a modified test dataframe with engineered and selected features from test_data.

## Additional

To learn more about how featurewiz works under the hood, watch this [video](https://www.youtube.com/embed/ZiNutwPcAU0)

![background](images/featurewiz_background.jpg)

featurewiz was designed for selecting high-performance variables with the fewest steps. In most cases, featurewiz builds models with 20%-99% fewer features than your original data set with nearly the same or slightly lower performance (this is based on my trials; your experience may vary).
featurewiz is every Data Scientist's feature wizard that will:

1. Automatically pre-process data: you can send in your entire dataframe "as is" and featurewiz will classify and label-encode categorical variables to help XGBoost process them. It classifies variables as numeric, categorical, NLP, or date-time variables automatically so it can use them correctly in modeling.
2. Perform feature engineering automatically: the ability to create "interaction" variables, add "group-by" features, or "target-encode" categorical variables is difficult, and sifting through those hundreds of new features is painstaking and left only to "experts". Now, with featurewiz, you can create hundreds or even thousands of new features with the click of a mouse. This is very helpful when you have a small number of features to start with. However, be careful with this option: you can very easily create a monster.
        3. Perform feature reduction automatically. When you have small data sets and you know your domain well, it is easy to perhaps do EDA and identify which variables are important. But when you have a very large data set with hundreds if not thousands of variables, selecting the best features from your model can mean the difference between a bloated and highly complex model or a simple model with the fewest and most information-rich features. featurewiz uses XGBoost repeatedly to perform feature selection. You must try it on your large data sets and compare!
        4. Explain SULOV method graphically using networkx library so you can see which variables are highly correlated to which ones and which of those have high or low mutual information scores automatically. Just set verbose = 2 to see the graph.
        5. Build a fast XGBoost or LightGBM model using the features selected by featurewiz. There is a function called "simple_lightgbm_model" which you can use to build a fast model. It is a new module, so check it out.
*** Special thanks to fellow open source Contributors ***:
        1. Alex Lekov (https://github.com/Alex-Lekov/AutoML_Alex/tree/master/automl_alex) for his DataBunch and encoders modules which are used by the tool (although with some modifications).
2. Category Encoders library in Python: This is an amazing library. Make sure you read all about the encoders that featurewiz uses here: https://contrib.scikit-learn.org/category_encoders/index.html
        303 | 304 | ## Maintainers 305 | 306 | * [@AutoViML](https://github.com/AutoViML) 307 | 308 | ## Contributing 309 | 310 | See [the contributing file](CONTRIBUTING.md)! 311 | 312 | PRs accepted. 313 | 314 | ## License 315 | 316 | Apache License 2.0 © 2020 Ram Seshadri 317 | 318 | ## DISCLAIMER 319 | This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose. 320 | 321 | 322 | [page]: examples/cross_validate.py 323 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy<2.0 2 | ipython 3 | jupyter 4 | xgboost>=1.6.2,<=1.7.6 5 | pandas>=2.0 6 | matplotlib 7 | seaborn 8 | scipy 9 | scikit-learn>=1.2.2,<=1.5.2 10 | networkx 11 | category_encoders==2.6.3 12 | xlrd>=2.0.0 13 | dask>=2021.11.0 14 | lightgbm>=3.2.1 15 | distributed>=2021.11.0 16 | feather-format>=0.4.1 17 | pyarrow>=7.0.0 18 | fsspec>=0.3.3 19 | Pillow>=9.0.0 20 | tqdm>=4.61.1 21 | numexpr>=2.7.3 22 | tensorflow>=2.5.2 23 | lazytransform>=1.17 24 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import setuptools 4 | 5 | with open("README.md", "r", encoding="utf-8") as fh: 6 | long_description = fh.read() 7 | 8 | setuptools.setup( 9 | name="featurewiz", 10 | version="0.6.1", 11 | author="Ram Seshadri", 12 | author_email="rsesha2001@yahoo.com", 13 | description="Select Best Features from your data set - any size - now with XGBoost!", 14 | long_description=long_description, 15 | long_description_content_type="text/markdown", 16 | license='Apache License 2.0', 17 | url="https://github.com/AutoViML/featurewiz", 18 | packages=setuptools.find_packages(exclude=("tests",)), 19 | install_requires=[ 20 | "numpy<2.0", 21 | "ipython", 22 | "jupyter", 23 | "xgboost>=1.6.2,<=1.7.6", 24 | "pandas>=2.0", 25 | "matplotlib", 26 | "seaborn", 27 | "scipy", 28 | "scikit-learn>=1.2.2,<=1.5.2", 29 | "networkx", 30 | "category_encoders==2.6.3", 31 | "xlrd>=2.0.0", 32 | "dask>=2021.11.0", 33 | "lightgbm>=3.2.1", 34 | "distributed>=2021.11.0", 35 | "feather-format>=0.4.1", 36 | "pyarrow>=7.0.0", 37 | "fsspec>=0.3.3", 38 | "Pillow>=9.0.0", 39 | "tqdm>=4.61.1", 40 | "numexpr>=2.7.3", 41 | "tensorflow>=2.5.2", 42 | "lazytransform>=1.17", 43 | ], 44 | classifiers=[ 45 | "Programming Language :: Python :: 3", 46 | "Operating System :: OS Independent", 47 | ], 48 | ) 49 | -------------------------------------------------------------------------------- /updates.md: -------------------------------------------------------------------------------- 1 | # featurewiz latest updates page 2 | This is the main page where we will post the latest updates to the featurewiz library. Make sure you bookmark this page and upgrade your featurewiz library before you run it. There are new updates almost every week to featurewiz! 3 | 4 | ### Update (Jan 2024): Introducing BlaggingClassifier, built to handle imbalanced distributions! 5 | Featurewiz now introduces The BlaggingClassifier one of the best classifiers ever, built by # Original Author: Gilles Louppe and Licensed in BSD 3 clause with Adaptations by Tom E Fawcett. This is an amazing classifier that everyone must try for their best imbalanced and multi-class problems. 
Don't just take our word for it, try it out yourself and see! 6 | 7 | ## Jan 2024 update 8 | `featurewiz` 5.0 version is out! It contains brand new Deep Learning Auto Encoders to enrich your data for the toughest imbalanced and multi-class datasets. In addition, it has multiple brand-new Classifiers built for imbalanced and multi-class problems such as the `IterativeDoubleClassifier` and the `BlaggingClassifier`. If you are looking for the latest and greatest updates about our library, check out our updates page. 9 |
        10 | 11 | ![IterativeBest](https://i.ibb.co/R2w7WR6/Iterative-Best-Design.png) 12 | 13 | #### Update (December 2023): FeatureWiz 0.5 is here! Includes powerful deep learning autoencoders! 14 | The FeatureWiz transformer now includes powerful deep learning auto encoders in the new `auto_encoders` argument. They will transform your features into a lower dimension space but capturing important patterns inherent in your dataset. This is known as "feature extraction" and is a powerful tool for tackling very difficult problems in classification. You can set the `auto_encoders` option to `VAE` for Variational Auto Encoder, `DAE` for Denoising Auto Encoder, `CNN` for CNN's and `GAN` for GAN data augmentation for generating synthetic data. These options will completely replace your existing features. Suppose you want to add them as additional features? You can do that by setting the `auto_encoders` option to `VAE_ADD`, `DAE_ADD`, and `CNN_ADD` and featurewiz will automatically add these features to your existing dataset. In addition, it will do feature selection among your old and new features. Isn't that awesome? I have uploaded a sample notebook for you to test and improve your Classifier performance using these options for Imbalanced datasets. Please send me comments via email which is displayed on my Github main page. 15 | 16 | ![VAE](https://i.ibb.co/sJsKphR/VAE-model-flowchart.png) 17 | 18 | #### Update (November 2023): The FeatureWiz transformer (version 0.4.3 on) includes an "add_missing" flag 19 | The FeatureWiz transformer now includes an `add_missing` flag which will add a new column for missing values for all your variables in your dataset. This will help you catch missing values as an added signal when you use FeatureWiz library. Try it out and let us know in your comments via email. 20 | 21 | 22 | #### Update (October 2023): FeatureWiz transformer (version 0.4.0 on) now has lazytransformer library 23 | The new FeatureWiz transformer includes a categorical encoder + date-time + NLP transfomer that transforms all your string, text, date-time columns into numeric variables in one step. You will see a fully transformed `(all-numeric)` dataset when you use FeatureWiz transformer. Try it out and let us know in your comments via email. 24 | 25 | #### Update (June 2023): featurewiz now has skip_sulov and skip_xgboost flags 26 | 27 | There are two flags that are available to skip the recursive xgboost and/or SULOV methods. They are the `skip_xgboost` and `skip_sulov` flags. They are by default set to `False`. But you can change them to `True` if you want to skip them. 28 | 29 | #### Update (May 2023): featurewiz 3.0 is here with better accuracy and speed 30 | 31 | The latest version of featurewiz is here! The new 3.0 version of featurewiz provides slightly better performance by about 1-2% in diverse datasets (your experience may vary). Install it and check it out! 32 | 33 | 34 | #### Update (March 2023): XGBoost 1.7 and higher versions have issues with featurewiz 35 | 36 | The latest version of XGBoost 1.7+ does not work with featurewiz. They have made massive changes to their API. So please switch to xgboost 1.5 if you want to run featurewiz. 37 | 38 | #### Update (October 2022): FeatureWiz 2.0 is here. 39 | featurewiz 2.0 is here. You have two small performance improvements: 40 | 1. SULOV method now has a higher correlation limit of 0.90 as default. This means fewer variables are removed and hence more vars are selected. 
You can always set it back to the old limit by setting `corr_limit`=0.70 if you want. 41 |
        42 | 2. Recursive XGBoost algorithm is tighter in that it selects fewer features in each iteration. To see how many it selects, set `verbose` flag to 1.
        43 | The net effect is that the same number of features are selected but they are better at producing more accurate models. Try it out and let us know. 44 | 45 | #### Update (September 2022): You can now skip SULOV method using skip_sulov flag 46 | featurewiz now has a new input: `skip_sulov` flag is here. You can set it to `True` to skip the SULOV method if needed. 47 | 48 | #### Update (August 2022): Silent mode with verbose=0 49 | featurewiz now has a "silent" mode which you can set using the "verbose=0" option. It will run silently with no charts or graphs and very minimal verbose output. Hope this helps!
        50 | 51 | #### Update (May 2022): New high performance modules based on XGBoost and LightGBM 52 | featurewiz as of version 0.1.50 or higher has multiple high performance models that you can use to build highly performant models once you have completed feature selection. These models are based on LightGBM and XGBoost and have even Stacking and Blending ensembles. You can find them as functions starting with "simple_" and "complex_" under featurewiz. All the best!
        53 | 54 | #### Update (March 2022): Ability to read feather format files 55 | featurewiz as of version 0.1.04 or higher can read `feather-format` files at blazing speeds. See example below on how to convert your CSV files to feather. Then you can feed those '.ftr' files to featurewiz and it will read it 10-100X faster!
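As a rough illustration (not part of featurewiz itself), converting a CSV file to feather with pandas and pyarrow might look like this, assuming a hypothetical file named `train.csv`:

```
import pandas as pd

# One-time conversion: read the CSV and save it in feather format.
# Requires pyarrow, which is already listed in featurewiz's requirements.
df = pd.read_csv("train.csv")
df.to_feather("train.ftr")

# Later reads of the .ftr file are much faster than re-parsing the CSV.
df = pd.read_feather("train.ftr")
```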
        56 | 57 | 58 | ![feather_example](./images/feather_example.jpg) 59 | 60 | featurewiz now runs at blazing speeds thanks to using GPU's by default. So if you are running a large data set on Colab and/or Kaggle, make sure you turn on the GPU kernels. featurewiz will automatically detect that GPU is turned on and will utilize XGBoost using GPU-hist. That will ensure it will crunch your datasets even faster. I have tested it with a very large data set and it reduced the running time from 52 mins to 1 minute! That's a 98% reduction in running time using GPU compared to CPU!
        61 | 62 | ### Update (Jan 2022): FeatureWiz is now a sklearn-compatible transformer that you can use in data pipelines 63 | FeatureWiz as of version 0.0.90 or higher is a scikit-learn compatible feature selection transformer. You can perform fit and predict as follows. You will get a Transformer that can select the top variables from your dataset. You can also use it in sklearn pipelines as a Transformer. 64 | 65 | ``` 66 | from featurewiz import FeatureWiz 67 | features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', 68 | dask_xgboost_flag=False, nrows=None, verbose=2) 69 | X_train_selected, y_train = features.fit_transform(X_train, y_train) 70 | X_test_selected = features.transform(X_test) 71 | features.features ### provides the list of selected features ### 72 | ``` 73 | 74 | ### Featurewiz is now upgraded with XGBOOST 1.5.1 for DASK for blazing fast performance even for very large data sets! Set `dask_xgboost_flag = True` to run dask + xgboost. 75 | featurewiz now runs with a default setting of `nrows=None`. This means it will run using all rows. But if you want it to run faster, then you can change `nrows` to 1000 or whatever, so it will sample that many rows and run. 76 | 77 | ### Featurewiz has lots of new fast model training functions that you can use to train highly performant models with the features selected by featurewiz. They are: 78 | 1. simple_LightGBM_model() - simple regression and classification with one target label.
        79 | 2. simple_XGBoost_model() - simple regression and classification with one target label.
        80 | 3. complex_LightGBM_model() - more complex multi-label and multi-class models.
        81 | 4. complex_XGBoost_model() - more complex multi-label and multi-class models.
        82 | 5. Stacking_Classifier(): Stacking model that can handle multi-label, multi-class problems.
        83 | 6. Stacking_Regressor(): Stacking model that can handle multi-label, regression problems.
        84 | 7. Blending_Regressor(): Blending model that can handle multi-label, regression problems.
        85 | 86 | --------------------------------------------------------------------------------