├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── examples
│   ├── .ipynb_checkpoints
│   │   └── featurewiz_classification-checkpoint.ipynb
│   ├── Best_Pipeline_Featurewiz.ipynb
│   ├── FeatureWiz_Interaction_Target_Feature_Engineering_Example.ipynb
│   ├── FeatureWiz_Test.ipynb
│   ├── Featurewiz_Medium_Blogpost.ipynb
│   ├── Featurewiz_on_2000_variables.ipynb
│   ├── affairs_multiclass.csv
│   ├── boston.csv
│   ├── car_sales.csv
│   ├── cross_validate.py
│   ├── featurewiz_autoencoders_demo.ipynb
│   ├── featurewiz_classification.ipynb
│   ├── featurewiz_regression_multi_target.ipynb
│   ├── heart.csv
│   └── winequality.csv
├── featurewiz
│   ├── __init__.py
│   ├── __version__.py
│   ├── auto_encoders.py
│   ├── blagging.py
│   ├── classify_method.py
│   ├── databunch.py
│   ├── encoders.py
│   ├── featurewiz.py
│   ├── ml_models.py
│   ├── my_encoders.py
│   ├── settings.py
│   ├── stacking_models.py
│   └── sulov_method.py
├── images
│   ├── MRMR.png
│   ├── SULOV.jpg
│   ├── feather_example.jpg
│   ├── feature_engg.png
│   ├── feature_engg_old.jpg
│   ├── featurewiz_background.jpg
│   ├── featurewiz_logo.jpg
│   ├── featurewiz_logos.png
│   ├── featurewiz_logos_old.png
│   ├── featurewiz_mrmr.png
│   └── xgboost.jpg
├── old_README.md
├── requirements.txt
├── setup.py
└── updates.md
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Contributor Covenant Code of Conduct
2 |
3 | ## Our Pledge
4 |
5 | We as members, contributors, and leaders pledge to make participation in our
6 | community a harassment-free experience for everyone, regardless of age, body
7 | size, visible or invisible disability, ethnicity, sex characteristics, gender
8 | identity and expression, level of experience, education, socio-economic status,
9 | nationality, personal appearance, race, caste, color, religion, or sexual
10 | identity and orientation.
11 |
12 | We pledge to act and interact in ways that contribute to an open, welcoming,
13 | diverse, inclusive, and healthy community.
14 |
15 | ## Our Standards
16 |
17 | Examples of behavior that contributes to a positive environment for our
18 | community include:
19 |
20 | * Demonstrating empathy and kindness toward other people
21 | * Being respectful of differing opinions, viewpoints, and experiences
22 | * Giving and gracefully accepting constructive feedback
23 | * Accepting responsibility, apologizing to those affected by our mistakes,
24 | and learning from the experience
25 | * Focusing on what is best not just for us as individuals, but for the overall
26 | community
27 |
28 | Examples of unacceptable behavior include:
29 |
30 | * The use of sexualized language or imagery, and sexual attention or advances of
31 | any kind
32 | * Trolling, insulting or derogatory comments, and personal or political attacks
33 | * Public or private harassment
34 | * Publishing others' private information, such as a physical or email address,
35 | without their explicit permission
36 | * Other conduct that could reasonably be considered inappropriate in a
37 | professional setting
38 |
39 | ## Enforcement Responsibilities
40 |
41 | Community leaders are responsible for clarifying and enforcing our standards of
42 | acceptable behavior and will take appropriate and fair corrective action in
43 | response to any behavior that they deem inappropriate, threatening, offensive,
44 | or harmful.
45 |
46 | Community leaders have the right and responsibility to remove, edit, or reject
47 | comments, commits, code, wiki edits, issues, and other contributions that are
48 | not aligned to this Code of Conduct, and will communicate reasons for moderation
49 | decisions when appropriate.
50 |
51 | ## Scope
52 |
53 | This Code of Conduct applies within all community spaces and also applies when
54 | an individual is officially representing the community in public spaces.
55 | Examples of representing our community include using an official e-mail address,
56 | posting via an official social media account, or acting as an appointed
57 | representative at an online or offline event.
58 |
59 | ## Enforcement
60 |
61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
62 | reported to the community leaders responsible for enforcement.
63 | All complaints will be reviewed and investigated promptly and fairly.
64 |
65 | All community leaders are obligated to respect the privacy and security of the
66 | reporter of any incident.
67 |
68 | ## Enforcement Guidelines
69 |
70 | Community leaders will follow these Community Impact Guidelines in determining
71 | the consequences for any action they deem in violation of this Code of Conduct:
72 |
73 | ### 1. Correction
74 |
75 | **Community Impact**: Use of inappropriate language or other behavior deemed
76 | unprofessional or unwelcome in the community.
77 |
78 | **Consequence**: A private, written warning from community leaders, providing
79 | clarity around the nature of the violation and an explanation of why the
80 | behavior was inappropriate. A public apology may be requested.
81 |
82 | ### 2. Warning
83 |
84 | **Community Impact**: A violation through a single incident or series of
85 | actions.
86 |
87 | **Consequence**: A warning with consequences for continued behavior. No
88 | interaction with the people involved, including unsolicited interaction with
89 | those enforcing the Code of Conduct, for a specified period of time. This
90 | includes avoiding interactions in community spaces as well as external channels
91 | like social media. Violating these terms may lead to a temporary or permanent
92 | ban.
93 |
94 | ### 3. Temporary Ban
95 |
96 | **Community Impact**: A serious violation of community standards, including
97 | sustained inappropriate behavior.
98 |
99 | **Consequence**: A temporary ban from any sort of interaction or public
100 | communication with the community for a specified period of time. No public or
101 | private interaction with the people involved, including unsolicited interaction
102 | with those enforcing the Code of Conduct, is allowed during this period.
103 | Violating these terms may lead to a permanent ban.
104 |
105 | ### 4. Permanent Ban
106 |
107 | **Community Impact**: Demonstrating a pattern of violation of community
108 | standards, including sustained inappropriate behavior, harassment of an
109 | individual, or aggression toward or disparagement of classes of individuals.
110 |
111 | **Consequence**: A permanent ban from any sort of public interaction within the
112 | community.
113 |
114 | ## Attribution
115 |
116 | This Code of Conduct is adapted from the [Contributor Covenant][homepage],
117 | version 2.1, available at
118 | [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].
119 |
120 | Community Impact Guidelines were inspired by
121 | [Mozilla's code of conduct enforcement ladder][Mozilla CoC].
122 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing
2 |
3 | We welcome contributions from anyone, beginner or advanced. Before working on a feature, please:
4 |
5 | * Search through the past issues, your concern may have been raised by others in the past. Check through the
6 | closed issues as well.
7 | * If there is no open issue for your feature request, please open one up to coordinate all collaborators.
8 | * Write your feature.
9 | * Submit a pull request on this repo with:
10 | * A brief description
11 | * **Detail of the expected change(s) in behavior**
12 | * How to test it (if it's not obvious)
13 |
14 | Ask someone to test it.
15 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # featurewiz
2 | 🔥 FeatureWiz, the ultimate feature selection library, is powered by the renowned Minimum Redundancy Maximum Relevance (MRMR) algorithm. Learn more about it below.
3 |
4 | 
5 |
6 | # Table of Contents
7 |
23 |
24 | ## Latest Update (Jan 2025)
25 |
26 | - featurewiz has been upgraded to version 0.6.
27 | Versions above 0.6 run on Python 3.12 or greater and also on pandas 2.0.
28 | - This is a huge upgrade for those working in Colab, Kaggle and other up-to-date kernels.
29 | - Please make sure you check the `requirements.txt` file to know which versions are recommended.
30 |
31 |
32 | ## Citation
33 | If you use featurewiz in your research project or paper, please use the following format for citations:
34 | "Seshadri, Ram (2020). GitHub - AutoViML/featurewiz: Use advanced feature engineering strategies and select the best features from your data set fast with a single line of code. source code: https://github.com/AutoViML/featurewiz"
35 | Current citations for featurewiz
36 |
37 | [Google Scholar citations for featurewiz](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C31&q=featurewiz&btnG=)
38 |
39 | ## Highlights
40 | `featurewiz` is the best feature selection library for boosting your machine learning performance with minimal effort and maximum relevance using the famous MRMR algorithm.
41 |
42 | ### What Makes FeatureWiz Stand Out? 🔍
43 | ✔️ Automatically select the most relevant features without specifying a number
44 | 🚀 Provides the fastest and best implementation of the MRMR algorithm
45 | 🎯 Provides a built-in transformer (lazytransform library) that converts all features to numeric
46 | 📚 Includes deep learning models such as Variational Auto Encoders to capture complex interactions in your data
47 | 📝 Provides feature engineering in addition to feature selection - all with one single API call!
48 |
49 | ### Simple tips for success using featurewiz 💡
50 | 📈 First create additional features using the feature engineering module
51 | 🌐 Compare featurewiz against other feature selection methods for best performance
52 | ⚖️ Avoid overfitting by cross-validating your results, as shown in the [cross_validate example][page]
53 | 🎯 Try adding auto-encoders for additional features that may help boost performance
54 |
55 | ### Feature Engineering
56 | Create new features effortlessly with a single line of code! featurewiz enables you to generate hundreds of interaction, group-by, target-encoded features and higher order features, eliminating the need for expert-level knowledge to create your own features. Now you can create even deep learning based features such as Variational Auto Encoders to capture complex interactions hidden among your features. See the latest page for more information on this amazing feature.
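
For illustration, here is a minimal sketch of turning on these feature engineering options with the scikit-learn style syntax shown later in this README (the dataframe names are placeholders you supply):

```
from featurewiz import FeatureWiz

# Sketch: create interaction and group-by features, then let featurewiz
# keep only the most relevant of the original + engineered features.
fwiz = FeatureWiz(feature_engg=['interactions', 'groupby'], transform_target=True, verbose=0)
X_train_selected, y_train_transformed = fwiz.fit_transform(X_train, y_train)  # X_train, y_train: your data
X_test_selected = fwiz.transform(X_test)
print(fwiz.features)  # names of the selected (original or engineered) features
```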
57 |
58 | ### What is MRMR?
59 | featurewiz provides one of the best automatic feature selection algorithms, MRMR, described by Wikipedia as follows: "The MRMR feature selection algorithm has been found to be more powerful than other feature selection algorithms such as Boruta".
60 |
61 | In addition, other researchers have compared MRMR against multiple feature selection algorithms and found MRMR to be the best.
62 |
63 | 
64 |
65 | ### How does MRMR feature selection work?🔍
66 | After creating new features, featurewiz uses the MRMR algorithm to answer crucial questions: Which features are important? Are they redundant or mutually correlated? Will your model suffer or benefit from adding all of them? To answer these questions, featurewiz uses two crucial steps in MRMR:
67 |
68 | ⚙️ The SULOV Algorithm: SULOV means "Searching for Uncorrelated List of Variables". It is a fast algorithm that removes mutually correlated features so that you're left with only the most non-redundant (un-correlated) features. It uses the Mutual Information Score to accomplish this feat.
69 |
70 | ⚙️ Recursive XGBoost: Second, featurewiz uses XGBoost's feature importance scores, computed repeatedly on smaller and smaller feature sets, to identify the most relevant features for your task among the variables remaining after the SULOV step.
71 |
72 | ### Advanced Feature Engineering Options
73 |
74 | featurewiz extends traditional feature selection to the realm of deep learning using Auto Encoders, including Denoising Auto Encoders (DAEs), Variational Auto Encoders (VAEs), CNNs (Convolutional Neural Networks) and GANs (Generative Adversarial Networks) for additional feature extraction, especially on imbalanced datasets. Just set the `auto_encoders` argument to 'VAE_ADD' or 'DAE_ADD' to add these extra features to your data.
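
For example, based on the `auto_encoders` argument documented in the API section below, adding VAE-extracted features might look like this (a sketch with placeholder dataframe names; per the sentence above, the '_ADD' variants add the extracted features alongside your original ones rather than replacing them):

```
from featurewiz import FeatureWiz

# Sketch: ask featurewiz to append Variational Auto Encoder features before selection.
fwiz = FeatureWiz(auto_encoders='VAE_ADD', ae_options={}, transform_target=True, verbose=0)
X_train_selected, y_train_transformed = fwiz.fit_transform(X_train, y_train)
X_test_selected = fwiz.transform(X_test)
```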
75 |
76 |
77 |
78 | In addition, we include:
79 | - A variety of category encoders like HashingEncoder, SumEncoder, PolynomialEncoder, BackwardDifferenceEncoder, OneHotEncoder, HelmertEncoder, OrdinalEncoder, and BaseNEncoder.
80 | - The ability to add interaction features (e.g., x1*x2, x2*x3, x1^2), polynomial features (X**2, X**3), group-by features, and target-encoding features.
81 |
82 | ### Examples and Updates
83 | - featurewiz is well-documented, and it comes with a number of examples
84 | - featurewiz is actively maintained, and it is regularly updated with new features and bug fixes
85 |
86 | ## Workings
87 | `featurewiz` has two major modules to transform your Data Science workflow:
88 | 1. Feature Engineering Module
89 |
90 | 
91 |
92 | - Advanced Feature Creation: Use Deep Learning-based Auto Encoders and GANs to extract features to add to your data. These powerful capabilities will help you solve your toughest problems.
93 | - Options for Enhancement: Use the "interactions", "groupby", or "target" flags to enable advanced feature engineering techniques.
94 | - Kaggle-Ready: Designed to meet the high standards of feature engineering required in competitive data science, like Kaggle.
95 | - Efficient and User-Friendly: Generate and sift through thousands of features, selecting only the most impactful ones for your model.
96 |
97 | 
98 |
99 | 2. Feature Selection Module
100 | - MRMR Algorithm: Employs Minimum Redundancy Maximum Relevance (MRMR) for effective feature selection.
101 | - SULOV Method: Stands for 'Searching for Uncorrelated List of Variables', ensuring low redundancy and high relevance in feature selection.
102 | - Addressing Key Questions: Helps interpret new features, assess their importance, and evaluate the model's performance with these features.
103 | - Optimal Feature Subset: Uses Recursive XGBoost in combination with SULOV to identify the most critical features, reducing overfitting and improving model interpretability.
104 |
105 | #### Comparing featurewiz to Boruta:
106 | Featurewiz uses what is known as a `Minimal Optimal` algorithm (MRMR), while Boruta uses an `All-Relevant` approach. To understand how featurewiz's MRMR approach differs from Boruta's 'All-Relevant' approach, study the chart below. It shows how the SULOV algorithm performs MRMR feature selection, which yields a smaller feature set than Boruta's larger, all-relevant set.
107 |
108 | One of the weaknesses of Boruta is that its selected set retains redundant (highly correlated) features, which can hamper model performance, while featurewiz removes them.
109 |
110 | 
111 |
112 | Transform your feature engineering and selection process with featurewiz - the tool that brings expert-level capabilities to your fingertips!
113 |
114 | ## Working
115 | `featurewiz` performs feature selection in 2 steps. Each step is explained below.
116 | The working of the `SULOV` algorithm is as follows:
117 |
118 | - Find all the pairs of highly correlated variables exceeding a correlation threshold (say absolute(0.7)).
119 | - Then find their MIS score (Mutual Information Score) against the target variable. MIS is a non-parametric scoring method, so it is suitable for all kinds of variables and targets.
120 | - Now take each pair of correlated variables (using Pearson coefficient higher than the threshold above), and then eliminate the feature with the lower MIS score from the pair. Do this repeatedly with each pair until no feature pair is left to analyze.
121 | - What’s left after this step are the features with the highest Information score and the least Pearson correlation with each other.
122 |
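To make the steps concrete, here is a minimal sketch of the SULOV idea, assuming an all-numeric feature matrix and a classification target. This is only an illustration; featurewiz's own implementation (see `featurewiz/sulov_method.py`) is more general, handles regression targets, and is optimized for speed.

```
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def sulov_sketch(X: pd.DataFrame, y, corr_threshold: float = 0.7) -> list:
    """Drop the lower-MIS member of every highly correlated feature pair."""
    # Mutual Information Score of each feature against the target
    mis = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
    corr = X.corr().abs()
    removed = set()
    cols = list(X.columns)
    for i, f1 in enumerate(cols):
        for f2 in cols[i + 1:]:
            if corr.loc[f1, f2] > corr_threshold:
                # keep the feature with the higher MIS, mark the other for removal
                removed.add(f2 if mis[f1] >= mis[f2] else f1)
    return [c for c in cols if c not in removed]
```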
123 |
124 | 
125 |
126 | The working of Recursive XGBoost is as follows (a rough code sketch follows these steps):
127 | Once SULOV has selected variables that have high mutual information scores with the least correlation among them, featurewiz uses XGBoost to repeatedly find the best features among the remaining variables after SULOV.
128 |
129 | - Select all variables in the data set and split the full data into train and validation sets.
130 | - Find the top X features (say, 10) on the train set, using the validation set for early stopping (to prevent over-fitting).
131 | - Then take the next set of variables and find the top X among them.
132 | - Do this 5 times. Combine all selected features and de-duplicate them.
133 |
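Below is a rough sketch of that loop for a classification target with an already label-encoded `y`. It is a simplification for illustration only: featurewiz's actual implementation uses a train/validation split with early stopping and different model settings, and for regression you would swap in `XGBRegressor`.

```
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

def recursive_xgboost_sketch(X: pd.DataFrame, y, n_rounds: int = 5, top_x: int = 10) -> list:
    """Fit XGBoost on successive chunks of columns and keep the top-X features each time."""
    selected = []
    # split the remaining-after-SULOV columns into n_rounds chunks ("the next set of vars")
    for cols in np.array_split(np.array(X.columns), n_rounds):
        cols = list(cols)
        if not cols:
            continue
        model = XGBClassifier(n_estimators=100, verbosity=0).fit(X[cols], y)
        importances = pd.Series(model.feature_importances_, index=cols)
        selected.extend(importances.sort_values(ascending=False).head(top_x).index)
    # combine all selected features and de-duplicate them, preserving order
    return list(dict.fromkeys(selected))
```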
134 |
135 | 
136 |
137 | ## Tips
138 | Here are some additional tips for ML engineers and data scientists when using featurewiz:
139 |
140 | - How to cross-validate your results: When you use featurewiz, we automatically perform multiple rounds of feature selection using permutations on the number of columns. However, you can also perform feature selection using permutations of rows, as shown in the [cross_validate example][page].
141 | - Use multiple feature selection tools: It is a good idea to use multiple feature selection tools and compare the results. This will help you to get a better understanding of which features are most important for your data.
142 | - Don't forget to use Auto Encoders!: Autoencoders are like skilled artists who can draw a quick sketch of a complex picture. They learn to capture the essence of the data and then recreate it with as few strokes as possible. This process helps in understanding and compressing data efficiently.
143 | - Don't overfit your model: It is important to avoid overfitting your model to the training data. Overfitting occurs when your model learns the noise in the training data, rather than the underlying signal. To avoid overfitting, you can use regularization techniques, such as lasso or elasticnet.
144 | - Start with a small number of features: When you are first starting out, it is a good idea to start with a small number of features. This will help you to avoid overfitting your model. As you become more experienced, you can experiment with adding more features.
145 |
146 |
147 | ## Install
148 |
149 | **Prerequisites:**
150 |
151 | - featurewiz is built using xgboost, dask, numpy, pandas and matplotlib. It should run on most Python 3 Anaconda installations. You won't have to install any special libraries other than "dask", "xgboost" and "networkx". Optionally, it uses LightGBM for fast modeling, which it installs automatically.
152 | - We use the "networkx" library for charts and interpretability. But if you don't have these libraries, featurewiz will install them for you automatically.
153 |
154 |
155 | In Kaggle notebooks, you need to install featurewiz like this (otherwise there will be errors):
156 | ```
157 | !pip install featurewiz
158 | !pip install Pillow==9.0.0
159 | !pip install xlrd --ignore-installed --no-deps
160 | !pip install "executing>0.10.0"
161 | ```
162 |
163 | To install from source:
164 |
165 | ```
166 | cd
167 | git clone git@github.com:AutoViML/featurewiz.git
168 | # or download and unzip https://github.com/AutoViML/featurewiz/archive/master.zip
169 | conda create -n <your_env_name> python=3.7 anaconda
170 | conda activate <your_env_name> # ON WINDOWS: `source activate <your_env_name>`
171 | cd featurewiz
172 | pip install -r requirements.txt
173 | ```
174 |
175 | ## Good News: You can install featurewiz on Colab and Kaggle easily in 2 steps!
176 | As of June 2022, thanks to [arturdaraujo](https://github.com/arturdaraujo), featurewiz is now available on conda-forge. You can try:
177 |
178 | ```
179 | conda install -c conda-forge featurewiz
180 | ```
181 |
182 | ### If the above conda install fails, you can try installing featurewiz this way:
183 | #### Install featurewiz using git+
184 |
185 | ```
186 | !pip install git+https://github.com/AutoViML/featurewiz.git
187 | ```
188 |
189 | ## Usage
190 |
191 | There are two ways to use featurewiz.
192 |
193 | - The first way is the new way, where you use scikit-learn's `fit` and `transform` syntax. It also includes the `lazytransform` library that I created to transform datetime, NLP and categorical variables into numeric variables automatically. We recommend that you use it as the main syntax for all your future needs.
194 |
195 | ```
196 | from featurewiz import FeatureWiz
197 | fwiz = FeatureWiz(feature_engg = '', nrows=None, transform_target=True, scalers="std",
198 | category_encoders="auto", add_missing=False, verbose=0, imbalanced=False,
199 | ae_options={})
200 | X_train_selected, y_train = fwiz.fit_transform(X_train, y_train)
201 | X_test_selected = fwiz.transform(X_test)
202 | ### get list of selected features ###
203 | fwiz.features
204 | ```
205 |
206 | - The second way is the old way, and this was the original syntax of featurewiz. It is still used by thousands of researchers in the field, so it will continue to be maintained. However, it may be discontinued at any time without notice. You can use it if you like it.
207 |
208 | ```
209 | import featurewiz as fwiz
210 | outputs = fwiz.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose=2, sep=',',
211 | header=0, test_data='',feature_engg='', category_encoders='',
212 | dask_xgboost_flag=False, nrows=None, skip_sulov=False, skip_xgboost=False)
213 | ```
214 |
215 | `outputs` is a tuple: there will always be two objects in the output. They can vary:
216 | - In the first case, they can be `features` and `trainm`: `features` is a list (of selected features) and `trainm` is the transformed dataframe (if you sent in train data only).
217 | - In the second case, they can be `trainm` and `testm`: two transformed dataframes with the selected features, when you send in both train and test data.
218 |
219 | In both cases, the features and dataframes are ready for you to do further modeling.
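
As a quick illustration, unpacking the tuple in each case looks like this:

```
### Case 1: you passed in only train data ###
features, trainm = outputs   # list of selected features, transformed train dataframe

### Case 2: you passed in both train and test data ###
trainm, testm = outputs      # transformed train and test dataframes with the selected features
```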
220 |
221 | Featurewiz works on any multi-class, multi-label data set. So you can have as many target labels as you want.
222 | You don't have to tell Featurewiz whether it is a Regression or Classification problem. It will decide that automatically.
223 |
224 | ## API
225 |
226 | **Input Arguments for NEW syntax**
227 |
228 | Parameters
229 | ----------
230 | corr_limit : float, default=0.90
231 | The correlation limit to consider for feature selection. Features with correlations
232 | above this limit may be excluded.
233 |
234 | verbose : int, default=0
235 | Level of verbosity in output messages.
236 |
237 | feature_engg : str or list, default=''
238 | Specifies the feature engineering methods to apply, such as 'interactions', 'groupby',
239 | and 'target'.
240 |
241 | auto_encoders : str or list, default=''
242 | Several new options have been added to `auto_encoders` (starting in version 0.5.0): `DAE`, `VAE`, `DAE_ADD`, `VAE_ADD`, `CNN`, `CNN_ADD` and `GAN`. These are deep learning auto encoders (using tensorflow and keras) that can extract the most important patterns in your data and either replace your features or add them as extra features to your data. Try them for your toughest ML problems! See the notebooks folder for examples.
243 |
244 | ae_options : dict, default={}
245 | You can provide a dictionary for tuning auto encoders above. Supported auto encoders include 'dae',
246 | 'vae', and 'gan'. You must use the `help` function to see how to send a dict to each auto encoder. You can also check out this Auto Encoder demo notebook
247 |
248 | category_encoders : str or list, default=''
249 | Encoders for handling categorical variables. Supported encoders include 'onehot',
250 | 'ordinal', 'hashing', 'count', 'catboost', 'target', 'glm', 'sum', 'woe', 'bdc',
251 | 'loo', 'base', 'james', 'helmert', 'label', 'auto', etc.
252 |
253 | add_missing : bool, default=False
254 | If True, adds indicators for missing values in the dataset.
255 |
256 | dask_xgboost_flag : bool, default=False
257 | If set to True, enables the use of Dask for parallel computing with XGBoost.
258 |
259 | nrows : int or None, default=None
260 | Limits the number of rows to process.
261 |
262 | skip_sulov : bool, default=False
263 | If True, skips the application of the Super Learning Optimized (SULO) method in
264 | feature selection.
265 |
266 | skip_xgboost : bool, default=False
267 | If True, bypasses the recursive XGBoost feature selection.
268 |
269 | transform_target : bool, default=False
270 | When True, transforms the target variable(s) into numeric format if they are not
271 | already.
272 |
273 | scalers : str or None, default=None
274 | Specifies the scaler to use for feature scaling. Available options include
275 | 'std', 'standard', 'minmax', 'max', 'robust', 'maxabs'.
276 |
277 | imbalanced : bool, default=False
278 | Specifies whether to use the SMOTE technique for imbalanced datasets.
279 |
280 | **Input Arguments for old syntax**
281 |
282 | - `dataname`: could be a datapath+filename or a dataframe. It will detect whether your input is a filename or a dataframe and load it automatically.
283 | - `target`: name of the target variable in the data set.
284 | - `corr_limit`: if you want to set your own threshold for removing variables as highly correlated, then give it here. The default is 0.9, which means variables with a Pearson correlation less than -0.9 or greater than 0.9 will be candidates for removal.
285 | - `verbose`: This has 3 possible states:
286 | - `0` - limited output. Great for running this silently and getting fast results.
287 | - `1` - verbose. Great for understanding how the results were arrived at and for adjusting input flags.
288 | - `2` - more charts such as SULOV and output. Great for finding out what happens under the hood for SULOV method.
289 | - `test_data`: This is only applicable to the old syntax if you want to transform both train and test data at the same time in the same way. `test_data` could be the name of a datapath+filename or a dataframe. featurewiz will detect whether your input is a filename or a dataframe and load it automatically. Default is empty string.
290 | - `dask_xgboost_flag`: default False. If you want to use dask with your data, then set this to True.
291 | - `feature_engg`: You can let featurewiz select its best encoders for your data set by setting this flag
292 | for adding feature engineering. There are three choices. You can choose one, two, or all three.
293 | - `interactions`: This will add interaction features to your data such as x1*x2, x2*x3, x1**2, x2**2, etc.
294 | - `groupby`: This will generate Group By features to your numeric vars by grouping all categorical vars.
295 | - `target`: This will encode and transform all your categorical features using certain target encoders.
296 | Default is empty string (which means no additional features)
297 | - `add_missing`: default is False. This is a new flag: the `add_missing` flag will add a new column for missing values for all your variables in your dataset. This will help you catch missing values as an added signal.
298 | - `category_encoders`: default is "auto". Instead, you can choose your own category encoders from the list below.
299 | We recommend you do not use more than two of these. Featurewiz will automatically select only two if you have more than two in your list. You can set "auto" for our own choice, or the empty string "" (which means no encoding of your categorical features). These descriptions are derived from the excellent category_encoders python library. Please check it out!
300 | - `HashingEncoder`: HashingEncoder is a multivariate hashing implementation with configurable dimensionality/precision. The advantage of this encoder is that it does not maintain a dictionary of observed categories. Consequently, the encoder does not grow in size and accepts new values during data scoring by design.
301 | - `SumEncoder`: SumEncoder is a Sum contrast coding for the encoding of categorical features.
302 | - `PolynomialEncoder`: PolynomialEncoder is a Polynomial contrast coding for the encoding of categorical features.
303 | - `BackwardDifferenceEncoder`: BackwardDifferenceEncoder is a Backward difference contrast coding for encoding categorical variables.
304 | - `OneHotEncoder`: OneHotEncoder is the traditional Onehot (or dummy) coding for categorical features. It produces one feature per category, each being a binary.
305 | - `HelmertEncoder`: HelmertEncoder uses the Helmert contrast coding for encoding categorical features.
306 | - `OrdinalEncoder`: OrdinalEncoder uses Ordinal encoding to designate a single column of integers to represent the categories in your data. Integers however start in the same order in which the categories are found in your dataset. If you want to change the order, just sort the column and send it in for encoding.
307 | - `FrequencyEncoder`: FrequencyEncoder is a count encoding technique for categorical features. For a given categorical feature, it replaces the names of the categories with the group counts of each category.
308 | - `BaseNEncoder`: BaseNEncoder encodes the categories into arrays of their base-N representation. A base of 1 is equivalent to one-hot encoding (not really base-1, but useful), and a base of 2 is equivalent to binary encoding. N=number of actual categories is equivalent to vanilla ordinal encoding.
309 | - `TargetEncoder`: TargetEncoder performs Target encoding for categorical features. It supports the following kinds of targets: binary and continuous. For multi-class targets, it uses a PolynomialWrapper.
310 | - `CatBoostEncoder`: CatBoostEncoder performs CatBoost coding for categorical features. It supports the following kinds of targets: binary and continuous. For polynomial target support, it uses a PolynomialWrapper. This is very similar to leave-one-out encoding, but calculates the values “on-the-fly”. Consequently, the values naturally vary during the training phase and it is not necessary to add random noise.
311 | - `WOEEncoder`: WOEEncoder uses the Weight of Evidence technique for categorical features. It supports only one kind of target: binary. For polynomial target support, it uses a PolynomialWrapper. It cannot be used for Regression.
312 | - `JamesSteinEncoder`: JamesSteinEncoder uses the James-Stein estimator. It supports 2 kinds of targets: binary and continuous. For polynomial target support, it uses PolynomialWrapper.
313 | For feature value i, James-Stein estimator returns a weighted average of:
314 | The mean target value for the observed feature value i.
315 | The mean target value (regardless of the feature value).
316 | - `nrows`: default `None`. You can set the number of rows to read from your datafile if it is too large to fit into either dask or pandas. But you won't have to if you use dask.
317 | - `skip_sulov`: default `False`. You can set the flag to skip the SULOV method if you want.
318 | - `skip_xgboost`: default `False`. You can set the flag to skip the Recursive XGBoost method if you want.
319 |
320 | **Output values for old syntax** This applies only to the old syntax.
321 | - `outputs`: Output is always a tuple. We can call our outputs in that tuple as `out1` and `out2` below.
322 | - `out1` and `out2`: If you sent in just one dataframe or filename as input, you will get:
323 | - 1. `features`: It will be a list (of selected features) and
324 | - 2. `trainm`: It will be a dataframe (if you sent in a file or dataname as input)
325 | - `out1` and `out2`: If you sent in two files or dataframes (train and test), you will get:
326 | - 1. `trainm`: a modified train dataframe with engineered and selected features from dataname and
327 | - 2. `testm`: a modified test dataframe with engineered and selected features from test_data.
328 |
329 | ## Additional
330 | To learn more about how featurewiz works under the hood, watch this [video](https://www.youtube.com/embed/ZiNutwPcAU0)
331 |
332 | 
333 |
334 | featurewiz was designed for selecting High Performance variables with the fewest steps.
335 | In most cases, featurewiz builds models with 20%-99% fewer features than your original data set with nearly the same or slightly lower performance (this is based on my trials. Your experience may vary).
336 |
337 | featurewiz is every Data Scientist's feature wizard that will:
338 | - Automatically pre-process data: you can send in your entire dataframe "as is" and featurewiz will classify your variables and change/label-encode categorical variables to help XGBoost processing. It classifies variables as numeric, categorical, NLP or date-time variables automatically so it can use them correctly in modeling.
339 | - Perform feature engineering automatically: Creating "interaction" variables, adding "group-by" features, or "target-encoding" categorical variables is difficult, and sifting through those hundreds of new features is painstaking and usually left to "experts". Now, with featurewiz, you can use deep learning to extract features with the click of a mouse. This is very helpful when you have imbalanced classes or thousands of features to deal with. However, be careful with this option: you can very easily spend a lot of time tuning these neural networks.
340 |
340 | - Perform feature reduction automatically: When you have small data sets and you know your domain well, it is easy to do EDA and identify which variables are important. But when you have a very large data set with hundreds, if not thousands, of variables, selecting the best features can mean the difference between a bloated, highly complex model and a simple model with the fewest, most information-rich features. featurewiz uses XGBoost repeatedly to perform feature selection. You must try it on your large data sets and compare!
341 | - Explain the SULOV method graphically using the networkx library, so you can automatically see which variables are highly correlated with which, and which of those have high or low mutual information scores. Just set verbose = 2 to see the graph.
342 | - Build a fast XGBoost or LightGBM model using the features selected by featurewiz. There is a function called "simple_lightgbm_model" which you can use to build a fast model. It is a new module, so check it out.
343 |
344 |
345 | *** Special thanks to fellow open source Contributors ***:
346 |
347 | - Alex Lekov (https://github.com/Alex-Lekov/AutoML_Alex/tree/master/automl_alex) for his DataBunch and encoders modules which are used by the tool (although with some modifications).
348 | - Category Encoders library in Python : This is an amazing library. Make sure you read all about the encoders that featurewiz uses here: https://contrib.scikit-learn.org/category_encoders/index.html
349 |
350 |
351 | ## Maintainers
352 |
353 | * [@AutoViML](https://github.com/AutoViML)
354 |
355 | ## Contributing
356 |
357 | See [the contributing file](CONTRIBUTING.md)!
358 |
359 | PRs accepted.
360 |
361 | ## License
362 |
363 | Apache License 2.0 © 2020 Ram Seshadri
364 |
365 | ## DISCLAIMER
366 | This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.
367 |
368 |
369 | [page]: examples/cross_validate.py
370 |
--------------------------------------------------------------------------------
/examples/cross_validate.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | from sklearn.model_selection import train_test_split
3 | from sklearn.linear_model import LogisticRegression
4 | from featurewiz import FeatureWiz
5 |
6 | # Load the dataset into a pandas dataframe
7 | df = pd.read_csv(trainfile, sep=sep)  # set `trainfile` (path to your CSV) and `sep` before running
8 |
9 | # Define your target variable
10 | target = 'your_target_column'  # replace with the name of your target column
11 |
12 | # Split the data into training and testing sets
13 | X_train, X_test, y_train, y_test = train_test_split(df.drop(target, axis=1), df[target], test_size=0.2, random_state=42)
14 |
15 | # Define the number of rounds
16 | num_rounds = 3
17 |
18 | # Perform multiple rounds of feature selection using rows
19 | selected_features = []
20 | for i in range(num_rounds):
21 | # Split the training set into a new training set and a validation set
22 | X_new_train, X_val, y_new_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=i)
23 |
24 | # Use Featurewiz to select the best features on the new training set
25 | fwiz = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='',
26 | dask_xgboost_flag=False, nrows=None, verbose=0)
27 | X_new_train_selected = fwiz.fit_transform(X_new_train, y_new_train)
28 | X_new_val_selected = fwiz.transform(X_val)
29 |
30 | # Evaluate the performance of the model on the validation set with the selected features
31 | model = LogisticRegression()
32 | model.fit(X_new_train_selected, y_new_train)
33 | accuracy = model.score(X_new_val_selected, y_val)
34 |
35 | # Print the accuracy of the model on the validation set
36 | print(f'Round {i+1}: Validation accuracy is {accuracy:.2f}.')
37 |
38 | # Get the selected features from Featurewiz and add them to a list
39 | selected_features.append(fwiz.features)
40 | fwiz_all = fwiz.lazy
41 | ### this saves the lazy transformer from featurewiz for next round ###
42 |
43 | # Find the most common set of features (most stable) and use them to train a logistic regression model
44 | common_features = list(set(selected_features[0]).intersection(*selected_features))
45 | print('Common most stable features:', len(common_features), 'features are:\n', common_features)
46 | #### Now transform your features to all-numeric using featurewiz' lazy transformer ###
47 | X_train_selected_all = fwiz_all.transform(X_train)
48 | X_test_selected_all = fwiz_all.transform(X_test)
49 |
50 | # Evaluate the performance of the model on each round and compare it to the final accuracy with common features
51 | accuracies = []
52 | for i in range(num_rounds):
53 | model_round = LogisticRegression()
54 | model_round.fit(X_train_selected_all[selected_features[i]], y_train)
55 | accuracy_round = model_round.score(X_test_selected_all[selected_features[i]], y_test)
56 | accuracies.append(accuracy_round)
57 |
58 | model_final = LogisticRegression()
59 | model_final.fit(X_train_selected_all[common_features], y_train)
60 | accuracy_final = model_final.score(X_test_selected_all[common_features], y_test)
61 | print('Individual accuracy from',len(accuracies),'rounds is:',accuracies)
62 | print('Average accuracy from', num_rounds, 'rounds =', sum(accuracies)/len(accuracies), '\nvs. final accuracy with common features: ', accuracy_final)
--------------------------------------------------------------------------------
/examples/featurewiz_autoencoders_demo.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "567c5ccb",
7 | "metadata": {
8 | "scrolled": true
9 | },
10 | "outputs": [],
11 | "source": [
12 | "import numpy as np\n",
13 | "from sklearn.preprocessing import MinMaxScaler\n",
14 | "import pandas as pd"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 2,
20 | "id": "a0a0470e",
21 | "metadata": {},
22 | "outputs": [
23 | {
24 | "name": "stdout",
25 | "output_type": "stream",
26 | "text": [
27 | "Imported lazytransform v1.15. \n",
28 | "\n",
29 | "Imported featurewiz 0.5.6. Use the following syntax:\n",
30 | " >>> wiz = FeatureWiz(feature_engg = '', nrows=None, transform_target=True,\n",
31 | " \t\tcategory_encoders=\"auto\", auto_encoders='VAE', ae_options={},\n",
32 | " \t\tadd_missing=False, imbalanced=False, verbose=0)\n",
33 | " >>> X_train_selected, y_train = wiz.fit_transform(X_train, y_train)\n",
34 | " >>> X_test_selected = wiz.transform(X_test)\n",
35 | " >>> selected_features = wiz.features\n",
36 | " \n"
37 | ]
38 | }
39 | ],
40 | "source": [
41 | "from featurewiz import FeatureWiz\n",
42 | "from featurewiz import print_regression_metrics, print_classification_metrics"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 3,
48 | "id": "21a19841",
49 | "metadata": {},
50 | "outputs": [
51 | {
52 | "name": "stdout",
53 | "output_type": "stream",
54 | "text": [
55 | "(6497, 13)\n"
56 | ]
57 | },
58 | {
59 | "data": {
60 | "text/html": [
61 | "\n",
62 | "\n",
75 | "
\n",
76 | " \n",
77 | " \n",
78 | " | \n",
79 | " fixed acidity | \n",
80 | " volatile acidity | \n",
81 | " citric acid | \n",
82 | " residual sugar | \n",
83 | " chlorides | \n",
84 | " free sulfur dioxide | \n",
85 | " total sulfur dioxide | \n",
86 | " density | \n",
87 | " pH | \n",
88 | " sulphates | \n",
89 | " alcohol | \n",
90 | " quality | \n",
91 | " red_wine | \n",
92 | "
\n",
93 | " \n",
94 | " \n",
95 | " \n",
96 | " 0 | \n",
97 | " 7.4 | \n",
98 | " 0.70 | \n",
99 | " 0.00 | \n",
100 | " 1.9 | \n",
101 | " 0.076 | \n",
102 | " 11.0 | \n",
103 | " 34.0 | \n",
104 | " 0.9978 | \n",
105 | " 3.51 | \n",
106 | " 0.56 | \n",
107 | " 9.4 | \n",
108 | " 5 | \n",
109 | " 1 | \n",
110 | "
\n",
111 | " \n",
112 | " 1 | \n",
113 | " 7.8 | \n",
114 | " 0.88 | \n",
115 | " 0.00 | \n",
116 | " 2.6 | \n",
117 | " 0.098 | \n",
118 | " 25.0 | \n",
119 | " 67.0 | \n",
120 | " 0.9968 | \n",
121 | " 3.20 | \n",
122 | " 0.68 | \n",
123 | " 9.8 | \n",
124 | " 5 | \n",
125 | " 1 | \n",
126 | "
\n",
127 | " \n",
128 | " 2 | \n",
129 | " 7.8 | \n",
130 | " 0.76 | \n",
131 | " 0.04 | \n",
132 | " 2.3 | \n",
133 | " 0.092 | \n",
134 | " 15.0 | \n",
135 | " 54.0 | \n",
136 | " 0.9970 | \n",
137 | " 3.26 | \n",
138 | " 0.65 | \n",
139 | " 9.8 | \n",
140 | " 5 | \n",
141 | " 1 | \n",
142 | "
\n",
143 | " \n",
144 | " 3 | \n",
145 | " 11.2 | \n",
146 | " 0.28 | \n",
147 | " 0.56 | \n",
148 | " 1.9 | \n",
149 | " 0.075 | \n",
150 | " 17.0 | \n",
151 | " 60.0 | \n",
152 | " 0.9980 | \n",
153 | " 3.16 | \n",
154 | " 0.58 | \n",
155 | " 9.8 | \n",
156 | " 6 | \n",
157 | " 1 | \n",
158 | "
\n",
159 | " \n",
160 | " 4 | \n",
161 | " 7.4 | \n",
162 | " 0.70 | \n",
163 | " 0.00 | \n",
164 | " 1.9 | \n",
165 | " 0.076 | \n",
166 | " 11.0 | \n",
167 | " 34.0 | \n",
168 | " 0.9978 | \n",
169 | " 3.51 | \n",
170 | " 0.56 | \n",
171 | " 9.4 | \n",
172 | " 5 | \n",
173 | " 1 | \n",
174 | "
\n",
175 | " \n",
176 | "
\n",
177 | "
"
178 | ],
179 | "text/plain": [
180 | " fixed acidity volatile acidity citric acid residual sugar chlorides \\\n",
181 | "0 7.4 0.70 0.00 1.9 0.076 \n",
182 | "1 7.8 0.88 0.00 2.6 0.098 \n",
183 | "2 7.8 0.76 0.04 2.3 0.092 \n",
184 | "3 11.2 0.28 0.56 1.9 0.075 \n",
185 | "4 7.4 0.70 0.00 1.9 0.076 \n",
186 | "\n",
187 | " free sulfur dioxide total sulfur dioxide density pH sulphates \\\n",
188 | "0 11.0 34.0 0.9978 3.51 0.56 \n",
189 | "1 25.0 67.0 0.9968 3.20 0.68 \n",
190 | "2 15.0 54.0 0.9970 3.26 0.65 \n",
191 | "3 17.0 60.0 0.9980 3.16 0.58 \n",
192 | "4 11.0 34.0 0.9978 3.51 0.56 \n",
193 | "\n",
194 | " alcohol quality red_wine \n",
195 | "0 9.4 5 1 \n",
196 | "1 9.8 5 1 \n",
197 | "2 9.8 5 1 \n",
198 | "3 9.8 6 1 \n",
199 | "4 9.4 5 1 "
200 | ]
201 | },
202 | "execution_count": 3,
203 | "metadata": {},
204 | "output_type": "execute_result"
205 | }
206 | ],
207 | "source": [
208 | "trainfile = 'c:/users/ram/documents/ram/data_sets/kaggle/diabetes.csv'\n",
209 | "datapath = '../Ram/Data_Sets/'\n",
210 | "filename = 'winequality.csv'\n",
211 | "#filename = 'affairs.csv'\n",
212 | "trainfile = datapath+filename\n",
213 | "sep = ','\n",
214 | "dft = pd.read_csv(trainfile,sep=sep)\n",
215 | "#dft.drop(['affairs','affair'],axis=1, inplace=True)\n",
216 | "print(dft.shape)\n",
217 | "dft.head()"
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": 4,
223 | "id": "603bf23c",
224 | "metadata": {},
225 | "outputs": [
226 | {
227 | "data": {
228 | "text/plain": [
229 | "7"
230 | ]
231 | },
232 | "execution_count": 4,
233 | "metadata": {},
234 | "output_type": "execute_result"
235 | }
236 | ],
237 | "source": [
238 | "target = 'quality'\n",
239 | "#target = 'affair_multiclass'\n",
240 | "modeltype = 'Multi_Classification'\n",
241 | "preds = [x for x in list(dft) if x not in target]\n",
242 | "dft[target].nunique()"
243 | ]
244 | },
245 | {
246 | "cell_type": "raw",
247 | "id": "1ba875d4",
248 | "metadata": {},
249 | "source": [
250 | "from sklearn.datasets import make_classification, make_regression\n",
251 | "from sklearn.model_selection import train_test_split\n",
252 | "from sklearn.metrics import accuracy_score, mean_squared_error\n",
253 | "if modeltype == 'Regression':\n",
254 | " X, y = make_regression(n_samples=10000, noise=1000, n_features=8, random_state=0)\n",
255 | "else:\n",
256 | " X, y = make_classification(n_samples=10000, n_classes=5, n_features=8, n_informative=4, random_state=0)\n",
257 | "# split dataset into train and test sets\n",
258 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=99)"
259 | ]
260 | },
261 | {
262 | "cell_type": "code",
263 | "execution_count": 5,
264 | "id": "1494931d",
265 | "metadata": {},
266 | "outputs": [
267 | {
268 | "name": "stdout",
269 | "output_type": "stream",
270 | "text": [
271 | "(5197, 12) (1300, 12)\n"
272 | ]
273 | }
274 | ],
275 | "source": [
276 | "from sklearn.model_selection import train_test_split\n",
277 | "from featurewiz import FE_kmeans_resampler\n",
278 | "if modeltype == 'Regression':\n",
279 | " X_train, X_test, y_train, y_test = train_test_split(dft[preds], dft[target], test_size=0.20, random_state=1,)\n",
280 | " X_train_over, y_train_over = FE_kmeans_resampler(X_train, y_train, target, smote='',verbose=0)\n",
281 | " print(X_train_over.shape, X_test.shape)\n",
282 | " #train, test = pd.concat([X_train_over, pd.Series(y_train_over,name=target)], axis=1), pd.concat([X_test, y_test], axis=1)\n",
283 | " train, test = train_test_split(dft, test_size=0.20, random_state=42)\n",
284 | "else:\n",
285 | " X_train, X_test, y_train, y_test = train_test_split(dft[preds], dft[target], test_size=0.20, \n",
286 | " stratify=dft[target],\n",
287 | " random_state=42)\n",
288 | " train, test = train_test_split(dft, test_size=0.20, random_state=42,\n",
289 | " stratify=dft[target]\n",
290 | " )\n",
291 | "print(X_train.shape, X_test.shape)"
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "execution_count": 6,
297 | "id": "6f387d3e",
298 | "metadata": {},
299 | "outputs": [
300 | {
301 | "name": "stdout",
302 | "output_type": "stream",
303 | "text": [
304 | "featurewiz is given 0.9 as correlation limit...\n",
305 | " Skipping feature engineering since no feature_engg input...\n",
306 | " final list of category encoders given: ['onehot', 'label']\n",
307 | " You need to pip install tensorflow>= 2.5 in order to use this Autoencoder option.\n",
308 | "Since Auto Encoders are selected for feature extraction,\n",
309 | " Recursive XGBoost is also skipped...\n",
310 | "CNNAutoEncoder()\n",
311 | " AE dictionary given: dict_items([])\n",
312 | " final list of scalers given: [minmax]\n"
313 | ]
314 | }
315 | ],
316 | "source": [
317 | "scaler = FeatureWiz(feature_engg = '', nrows=None, transform_target=True,\n",
318 | " \t\tcategory_encoders=\"auto\", auto_encoders='CNN_ADD', ae_options={},\n",
319 | " \t\tadd_missing=False, imbalanced=False, verbose=0)"
320 | ]
321 | },
322 | {
323 | "cell_type": "code",
324 | "execution_count": 7,
325 | "id": "a8d5fedd",
326 | "metadata": {},
327 | "outputs": [
328 | {
329 | "name": "stdout",
330 | "output_type": "stream",
331 | "text": [
332 | "Loaded input data. Shape = (5197, 12)\n",
333 | "#### Starting featurewiz transform for train data ####\n",
334 | " Single_Label Multi_Classification problem \n",
335 | "Shape of dataset: (5197, 12). Now we classify variables into different types...\n",
336 | "Time taken to define data pipeline = 1 second(s)\n",
337 | "No model input given...\n",
338 | "Lazy Transformer Pipeline created...\n",
339 | " transformed target from object type to numeric\n",
340 | " Time taken to fit dataset = 1 second(s)\n",
341 | " Time taken to transform dataset = 1 second(s)\n",
342 | " Shape of transformed dataset: (5197, 12)\n",
343 | " No hyperparam selection since GAN or CNN is selected for auto_encoders...\n",
344 | "Fitting and transforming CNNAutoEncoder for dataset...\n",
345 | "Epoch 1/100\n",
346 | "130/130 [==============================] - 2s 7ms/step - loss: 0.0133 - val_loss: 0.0038 - lr: 0.0010\n",
347 | "Epoch 2/100\n",
348 | "130/130 [==============================] - 1s 5ms/step - loss: 0.0029 - val_loss: 0.0022 - lr: 0.0010\n",
349 | "Epoch 3/100\n",
350 | "130/130 [==============================] - 1s 5ms/step - loss: 0.0019 - val_loss: 0.0015 - lr: 0.0010\n",
351 | "Epoch 4/100\n",
352 | "130/130 [==============================] - 1s 6ms/step - loss: 0.0012 - val_loss: 0.0011 - lr: 0.0010\n",
353 | "Epoch 5/100\n",
354 | "130/130 [==============================] - 1s 6ms/step - loss: 9.0416e-04 - val_loss: 8.1538e-04 - lr: 0.0010\n",
355 | "Epoch 6/100\n",
356 | "130/130 [==============================] - 1s 6ms/step - loss: 6.8004e-04 - val_loss: 5.5630e-04 - lr: 0.0010\n",
357 | "Epoch 7/100\n",
358 | "130/130 [==============================] - 1s 6ms/step - loss: 4.7319e-04 - val_loss: 4.5762e-04 - lr: 0.0010\n",
359 | "Epoch 8/100\n",
360 | "130/130 [==============================] - 1s 5ms/step - loss: 4.0537e-04 - val_loss: 4.0449e-04 - lr: 0.0010\n",
361 | "Epoch 9/100\n",
362 | "130/130 [==============================] - 1s 5ms/step - loss: 3.7104e-04 - val_loss: 3.6171e-04 - lr: 0.0010\n",
363 | "Epoch 10/100\n",
364 | "130/130 [==============================] - 1s 5ms/step - loss: 3.4615e-04 - val_loss: 3.7683e-04 - lr: 0.0010\n",
365 | "Epoch 11/100\n",
366 | "130/130 [==============================] - 1s 5ms/step - loss: 3.2180e-04 - val_loss: 3.3386e-04 - lr: 0.0010\n",
367 | "Epoch 12/100\n",
368 | "130/130 [==============================] - 1s 6ms/step - loss: 3.1295e-04 - val_loss: 3.1308e-04 - lr: 0.0010\n",
369 | "Epoch 13/100\n",
370 | "130/130 [==============================] - 1s 6ms/step - loss: 2.9671e-04 - val_loss: 3.3688e-04 - lr: 0.0010\n",
371 | "Epoch 14/100\n",
372 | "130/130 [==============================] - 1s 6ms/step - loss: 2.4843e-04 - val_loss: 2.6806e-04 - lr: 5.0000e-04\n",
373 | "Epoch 15/100\n",
374 | "130/130 [==============================] - 1s 5ms/step - loss: 2.4102e-04 - val_loss: 2.5663e-04 - lr: 5.0000e-04\n",
375 | "Epoch 16/100\n",
376 | "130/130 [==============================] - 1s 5ms/step - loss: 2.3189e-04 - val_loss: 2.6313e-04 - lr: 5.0000e-04\n",
377 | "Epoch 17/100\n",
378 | "130/130 [==============================] - 1s 5ms/step - loss: 2.3202e-04 - val_loss: 2.5388e-04 - lr: 5.0000e-04\n",
379 | "Epoch 18/100\n",
380 | "130/130 [==============================] - 1s 5ms/step - loss: 2.2195e-04 - val_loss: 2.3446e-04 - lr: 5.0000e-04\n",
381 | "Epoch 19/100\n",
382 | "130/130 [==============================] - 1s 5ms/step - loss: 2.1615e-04 - val_loss: 2.3867e-04 - lr: 5.0000e-04\n",
383 | "Epoch 20/100\n",
384 | "130/130 [==============================] - 1s 5ms/step - loss: 1.9608e-04 - val_loss: 2.2372e-04 - lr: 2.5000e-04\n",
385 | "Epoch 21/100\n",
386 | "130/130 [==============================] - 1s 6ms/step - loss: 1.9390e-04 - val_loss: 2.1614e-04 - lr: 2.5000e-04\n",
387 | "Epoch 22/100\n",
388 | "130/130 [==============================] - 1s 6ms/step - loss: 1.8848e-04 - val_loss: 2.0742e-04 - lr: 2.5000e-04\n",
389 | "Epoch 23/100\n",
390 | "130/130 [==============================] - 1s 5ms/step - loss: 1.8496e-04 - val_loss: 2.1073e-04 - lr: 2.5000e-04\n",
391 | "Epoch 24/100\n",
392 | "130/130 [==============================] - 1s 5ms/step - loss: 1.8005e-04 - val_loss: 2.0087e-04 - lr: 2.5000e-04\n",
393 | "Epoch 25/100\n",
394 | "130/130 [==============================] - 1s 5ms/step - loss: 1.7124e-04 - val_loss: 1.9250e-04 - lr: 1.2500e-04\n",
395 | "Epoch 26/100\n",
396 | "130/130 [==============================] - 1s 5ms/step - loss: 1.6716e-04 - val_loss: 1.8936e-04 - lr: 1.2500e-04\n",
397 | "Epoch 27/100\n",
398 | "130/130 [==============================] - 1s 5ms/step - loss: 1.6401e-04 - val_loss: 1.8330e-04 - lr: 1.2500e-04\n",
399 | "Epoch 28/100\n",
400 | "130/130 [==============================] - 1s 5ms/step - loss: 1.6035e-04 - val_loss: 1.8186e-04 - lr: 1.2500e-04\n",
401 | "Epoch 29/100\n",
402 | "130/130 [==============================] - 1s 5ms/step - loss: 1.5901e-04 - val_loss: 1.8252e-04 - lr: 1.2500e-04\n",
403 | "Epoch 30/100\n",
404 | "130/130 [==============================] - 1s 6ms/step - loss: 1.5409e-04 - val_loss: 1.7704e-04 - lr: 1.0000e-04\n",
405 | "Epoch 31/100\n",
406 | "130/130 [==============================] - 1s 5ms/step - loss: 1.5029e-04 - val_loss: 1.6895e-04 - lr: 1.0000e-04\n",
407 | "Epoch 32/100\n",
408 | "130/130 [==============================] - 1s 5ms/step - loss: 1.4684e-04 - val_loss: 1.6386e-04 - lr: 1.0000e-04\n",
409 | "Epoch 33/100\n",
410 | "130/130 [==============================] - 1s 5ms/step - loss: 1.4408e-04 - val_loss: 1.6665e-04 - lr: 1.0000e-04\n",
411 | "Epoch 34/100\n",
412 | "130/130 [==============================] - 1s 5ms/step - loss: 1.4066e-04 - val_loss: 1.5804e-04 - lr: 1.0000e-04\n",
413 | "Epoch 35/100\n",
414 | "130/130 [==============================] - 1s 5ms/step - loss: 1.3626e-04 - val_loss: 1.5064e-04 - lr: 1.0000e-04\n",
415 | "Epoch 36/100\n",
416 | "130/130 [==============================] - 1s 5ms/step - loss: 1.3172e-04 - val_loss: 1.4678e-04 - lr: 1.0000e-04\n",
417 | "Epoch 37/100\n",
418 | "130/130 [==============================] - 1s 5ms/step - loss: 1.2683e-04 - val_loss: 1.4821e-04 - lr: 1.0000e-04\n",
419 | "Epoch 38/100\n",
420 | "130/130 [==============================] - 1s 5ms/step - loss: 1.2296e-04 - val_loss: 1.3887e-04 - lr: 1.0000e-04\n",
421 | "Epoch 39/100\n",
422 | "130/130 [==============================] - 1s 5ms/step - loss: 1.1663e-04 - val_loss: 1.2789e-04 - lr: 1.0000e-04\n",
423 | "Epoch 40/100\n",
424 | "130/130 [==============================] - 1s 5ms/step - loss: 1.1185e-04 - val_loss: 1.2739e-04 - lr: 1.0000e-04\n",
425 | "Epoch 41/100\n",
426 | "130/130 [==============================] - 1s 5ms/step - loss: 1.0750e-04 - val_loss: 1.1749e-04 - lr: 1.0000e-04\n",
427 | "Epoch 42/100\n",
428 | "130/130 [==============================] - 1s 5ms/step - loss: 1.0330e-04 - val_loss: 1.1409e-04 - lr: 1.0000e-04\n",
429 | "Epoch 43/100\n",
430 | "130/130 [==============================] - 1s 7ms/step - loss: 9.8775e-05 - val_loss: 1.0765e-04 - lr: 1.0000e-04\n",
431 | "Epoch 44/100\n",
432 | "130/130 [==============================] - 1s 6ms/step - loss: 9.4918e-05 - val_loss: 1.0886e-04 - lr: 1.0000e-04\n",
433 | "Epoch 45/100\n",
434 | "130/130 [==============================] - 1s 6ms/step - loss: 9.0940e-05 - val_loss: 1.0206e-04 - lr: 1.0000e-04\n",
435 | "Epoch 46/100\n",
436 | "130/130 [==============================] - 1s 6ms/step - loss: 8.8166e-05 - val_loss: 9.9367e-05 - lr: 1.0000e-04\n",
437 | "Epoch 47/100\n",
438 | "130/130 [==============================] - 1s 6ms/step - loss: 8.5611e-05 - val_loss: 9.5914e-05 - lr: 1.0000e-04\n",
439 | "Epoch 48/100\n",
440 | "130/130 [==============================] - 1s 6ms/step - loss: 8.2108e-05 - val_loss: 9.6259e-05 - lr: 1.0000e-04\n",
441 | "Epoch 49/100\n",
442 | "130/130 [==============================] - 1s 6ms/step - loss: 8.0729e-05 - val_loss: 9.1706e-05 - lr: 1.0000e-04\n",
443 | "Epoch 50/100\n",
444 | "130/130 [==============================] - 1s 6ms/step - loss: 7.8945e-05 - val_loss: 8.6837e-05 - lr: 1.0000e-04\n",
445 | "Epoch 51/100\n",
446 | "130/130 [==============================] - 1s 6ms/step - loss: 7.7156e-05 - val_loss: 8.8305e-05 - lr: 1.0000e-04\n",
447 | "Epoch 52/100\n",
448 | "130/130 [==============================] - 1s 6ms/step - loss: 7.5093e-05 - val_loss: 8.8565e-05 - lr: 1.0000e-04\n",
449 | "Epoch 53/100\n",
450 | "130/130 [==============================] - 1s 5ms/step - loss: 7.5755e-05 - val_loss: 8.5557e-05 - lr: 1.0000e-04\n",
451 | "Epoch 54/100\n",
452 | "130/130 [==============================] - 1s 5ms/step - loss: 7.2456e-05 - val_loss: 8.1475e-05 - lr: 1.0000e-04\n",
453 | "Epoch 55/100\n",
454 | "130/130 [==============================] - 1s 5ms/step - loss: 7.1072e-05 - val_loss: 7.8665e-05 - lr: 1.0000e-04\n",
455 | "Epoch 56/100\n",
456 | "130/130 [==============================] - 1s 5ms/step - loss: 7.1659e-05 - val_loss: 8.1373e-05 - lr: 1.0000e-04\n",
457 | "Epoch 57/100\n",
458 | "130/130 [==============================] - 1s 6ms/step - loss: 6.9972e-05 - val_loss: 8.0773e-05 - lr: 1.0000e-04\n",
459 | "Epoch 58/100\n",
460 | "130/130 [==============================] - 1s 6ms/step - loss: 6.8407e-05 - val_loss: 7.8222e-05 - lr: 1.0000e-04\n",
461 | "Epoch 59/100\n",
462 | "130/130 [==============================] - 1s 5ms/step - loss: 6.5964e-05 - val_loss: 7.4572e-05 - lr: 1.0000e-04\n",
463 | "Epoch 60/100\n",
464 | "130/130 [==============================] - 1s 5ms/step - loss: 6.6795e-05 - val_loss: 8.0847e-05 - lr: 1.0000e-04\n",
465 | "Epoch 61/100\n"
466 | ]
467 | },
468 | {
469 | "name": "stdout",
470 | "output_type": "stream",
471 | "text": [
472 | "130/130 [==============================] - 1s 5ms/step - loss: 6.5861e-05 - val_loss: 7.6759e-05 - lr: 1.0000e-04\n",
473 | "Epoch 62/100\n",
474 | "130/130 [==============================] - 1s 5ms/step - loss: 6.4737e-05 - val_loss: 7.4082e-05 - lr: 1.0000e-04\n",
475 | "Epoch 63/100\n",
476 | "130/130 [==============================] - 1s 5ms/step - loss: 6.4065e-05 - val_loss: 7.2013e-05 - lr: 1.0000e-04\n",
477 | "Epoch 64/100\n",
478 | "130/130 [==============================] - 1s 5ms/step - loss: 6.3586e-05 - val_loss: 7.1381e-05 - lr: 1.0000e-04\n",
479 | "Epoch 65/100\n",
480 | "130/130 [==============================] - 1s 5ms/step - loss: 6.2723e-05 - val_loss: 6.9830e-05 - lr: 1.0000e-04\n",
481 | "Epoch 66/100\n",
482 | "130/130 [==============================] - 1s 5ms/step - loss: 6.1998e-05 - val_loss: 7.1838e-05 - lr: 1.0000e-04\n",
483 | "Epoch 67/100\n",
484 | "130/130 [==============================] - 1s 5ms/step - loss: 6.0689e-05 - val_loss: 6.7790e-05 - lr: 1.0000e-04\n",
485 | "Epoch 68/100\n",
486 | "130/130 [==============================] - 1s 5ms/step - loss: 6.0923e-05 - val_loss: 6.7505e-05 - lr: 1.0000e-04\n",
487 | "Epoch 69/100\n",
488 | "130/130 [==============================] - 1s 5ms/step - loss: 6.0179e-05 - val_loss: 6.8082e-05 - lr: 1.0000e-04\n",
489 | "Epoch 70/100\n",
490 | "130/130 [==============================] - 1s 5ms/step - loss: 5.8860e-05 - val_loss: 7.0600e-05 - lr: 1.0000e-04\n",
491 | "Epoch 71/100\n",
492 | "130/130 [==============================] - 1s 5ms/step - loss: 5.9904e-05 - val_loss: 6.6730e-05 - lr: 1.0000e-04\n",
493 | "Epoch 72/100\n",
494 | "130/130 [==============================] - 1s 5ms/step - loss: 5.7617e-05 - val_loss: 6.6053e-05 - lr: 1.0000e-04\n",
495 | "Epoch 73/100\n",
496 | "130/130 [==============================] - 1s 5ms/step - loss: 5.8018e-05 - val_loss: 6.3887e-05 - lr: 1.0000e-04\n",
497 | "Epoch 74/100\n",
498 | "129/130 [============================>.] - ETA: 0s - loss: 5.6845e-05Restoring model weights from the end of the best epoch: 64.\n",
499 | "130/130 [==============================] - 1s 5ms/step - loss: 5.6692e-05 - val_loss: 6.5505e-05 - lr: 1.0000e-04\n",
500 | "Epoch 00074: early stopping\n",
501 | "Shape of transformed data due to auto encoder = (5197, 24)\n",
502 | " Single_Label Multi_Classification problem \n",
503 | "Starting SULOV with 24 features...\n",
504 | " there are no null values in dataset...\n",
505 | " there are no null values in target column...\n",
506 | "Completed SULOV. 12 features selected\n",
507 | " time taken to run entire featurewiz = 57 second(s)\n",
508 | "Recursive XGBoost selected 12 features...\n"
509 | ]
510 | }
511 | ],
512 | "source": [
513 | "# Load and preprocess your dataset\n",
514 | "# Assuming X_train and y_train are your training data and labels\n",
515 | "X_train_selected, y_train = scaler.fit_transform(X_train, y_train)"
516 | ]
517 | },
518 | {
519 | "cell_type": "code",
520 | "execution_count": 8,
521 | "id": "1d287a58",
522 | "metadata": {},
523 | "outputs": [
524 | {
525 | "name": "stdout",
526 | "output_type": "stream",
527 | "text": [
528 | "#### Starting featurewiz transform for test data ####\n",
529 | "Loaded input data. Shape = (1300, 12)\n",
530 | "#### Starting lazytransform for test data ####\n",
531 | " Time taken to transform dataset = 1 second(s)\n",
532 | " Shape of transformed dataset: (1300, 12)\n",
533 | "Shape of transformed data due to auto encoder = (1300, 24)\n",
534 | "Returning dataframe with 12 features \n"
535 | ]
536 | }
537 | ],
538 | "source": [
539 | "### Since you modified y_train to numeric, you must do same for y_test\n",
540 | "X_test_selected = scaler.transform(X_test)\n",
541 | "if scaler.lazy.yformer:\n",
542 | " y_test = scaler.lazy.yformer.transform(y_test)"
543 | ]
544 | },
545 | {
546 | "cell_type": "code",
547 | "execution_count": 9,
548 | "id": "e1a62c58",
549 | "metadata": {},
550 | "outputs": [
551 | {
552 | "name": "stdout",
553 | "output_type": "stream",
554 | "text": [
555 | "Bal accu 36%\n",
556 | "ROC AUC = 0.84\n",
557 | " precision recall f1-score support\n",
558 | "\n",
559 | " 0 0.00 0.00 0.00 6\n",
560 | " 1 0.71 0.12 0.20 43\n",
561 | " 2 0.73 0.69 0.71 428\n",
562 | " 3 0.65 0.79 0.71 567\n",
563 | " 4 0.69 0.56 0.62 216\n",
564 | " 5 0.93 0.36 0.52 39\n",
565 | " 6 0.00 0.00 0.00 1\n",
566 | "\n",
567 | " accuracy 0.68 1300\n",
568 | " macro avg 0.53 0.36 0.39 1300\n",
569 | "weighted avg 0.69 0.68 0.67 1300\n",
570 | "\n",
571 | "final average balanced accuracy score = 0.36\n"
572 | ]
573 | }
574 | ],
575 | "source": [
576 | "import numpy as np\n",
577 | "from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor\n",
578 | "from sklearn.utils import class_weight\n",
579 | "from sklearn.metrics import accuracy_score, classification_report\n",
580 | "from featurewiz import get_class_distribution\n",
581 | "from xgboost import XGBClassifier, XGBRFRegressor\n",
582 | "# Updating the Random Forest Classifier with the corrected class weights\n",
583 | "if modeltype == 'Regression':\n",
584 | " #rf_classifier = RandomForestRegressor(random_state=42)\n",
585 | " rf_classifier = XGBRFRegressor(n_estimators=300, random_state=99)\n",
586 | "else:\n",
587 | " # Correctly computing class weights for the classes present in the training set\n",
588 | " class_weights_dict_corrected = get_class_distribution(y_train)\n",
589 | " rf_classifier = RandomForestClassifier(n_estimators=100, class_weight=class_weights_dict_corrected, random_state=42)\n",
590 | "\n",
591 | "\n",
592 | "# Fitting the classifier on the training data\n",
593 | "rf_classifier.fit(X_train_selected, y_train)\n",
594 | "\n",
595 | "# Predicting on the test set\n",
596 | "y_pred = rf_classifier.predict(X_test_selected)\n",
597 | "\n",
598 | "if modeltype == 'Regression':\n",
599 | " print_regression_metrics(y_test, y_pred, verbose=1)\n",
600 | "else:\n",
601 | " # Evaluating the classifier\n",
602 | " y_probas = rf_classifier.predict_proba(X_test_selected)\n",
603 | " print_classification_metrics(y_test, y_pred, y_probas, verbose=1)"
604 | ]
605 | },
606 | {
607 | "cell_type": "code",
608 | "execution_count": null,
609 | "id": "1dd98599",
610 | "metadata": {},
611 | "outputs": [],
612 | "source": []
613 | },
614 | {
615 | "cell_type": "code",
616 | "execution_count": null,
617 | "id": "d71a2530",
618 | "metadata": {},
619 | "outputs": [],
620 | "source": []
621 | }
622 | ],
623 | "metadata": {
624 | "kernelspec": {
625 | "display_name": "Python 3",
626 | "language": "python",
627 | "name": "python3"
628 | },
629 | "language_info": {
630 | "codemirror_mode": {
631 | "name": "ipython",
632 | "version": 3
633 | },
634 | "file_extension": ".py",
635 | "mimetype": "text/x-python",
636 | "name": "python",
637 | "nbconvert_exporter": "python",
638 | "pygments_lexer": "ipython3",
639 | "version": "3.8.5"
640 | }
641 | },
642 | "nbformat": 4,
643 | "nbformat_minor": 5
644 | }
645 |
--------------------------------------------------------------------------------
/examples/heart.csv:
--------------------------------------------------------------------------------
1 | age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
2 | 63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
3 | 37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
4 | 41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
5 | 56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
6 | 57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
7 | 57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
8 | 56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
9 | 44,1,1,120,263,0,1,173,0,0,2,0,3,1
10 | 52,1,2,172,199,1,1,162,0,0.5,2,0,3,1
11 | 57,1,2,150,168,0,1,174,0,1.6,2,0,2,1
12 | 54,1,0,140,239,0,1,160,0,1.2,2,0,2,1
13 | 48,0,2,130,275,0,1,139,0,0.2,2,0,2,1
14 | 49,1,1,130,266,0,1,171,0,0.6,2,0,2,1
15 | 64,1,3,110,211,0,0,144,1,1.8,1,0,2,1
16 | 58,0,3,150,283,1,0,162,0,1,2,0,2,1
17 | 50,0,2,120,219,0,1,158,0,1.6,1,0,2,1
18 | 58,0,2,120,340,0,1,172,0,0,2,0,2,1
19 | 66,0,3,150,226,0,1,114,0,2.6,0,0,2,1
20 | 43,1,0,150,247,0,1,171,0,1.5,2,0,2,1
21 | 69,0,3,140,239,0,1,151,0,1.8,2,2,2,1
22 | 59,1,0,135,234,0,1,161,0,0.5,1,0,3,1
23 | 44,1,2,130,233,0,1,179,1,0.4,2,0,2,1
24 | 42,1,0,140,226,0,1,178,0,0,2,0,2,1
25 | 61,1,2,150,243,1,1,137,1,1,1,0,2,1
26 | 40,1,3,140,199,0,1,178,1,1.4,2,0,3,1
27 | 71,0,1,160,302,0,1,162,0,0.4,2,2,2,1
28 | 59,1,2,150,212,1,1,157,0,1.6,2,0,2,1
29 | 51,1,2,110,175,0,1,123,0,0.6,2,0,2,1
30 | 65,0,2,140,417,1,0,157,0,0.8,2,1,2,1
31 | 53,1,2,130,197,1,0,152,0,1.2,0,0,2,1
32 | 41,0,1,105,198,0,1,168,0,0,2,1,2,1
33 | 65,1,0,120,177,0,1,140,0,0.4,2,0,3,1
34 | 44,1,1,130,219,0,0,188,0,0,2,0,2,1
35 | 54,1,2,125,273,0,0,152,0,0.5,0,1,2,1
36 | 51,1,3,125,213,0,0,125,1,1.4,2,1,2,1
37 | 46,0,2,142,177,0,0,160,1,1.4,0,0,2,1
38 | 54,0,2,135,304,1,1,170,0,0,2,0,2,1
39 | 54,1,2,150,232,0,0,165,0,1.6,2,0,3,1
40 | 65,0,2,155,269,0,1,148,0,0.8,2,0,2,1
41 | 65,0,2,160,360,0,0,151,0,0.8,2,0,2,1
42 | 51,0,2,140,308,0,0,142,0,1.5,2,1,2,1
43 | 48,1,1,130,245,0,0,180,0,0.2,1,0,2,1
44 | 45,1,0,104,208,0,0,148,1,3,1,0,2,1
45 | 53,0,0,130,264,0,0,143,0,0.4,1,0,2,1
46 | 39,1,2,140,321,0,0,182,0,0,2,0,2,1
47 | 52,1,1,120,325,0,1,172,0,0.2,2,0,2,1
48 | 44,1,2,140,235,0,0,180,0,0,2,0,2,1
49 | 47,1,2,138,257,0,0,156,0,0,2,0,2,1
50 | 53,0,2,128,216,0,0,115,0,0,2,0,0,1
51 | 53,0,0,138,234,0,0,160,0,0,2,0,2,1
52 | 51,0,2,130,256,0,0,149,0,0.5,2,0,2,1
53 | 66,1,0,120,302,0,0,151,0,0.4,1,0,2,1
54 | 62,1,2,130,231,0,1,146,0,1.8,1,3,3,1
55 | 44,0,2,108,141,0,1,175,0,0.6,1,0,2,1
56 | 63,0,2,135,252,0,0,172,0,0,2,0,2,1
57 | 52,1,1,134,201,0,1,158,0,0.8,2,1,2,1
58 | 48,1,0,122,222,0,0,186,0,0,2,0,2,1
59 | 45,1,0,115,260,0,0,185,0,0,2,0,2,1
60 | 34,1,3,118,182,0,0,174,0,0,2,0,2,1
61 | 57,0,0,128,303,0,0,159,0,0,2,1,2,1
62 | 71,0,2,110,265,1,0,130,0,0,2,1,2,1
63 | 54,1,1,108,309,0,1,156,0,0,2,0,3,1
64 | 52,1,3,118,186,0,0,190,0,0,1,0,1,1
65 | 41,1,1,135,203,0,1,132,0,0,1,0,1,1
66 | 58,1,2,140,211,1,0,165,0,0,2,0,2,1
67 | 35,0,0,138,183,0,1,182,0,1.4,2,0,2,1
68 | 51,1,2,100,222,0,1,143,1,1.2,1,0,2,1
69 | 45,0,1,130,234,0,0,175,0,0.6,1,0,2,1
70 | 44,1,1,120,220,0,1,170,0,0,2,0,2,1
71 | 62,0,0,124,209,0,1,163,0,0,2,0,2,1
72 | 54,1,2,120,258,0,0,147,0,0.4,1,0,3,1
73 | 51,1,2,94,227,0,1,154,1,0,2,1,3,1
74 | 29,1,1,130,204,0,0,202,0,0,2,0,2,1
75 | 51,1,0,140,261,0,0,186,1,0,2,0,2,1
76 | 43,0,2,122,213,0,1,165,0,0.2,1,0,2,1
77 | 55,0,1,135,250,0,0,161,0,1.4,1,0,2,1
78 | 51,1,2,125,245,1,0,166,0,2.4,1,0,2,1
79 | 59,1,1,140,221,0,1,164,1,0,2,0,2,1
80 | 52,1,1,128,205,1,1,184,0,0,2,0,2,1
81 | 58,1,2,105,240,0,0,154,1,0.6,1,0,3,1
82 | 41,1,2,112,250,0,1,179,0,0,2,0,2,1
83 | 45,1,1,128,308,0,0,170,0,0,2,0,2,1
84 | 60,0,2,102,318,0,1,160,0,0,2,1,2,1
85 | 52,1,3,152,298,1,1,178,0,1.2,1,0,3,1
86 | 42,0,0,102,265,0,0,122,0,0.6,1,0,2,1
87 | 67,0,2,115,564,0,0,160,0,1.6,1,0,3,1
88 | 68,1,2,118,277,0,1,151,0,1,2,1,3,1
89 | 46,1,1,101,197,1,1,156,0,0,2,0,3,1
90 | 54,0,2,110,214,0,1,158,0,1.6,1,0,2,1
91 | 58,0,0,100,248,0,0,122,0,1,1,0,2,1
92 | 48,1,2,124,255,1,1,175,0,0,2,2,2,1
93 | 57,1,0,132,207,0,1,168,1,0,2,0,3,1
94 | 52,1,2,138,223,0,1,169,0,0,2,4,2,1
95 | 54,0,1,132,288,1,0,159,1,0,2,1,2,1
96 | 45,0,1,112,160,0,1,138,0,0,1,0,2,1
97 | 53,1,0,142,226,0,0,111,1,0,2,0,3,1
98 | 62,0,0,140,394,0,0,157,0,1.2,1,0,2,1
99 | 52,1,0,108,233,1,1,147,0,0.1,2,3,3,1
100 | 43,1,2,130,315,0,1,162,0,1.9,2,1,2,1
101 | 53,1,2,130,246,1,0,173,0,0,2,3,2,1
102 | 42,1,3,148,244,0,0,178,0,0.8,2,2,2,1
103 | 59,1,3,178,270,0,0,145,0,4.2,0,0,3,1
104 | 63,0,1,140,195,0,1,179,0,0,2,2,2,1
105 | 42,1,2,120,240,1,1,194,0,0.8,0,0,3,1
106 | 50,1,2,129,196,0,1,163,0,0,2,0,2,1
107 | 68,0,2,120,211,0,0,115,0,1.5,1,0,2,1
108 | 69,1,3,160,234,1,0,131,0,0.1,1,1,2,1
109 | 45,0,0,138,236,0,0,152,1,0.2,1,0,2,1
110 | 50,0,1,120,244,0,1,162,0,1.1,2,0,2,1
111 | 50,0,0,110,254,0,0,159,0,0,2,0,2,1
112 | 64,0,0,180,325,0,1,154,1,0,2,0,2,1
113 | 57,1,2,150,126,1,1,173,0,0.2,2,1,3,1
114 | 64,0,2,140,313,0,1,133,0,0.2,2,0,3,1
115 | 43,1,0,110,211,0,1,161,0,0,2,0,3,1
116 | 55,1,1,130,262,0,1,155,0,0,2,0,2,1
117 | 37,0,2,120,215,0,1,170,0,0,2,0,2,1
118 | 41,1,2,130,214,0,0,168,0,2,1,0,2,1
119 | 56,1,3,120,193,0,0,162,0,1.9,1,0,3,1
120 | 46,0,1,105,204,0,1,172,0,0,2,0,2,1
121 | 46,0,0,138,243,0,0,152,1,0,1,0,2,1
122 | 64,0,0,130,303,0,1,122,0,2,1,2,2,1
123 | 59,1,0,138,271,0,0,182,0,0,2,0,2,1
124 | 41,0,2,112,268,0,0,172,1,0,2,0,2,1
125 | 54,0,2,108,267,0,0,167,0,0,2,0,2,1
126 | 39,0,2,94,199,0,1,179,0,0,2,0,2,1
127 | 34,0,1,118,210,0,1,192,0,0.7,2,0,2,1
128 | 47,1,0,112,204,0,1,143,0,0.1,2,0,2,1
129 | 67,0,2,152,277,0,1,172,0,0,2,1,2,1
130 | 52,0,2,136,196,0,0,169,0,0.1,1,0,2,1
131 | 74,0,1,120,269,0,0,121,1,0.2,2,1,2,1
132 | 54,0,2,160,201,0,1,163,0,0,2,1,2,1
133 | 49,0,1,134,271,0,1,162,0,0,1,0,2,1
134 | 42,1,1,120,295,0,1,162,0,0,2,0,2,1
135 | 41,1,1,110,235,0,1,153,0,0,2,0,2,1
136 | 41,0,1,126,306,0,1,163,0,0,2,0,2,1
137 | 49,0,0,130,269,0,1,163,0,0,2,0,2,1
138 | 60,0,2,120,178,1,1,96,0,0,2,0,2,1
139 | 62,1,1,128,208,1,0,140,0,0,2,0,2,1
140 | 57,1,0,110,201,0,1,126,1,1.5,1,0,1,1
141 | 64,1,0,128,263,0,1,105,1,0.2,1,1,3,1
142 | 51,0,2,120,295,0,0,157,0,0.6,2,0,2,1
143 | 43,1,0,115,303,0,1,181,0,1.2,1,0,2,1
144 | 42,0,2,120,209,0,1,173,0,0,1,0,2,1
145 | 67,0,0,106,223,0,1,142,0,0.3,2,2,2,1
146 | 76,0,2,140,197,0,2,116,0,1.1,1,0,2,1
147 | 70,1,1,156,245,0,0,143,0,0,2,0,2,1
148 | 44,0,2,118,242,0,1,149,0,0.3,1,1,2,1
149 | 60,0,3,150,240,0,1,171,0,0.9,2,0,2,1
150 | 44,1,2,120,226,0,1,169,0,0,2,0,2,1
151 | 42,1,2,130,180,0,1,150,0,0,2,0,2,1
152 | 66,1,0,160,228,0,0,138,0,2.3,2,0,1,1
153 | 71,0,0,112,149,0,1,125,0,1.6,1,0,2,1
154 | 64,1,3,170,227,0,0,155,0,0.6,1,0,3,1
155 | 66,0,2,146,278,0,0,152,0,0,1,1,2,1
156 | 39,0,2,138,220,0,1,152,0,0,1,0,2,1
157 | 58,0,0,130,197,0,1,131,0,0.6,1,0,2,1
158 | 47,1,2,130,253,0,1,179,0,0,2,0,2,1
159 | 35,1,1,122,192,0,1,174,0,0,2,0,2,1
160 | 58,1,1,125,220,0,1,144,0,0.4,1,4,3,1
161 | 56,1,1,130,221,0,0,163,0,0,2,0,3,1
162 | 56,1,1,120,240,0,1,169,0,0,0,0,2,1
163 | 55,0,1,132,342,0,1,166,0,1.2,2,0,2,1
164 | 41,1,1,120,157,0,1,182,0,0,2,0,2,1
165 | 38,1,2,138,175,0,1,173,0,0,2,4,2,1
166 | 38,1,2,138,175,0,1,173,0,0,2,4,2,1
167 | 67,1,0,160,286,0,0,108,1,1.5,1,3,2,0
168 | 67,1,0,120,229,0,0,129,1,2.6,1,2,3,0
169 | 62,0,0,140,268,0,0,160,0,3.6,0,2,2,0
170 | 63,1,0,130,254,0,0,147,0,1.4,1,1,3,0
171 | 53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
172 | 56,1,2,130,256,1,0,142,1,0.6,1,1,1,0
173 | 48,1,1,110,229,0,1,168,0,1,0,0,3,0
174 | 58,1,1,120,284,0,0,160,0,1.8,1,0,2,0
175 | 58,1,2,132,224,0,0,173,0,3.2,2,2,3,0
176 | 60,1,0,130,206,0,0,132,1,2.4,1,2,3,0
177 | 40,1,0,110,167,0,0,114,1,2,1,0,3,0
178 | 60,1,0,117,230,1,1,160,1,1.4,2,2,3,0
179 | 64,1,2,140,335,0,1,158,0,0,2,0,2,0
180 | 43,1,0,120,177,0,0,120,1,2.5,1,0,3,0
181 | 57,1,0,150,276,0,0,112,1,0.6,1,1,1,0
182 | 55,1,0,132,353,0,1,132,1,1.2,1,1,3,0
183 | 65,0,0,150,225,0,0,114,0,1,1,3,3,0
184 | 61,0,0,130,330,0,0,169,0,0,2,0,2,0
185 | 58,1,2,112,230,0,0,165,0,2.5,1,1,3,0
186 | 50,1,0,150,243,0,0,128,0,2.6,1,0,3,0
187 | 44,1,0,112,290,0,0,153,0,0,2,1,2,0
188 | 60,1,0,130,253,0,1,144,1,1.4,2,1,3,0
189 | 54,1,0,124,266,0,0,109,1,2.2,1,1,3,0
190 | 50,1,2,140,233,0,1,163,0,0.6,1,1,3,0
191 | 41,1,0,110,172,0,0,158,0,0,2,0,3,0
192 | 51,0,0,130,305,0,1,142,1,1.2,1,0,3,0
193 | 58,1,0,128,216,0,0,131,1,2.2,1,3,3,0
194 | 54,1,0,120,188,0,1,113,0,1.4,1,1,3,0
195 | 60,1,0,145,282,0,0,142,1,2.8,1,2,3,0
196 | 60,1,2,140,185,0,0,155,0,3,1,0,2,0
197 | 59,1,0,170,326,0,0,140,1,3.4,0,0,3,0
198 | 46,1,2,150,231,0,1,147,0,3.6,1,0,2,0
199 | 67,1,0,125,254,1,1,163,0,0.2,1,2,3,0
200 | 62,1,0,120,267,0,1,99,1,1.8,1,2,3,0
201 | 65,1,0,110,248,0,0,158,0,0.6,2,2,1,0
202 | 44,1,0,110,197,0,0,177,0,0,2,1,2,0
203 | 60,1,0,125,258,0,0,141,1,2.8,1,1,3,0
204 | 58,1,0,150,270,0,0,111,1,0.8,2,0,3,0
205 | 68,1,2,180,274,1,0,150,1,1.6,1,0,3,0
206 | 62,0,0,160,164,0,0,145,0,6.2,0,3,3,0
207 | 52,1,0,128,255,0,1,161,1,0,2,1,3,0
208 | 59,1,0,110,239,0,0,142,1,1.2,1,1,3,0
209 | 60,0,0,150,258,0,0,157,0,2.6,1,2,3,0
210 | 49,1,2,120,188,0,1,139,0,2,1,3,3,0
211 | 59,1,0,140,177,0,1,162,1,0,2,1,3,0
212 | 57,1,2,128,229,0,0,150,0,0.4,1,1,3,0
213 | 61,1,0,120,260,0,1,140,1,3.6,1,1,3,0
214 | 39,1,0,118,219,0,1,140,0,1.2,1,0,3,0
215 | 61,0,0,145,307,0,0,146,1,1,1,0,3,0
216 | 56,1,0,125,249,1,0,144,1,1.2,1,1,2,0
217 | 43,0,0,132,341,1,0,136,1,3,1,0,3,0
218 | 62,0,2,130,263,0,1,97,0,1.2,1,1,3,0
219 | 63,1,0,130,330,1,0,132,1,1.8,2,3,3,0
220 | 65,1,0,135,254,0,0,127,0,2.8,1,1,3,0
221 | 48,1,0,130,256,1,0,150,1,0,2,2,3,0
222 | 63,0,0,150,407,0,0,154,0,4,1,3,3,0
223 | 55,1,0,140,217,0,1,111,1,5.6,0,0,3,0
224 | 65,1,3,138,282,1,0,174,0,1.4,1,1,2,0
225 | 56,0,0,200,288,1,0,133,1,4,0,2,3,0
226 | 54,1,0,110,239,0,1,126,1,2.8,1,1,3,0
227 | 70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
228 | 62,1,1,120,281,0,0,103,0,1.4,1,1,3,0
229 | 35,1,0,120,198,0,1,130,1,1.6,1,0,3,0
230 | 59,1,3,170,288,0,0,159,0,0.2,1,0,3,0
231 | 64,1,2,125,309,0,1,131,1,1.8,1,0,3,0
232 | 47,1,2,108,243,0,1,152,0,0,2,0,2,0
233 | 57,1,0,165,289,1,0,124,0,1,1,3,3,0
234 | 55,1,0,160,289,0,0,145,1,0.8,1,1,3,0
235 | 64,1,0,120,246,0,0,96,1,2.2,0,1,2,0
236 | 70,1,0,130,322,0,0,109,0,2.4,1,3,2,0
237 | 51,1,0,140,299,0,1,173,1,1.6,2,0,3,0
238 | 58,1,0,125,300,0,0,171,0,0,2,2,3,0
239 | 60,1,0,140,293,0,0,170,0,1.2,1,2,3,0
240 | 77,1,0,125,304,0,0,162,1,0,2,3,2,0
241 | 35,1,0,126,282,0,0,156,1,0,2,0,3,0
242 | 70,1,2,160,269,0,1,112,1,2.9,1,1,3,0
243 | 59,0,0,174,249,0,1,143,1,0,1,0,2,0
244 | 64,1,0,145,212,0,0,132,0,2,1,2,1,0
245 | 57,1,0,152,274,0,1,88,1,1.2,1,1,3,0
246 | 56,1,0,132,184,0,0,105,1,2.1,1,1,1,0
247 | 48,1,0,124,274,0,0,166,0,0.5,1,0,3,0
248 | 56,0,0,134,409,0,0,150,1,1.9,1,2,3,0
249 | 66,1,1,160,246,0,1,120,1,0,1,3,1,0
250 | 54,1,1,192,283,0,0,195,0,0,2,1,3,0
251 | 69,1,2,140,254,0,0,146,0,2,1,3,3,0
252 | 51,1,0,140,298,0,1,122,1,4.2,1,3,3,0
253 | 43,1,0,132,247,1,0,143,1,0.1,1,4,3,0
254 | 62,0,0,138,294,1,1,106,0,1.9,1,3,2,0
255 | 67,1,0,100,299,0,0,125,1,0.9,1,2,2,0
256 | 59,1,3,160,273,0,0,125,0,0,2,0,2,0
257 | 45,1,0,142,309,0,0,147,1,0,1,3,3,0
258 | 58,1,0,128,259,0,0,130,1,3,1,2,3,0
259 | 50,1,0,144,200,0,0,126,1,0.9,1,0,3,0
260 | 62,0,0,150,244,0,1,154,1,1.4,1,0,2,0
261 | 38,1,3,120,231,0,1,182,1,3.8,1,0,3,0
262 | 66,0,0,178,228,1,1,165,1,1,1,2,3,0
263 | 52,1,0,112,230,0,1,160,0,0,2,1,2,0
264 | 53,1,0,123,282,0,1,95,1,2,1,2,3,0
265 | 63,0,0,108,269,0,1,169,1,1.8,1,2,2,0
266 | 54,1,0,110,206,0,0,108,1,0,1,1,2,0
267 | 66,1,0,112,212,0,0,132,1,0.1,2,1,2,0
268 | 55,0,0,180,327,0,2,117,1,3.4,1,0,2,0
269 | 49,1,2,118,149,0,0,126,0,0.8,2,3,2,0
270 | 54,1,0,122,286,0,0,116,1,3.2,1,2,2,0
271 | 56,1,0,130,283,1,0,103,1,1.6,0,0,3,0
272 | 46,1,0,120,249,0,0,144,0,0.8,2,0,3,0
273 | 61,1,3,134,234,0,1,145,0,2.6,1,2,2,0
274 | 67,1,0,120,237,0,1,71,0,1,1,0,2,0
275 | 58,1,0,100,234,0,1,156,0,0.1,2,1,3,0
276 | 47,1,0,110,275,0,0,118,1,1,1,1,2,0
277 | 52,1,0,125,212,0,1,168,0,1,2,2,3,0
278 | 58,1,0,146,218,0,1,105,0,2,1,1,3,0
279 | 57,1,1,124,261,0,1,141,0,0.3,2,0,3,0
280 | 58,0,1,136,319,1,0,152,0,0,2,2,2,0
281 | 61,1,0,138,166,0,0,125,1,3.6,1,1,2,0
282 | 42,1,0,136,315,0,1,125,1,1.8,1,0,1,0
283 | 52,1,0,128,204,1,1,156,1,1,1,0,0,0
284 | 59,1,2,126,218,1,1,134,0,2.2,1,1,1,0
285 | 40,1,0,152,223,0,1,181,0,0,2,0,3,0
286 | 61,1,0,140,207,0,0,138,1,1.9,2,1,3,0
287 | 46,1,0,140,311,0,1,120,1,1.8,1,2,3,0
288 | 59,1,3,134,204,0,1,162,0,0.8,2,2,2,0
289 | 57,1,1,154,232,0,0,164,0,0,2,1,2,0
290 | 57,1,0,110,335,0,1,143,1,3,1,1,3,0
291 | 55,0,0,128,205,0,2,130,1,2,1,1,3,0
292 | 61,1,0,148,203,0,1,161,0,0,2,1,3,0
293 | 58,1,0,114,318,0,2,140,0,4.4,0,3,1,0
294 | 58,0,0,170,225,1,0,146,1,2.8,1,2,1,0
295 | 67,1,2,152,212,0,0,150,0,0.8,1,0,3,0
296 | 44,1,0,120,169,0,1,144,1,2.8,0,0,1,0
297 | 63,1,0,140,187,0,0,144,1,4,2,2,3,0
298 | 63,0,0,124,197,0,1,136,1,0,1,0,2,0
299 | 59,1,0,164,176,1,0,90,0,1,1,2,1,0
300 | 57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
301 | 45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
302 | 68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
303 | 57,1,0,130,131,0,1,115,1,1.2,1,1,3,0
304 | 57,0,1,130,236,0,0,174,0,0,1,1,2,0
305 |
--------------------------------------------------------------------------------
/featurewiz/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | ################################################################################
3 | # featurewiz - advanced feature engineering and best feature selection in a single line of code
4 | # Python v3.6+
5 | # Created by Ram Seshadri
6 | # Licensed under Apache License v2
7 | ################################################################################
8 | # Version
9 | from .__version__ import __version__
10 | from .featurewiz import featurewiz
11 | from .featurewiz import FE_split_one_field_into_many, FE_add_groupby_features_aggregated_to_dataframe
12 | from .featurewiz import FE_start_end_date_time_features
13 | from .featurewiz import classify_features
14 | from .featurewiz import classify_columns,FE_combine_rare_categories
15 | from .featurewiz import FE_count_rows_for_all_columns_by_group
16 | from .featurewiz import FE_add_age_by_date_col, FE_split_add_column, FE_get_latest_values_based_on_date_column
17 | from .featurewiz import FE_capping_outliers_beyond_IQR_Range
18 | from .featurewiz import EDA_classify_and_return_cols_by_type, EDA_classify_features_for_deep_learning
19 | from .featurewiz import FE_create_categorical_feature_crosses, EDA_find_skewed_variables
20 | from .featurewiz import FE_find_and_cap_outliers, EDA_find_outliers
21 | from .featurewiz import split_data_n_ways, FE_concatenate_multiple_columns
22 | from .featurewiz import FE_discretize_numeric_variables, reduce_mem_usage
23 | from .ml_models import simple_XGBoost_model, simple_LightGBM_model, complex_XGBoost_model
24 | from .ml_models import complex_LightGBM_model,data_transform, MultiClassSVM
25 | from .ml_models import IterativeBestClassifier, IterativeDoubleClassifier, IterativeSearchClassifier
26 | from .my_encoders import My_LabelEncoder, Groupby_Aggregator, My_LabelEncoder_Pipe, Ranking_Aggregator, DateTime_Transformer
27 | from .my_encoders import Rare_Class_Combiner, Rare_Class_Combiner_Pipe, FE_create_time_series_features, Binning_Transformer
28 | from .my_encoders import Column_Names_Transformer, FE_convert_all_object_columns_to_numeric, Numeric_Transformer
29 | from .my_encoders import TS_Lagging_Transformer, TS_Fourier_Transformer, TS_Trend_Seasonality_Transformer
30 | from .my_encoders import TS_Lagging_Transformer_Pipe, TS_Fourier_Transformer_Pipe
31 | from lazytransform import LazyTransformer, SuloRegressor, SuloClassifier, print_regression_metrics, print_classification_metrics
32 | from lazytransform import print_regression_model_stats, YTransformer, print_sulo_accuracy
33 | from .sulov_method import FE_remove_variables_using_SULOV_method
34 | from .featurewiz import FE_transform_numeric_columns_to_bins, FE_create_interaction_vars
35 | from .stacking_models import Stacking_Classifier, Blending_Regressor, Stacking_Regressor, stacking_models_list
36 | from .stacking_models import StackingClassifier_Multi, analyze_problem_type_array, get_class_distribution
37 | from .auto_encoders import DenoisingAutoEncoder, VariationalAutoEncoder, CNNAutoEncoder
38 | from .auto_encoders import GAN, GANAugmenter
39 | from .featurewiz import EDA_binning_numeric_column_displaying_bins, FE_calculate_duration_from_timestamp
40 | from .featurewiz import FE_convert_mixed_datatypes_to_string, FE_drop_rows_with_infinity
41 | from .featurewiz import EDA_find_remove_columns_with_infinity, FE_split_list_into_columns
42 | from .featurewiz import EDA_remove_special_chars, FE_remove_commas_in_numerics
43 | from .featurewiz import EDA_randomly_select_rows_from_dataframe, remove_duplicate_cols_in_dataset
44 | from .featurewiz import cross_val_model_predictions
45 | from .blagging import BlaggingClassifier
46 | from .featurewiz import FeatureWiz
47 | ################################################################################
48 | if __name__ == "__main__":
49 | module_type = 'Running'
50 | else:
51 | module_type = 'Imported'
52 | version_number = __version__
53 | print("""%s featurewiz %s. Use the following syntax:
54 | >>> wiz = FeatureWiz(feature_engg = '', nrows=None, transform_target=True,
55 | category_encoders="auto", auto_encoders='VAE', ae_options={},
56 | add_missing=False, imbalanced=False, verbose=0)
57 | >>> X_train_selected, y_train = wiz.fit_transform(X_train, y_train)
58 | >>> X_test_selected = wiz.transform(X_test)
59 | >>> selected_features = wiz.features
60 | """ %(module_type, version_number))
61 | ################################################################################
62 |
--------------------------------------------------------------------------------
/featurewiz/__version__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """Specifies the version of the featurewiz package."""
3 |
4 | __title__ = "featurewiz"
5 | __author__ = "Ram Seshadri"
6 | __description__ = "Advanced Feature Engineering and Feature Selection for any data set, any size"
7 | __url__ = "https://github.com/Auto_ViML/featurewiz.git"
8 | __version__ = "0.6.1"
9 | __license__ = "Apache License 2.0"
10 | __copyright__ = "2020-23 Google"
11 |
--------------------------------------------------------------------------------
/featurewiz/classify_method.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import random
4 | np.random.seed(99)
5 | random.seed(42)
6 | ################################################################################
7 | #### The warnings from Sklearn are so annoying that I have to shut it off #######
8 | import warnings
9 | warnings.filterwarnings("ignore")
10 | from sklearn.exceptions import DataConversionWarning
11 | warnings.filterwarnings(action='ignore', category=DataConversionWarning)
12 | def warn(*args, **kwargs):
13 | pass
14 | warnings.warn = warn
15 | import logging
16 | ####################################################################################
17 | import pdb
18 | from functools import reduce
19 | import copy
20 | import time
21 | #################################################################################
22 | def left_subtract(l1,l2):
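    """Return the items of l1 that are not present in l2, preserving their order in l1."""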
23 | lst = []
24 | for i in l1:
25 | if i not in l2:
26 | lst.append(i)
27 | return lst
28 | #################################################################################
29 | import copy
30 | def EDA_find_remove_columns_with_infinity(df, remove=False):
31 | """
32 |     This function finds all columns in a dataframe that have infinite values (np.inf or -np.inf).
33 | It returns a list of column names. If the list is empty, it means no columns were found.
34 | If remove flag is set, then it returns a smaller dataframe with inf columns removed.
35 | """
36 | nums = df.select_dtypes(include='number').columns.tolist()
37 | dfx = df[nums]
38 | sum_rows = np.isinf(dfx).values.sum()
39 | add_cols = list(dfx.columns.to_series()[np.isinf(dfx).any()])
40 | if sum_rows > 0:
41 |         print('    there are %d infinite values spread across %d column(s) in the dataset...' %(sum_rows,len(add_cols)))
42 | if remove:
43 | ### here you need to use df since the whole dataset is involved ###
44 | nocols = [x for x in df.columns if x not in add_cols]
45 | print(" Shape of dataset before %s and after %s removing columns with infinity" %(df.shape,(df[nocols].shape,)))
46 | return df[nocols]
47 | else:
48 | ## this will be a list of columns with infinity ####
49 | return add_cols
50 | else:
51 | ## this will be an empty list if there are no columns with infinity
52 | return add_cols
53 | ####################################################################################
54 | def classify_columns(df_preds, verbose=0):
55 | """
56 |     This function performs exploratory data analysis (EDA): it classifies every column of a dataframe by type.
57 |     ######################################################################################
58 |     Takes a dataframe containing only predictors to be classified into various types.
59 |     DO NOT SEND IN A TARGET COLUMN since it will try to include that into various columns.
60 |     Returns a dictionary mapping each variable class to its list of columns, such as numeric,
61 |     categorical, date or id column, boolean, nlp, discrete_string and cols to delete...
62 |     ####### Returns a dictionary with 10 kinds of vars like the following: # continuous_vars,int_vars
63 |     # cat_vars,factor_vars, bool_vars,discrete_string_vars,nlp_vars,date_vars,id_vars,cols_delete
64 | """
65 | train = copy.deepcopy(df_preds)
66 |     #### If there are 30 chars or more in a discrete_string_var, it is then considered an NLP variable
67 | max_nlp_char_size = 30
68 | max_cols_to_print = 30
69 | print('#######################################################################################')
70 | print('######################## C L A S S I F Y I N G V A R I A B L E S ####################')
71 | print('#######################################################################################')
72 | if verbose:
73 | print('Classifying variables in data set...')
74 |     #### Cat_Limit defines the max number of categories a column can have to be called a categorical column
75 | cat_limit = 35
76 | float_limit = 15 #### Make this limit low so that float variables below this limit become cat vars ###
77 | def add(a,b):
78 | return a+b
79 | sum_all_cols = dict()
80 | orig_cols_total = train.shape[1]
81 | #Types of columns
82 | cols_delete = []
83 | cols_delete = [col for col in list(train) if (len(train[col].value_counts()) == 1
84 | ) | (train[col].isnull().sum()/len(train) >= 0.90)]
85 | inf_cols = EDA_find_remove_columns_with_infinity(train)
86 | mixed_cols = [x for x in list(train) if len(train[x].dropna().apply(type).value_counts()) > 1]
87 | if len(mixed_cols) > 0:
88 | print(' Removing %s column(s) due to mixed data type detected...' %mixed_cols)
89 | cols_delete += mixed_cols
90 | cols_delete += inf_cols
91 | train = train[left_subtract(list(train),cols_delete)]
92 | var_df = pd.Series(dict(train.dtypes)).reset_index(drop=False).rename(
93 | columns={0:'type_of_column'})
94 | sum_all_cols['cols_delete'] = cols_delete
95 |
96 | var_df['bool'] = var_df.apply(lambda x: 1 if x['type_of_column'] in ['bool','object']
97 | and len(train[x['index']].value_counts()) == 2 else 0, axis=1)
98 | string_bool_vars = list(var_df[(var_df['bool'] ==1)]['index'])
99 | sum_all_cols['string_bool_vars'] = string_bool_vars
100 | var_df['num_bool'] = var_df.apply(lambda x: 1 if x['type_of_column'] in [np.uint8,
101 | np.uint16, np.uint32, np.uint64,
102 | 'int8','int16','int32','int64',
103 | 'float16','float32','float64'] and len(
104 | train[x['index']].value_counts()) == 2 else 0, axis=1)
105 | num_bool_vars = list(var_df[(var_df['num_bool'] ==1)]['index'])
106 | sum_all_cols['num_bool_vars'] = num_bool_vars
107 | ###### This is where we take all Object vars and split them into diff kinds ###
108 | discrete_or_nlp = var_df.apply(lambda x: 1 if x['type_of_column'] in ['object'] and x[
109 | 'index'] not in string_bool_vars+cols_delete else 0,axis=1)
110 | ######### This is where we figure out whether a string var is nlp or discrete_string var ###
111 | var_df['nlp_strings'] = 0
112 | var_df['discrete_strings'] = 0
113 | var_df['cat'] = 0
114 | var_df['id_col'] = 0
115 | discrete_or_nlp_vars = var_df.loc[discrete_or_nlp==1]['index'].values.tolist()
116 | copy_discrete_or_nlp_vars = copy.deepcopy(discrete_or_nlp_vars)
117 | if len(discrete_or_nlp_vars) > 0:
118 | for col in copy_discrete_or_nlp_vars:
119 | #### first fill empty or missing vals since it will blowup ###
120 | ### Remember that fillna only works at the dataframe level!
121 | train[[col]] = train[[col]].fillna(' ')
122 | if train[col].map(lambda x: len(x) if type(x)==str else 0).max(
123 | ) >= 50 and len(train[col].value_counts()
124 | ) >= int(0.9*len(train)) and col not in string_bool_vars:
125 | var_df.loc[var_df['index']==col,'nlp_strings'] = 1
126 | elif train[col].map(lambda x: len(x) if type(x)==str else 0).mean(
127 | ) >= max_nlp_char_size and train[col].map(lambda x: len(x) if type(x)==str else 0).max(
128 | ) < 50 and len(train[col].value_counts()
129 | ) <= int(0.9*len(train)) and col not in string_bool_vars:
130 | var_df.loc[var_df['index']==col,'discrete_strings'] = 1
131 | elif len(train[col].value_counts()) > cat_limit and len(train[col].value_counts()
132 | ) <= int(0.9*len(train)) and col not in string_bool_vars:
133 | var_df.loc[var_df['index']==col,'discrete_strings'] = 1
134 | elif len(train[col].value_counts()) > cat_limit and len(train[col].value_counts()
135 | ) == len(train) and col not in string_bool_vars:
136 | var_df.loc[var_df['index']==col,'id_col'] = 1
137 | else:
138 | var_df.loc[var_df['index']==col,'cat'] = 1
139 | nlp_vars = list(var_df[(var_df['nlp_strings'] ==1)]['index'])
140 | sum_all_cols['nlp_vars'] = nlp_vars
141 | discrete_string_vars = list(var_df[(var_df['discrete_strings'] ==1) ]['index'])
142 | sum_all_cols['discrete_string_vars'] = discrete_string_vars
143 | ###### This happens only if a string column happens to be an ID column #######
144 | #### DO NOT Add this to ID_VARS yet. It will be done later.. Dont change it easily...
145 | #### Category DTYPE vars are very special = they can be left as is and not disturbed in Python. ###
146 | var_df['dcat'] = var_df.apply(lambda x: 1 if str(x['type_of_column'])=='category' else 0,
147 | axis=1)
148 | factor_vars = list(var_df[(var_df['dcat'] ==1)]['index'])
149 | sum_all_cols['factor_vars'] = factor_vars
150 | ########################################################################
151 | date_or_id = var_df.apply(lambda x: 1 if x['type_of_column'] in [np.uint8,
152 | np.uint16, np.uint32, np.uint64,
153 | 'int8','int16',
154 | 'int32','int64'] and x[
155 | 'index'] not in string_bool_vars+num_bool_vars+discrete_string_vars+nlp_vars else 0,
156 | axis=1)
157 | ######### This is where we figure out whether a numeric col is date or id variable ###
158 | var_df['int'] = 0
159 | var_df['date_time'] = 0
160 | ### if a particular column is date-time type, now set it as a date time variable ##
161 |     var_df['date_time'] = var_df.apply(lambda x: 1 if x['type_of_column'] in ['<M8[ns]','datetime64[ns]'] and x[
162 |         'index'] not in string_bool_vars+num_bool_vars+discrete_string_vars+nlp_vars else 0,
163 |                                         axis=1)
164 |     ### this is where we check whether those int columns are really date or id variables ##
165 |     if len(var_df.loc[date_or_id==1]) != 0:
166 |         for col in var_df.loc[date_or_id==1]['index'].values.tolist():
167 |             if len(train[col].value_counts()) == len(train):
168 |                 if train[col].min() < 1900 or train[col].max() > 2050:
169 | var_df.loc[var_df['index']==col,'id_col'] = 1
170 | else:
171 | try:
172 | pd.to_datetime(train[col],infer_datetime_format=True)
173 | var_df.loc[var_df['index']==col,'date_time'] = 1
174 | except:
175 | var_df.loc[var_df['index']==col,'id_col'] = 1
176 | else:
177 | if train[col].min() < 1900 or train[col].max() > 2050:
178 | if col not in num_bool_vars:
179 | var_df.loc[var_df['index']==col,'int'] = 1
180 | else:
181 | try:
182 | pd.to_datetime(train[col],infer_datetime_format=True)
183 | var_df.loc[var_df['index']==col,'date_time'] = 1
184 | except:
185 | if col not in num_bool_vars:
186 | var_df.loc[var_df['index']==col,'int'] = 1
187 | else:
188 | pass
189 | int_vars = list(var_df[(var_df['int'] ==1)]['index'])
190 | date_vars = list(var_df[(var_df['date_time'] == 1)]['index'])
191 | id_vars = list(var_df[(var_df['id_col'] == 1)]['index'])
192 | sum_all_cols['int_vars'] = int_vars
193 | copy_date_vars = copy.deepcopy(date_vars)
194 | for date_var in copy_date_vars:
195 |         #### This test is to make sure date vars are actually date vars
196 | try:
197 | pd.to_datetime(train[date_var],infer_datetime_format=True)
198 | except:
199 | ##### if not a date var, then just add it to delete it from processing
200 | cols_delete.append(date_var)
201 | date_vars.remove(date_var)
202 | sum_all_cols['date_vars'] = date_vars
203 | sum_all_cols['id_vars'] = id_vars
204 | sum_all_cols['cols_delete'] = cols_delete
205 | ## This is an EXTREMELY complicated logic for cat vars. Don't change it unless you test it many times!
206 | var_df['numeric'] = 0
207 | float_or_cat = var_df.apply(lambda x: 1 if x['type_of_column'] in ['float16',
208 | 'float32','float64'] else 0,
209 | axis=1)
210 | ####### We need to make sure there are no categorical vars in float #######
211 | if len(var_df.loc[float_or_cat == 1]) > 0:
212 | for col in var_df.loc[float_or_cat == 1]['index'].values.tolist():
213 | if len(train[col].value_counts()) > 2 and len(train[col].value_counts()
214 | ) <= float_limit and len(train[col].value_counts()) <= len(train):
215 | var_df.loc[var_df['index']==col,'cat'] = 1
216 | else:
217 | if col not in (num_bool_vars + factor_vars):
218 | var_df.loc[var_df['index']==col,'numeric'] = 1
219 | cat_vars = list(var_df[(var_df['cat'] ==1)]['index'])
220 | continuous_vars = list(var_df[(var_df['numeric'] ==1)]['index'])
221 |
222 | ######## V E R Y I M P O R T A N T ###################################################
223 | cat_vars_copy = copy.deepcopy(factor_vars)
224 | for cat in cat_vars_copy:
225 | if df_preds[cat].dtype==float:
226 | continuous_vars.append(cat)
227 | factor_vars.remove(cat)
228 | var_df.loc[var_df['index']==cat,'dcat'] = 0
229 | var_df.loc[var_df['index']==cat,'numeric'] = 1
230 | elif len(df_preds[cat].value_counts()) == df_preds.shape[0]:
231 | id_vars.append(cat)
232 | factor_vars.remove(cat)
233 | var_df.loc[var_df['index']==cat,'dcat'] = 0
234 | var_df.loc[var_df['index']==cat,'id_col'] = 1
235 |
236 | sum_all_cols['factor_vars'] = factor_vars
237 |     ##### There are a couple of extra tests you need to do to remove aberrations in cat_vars ###
238 | cat_vars_copy = copy.deepcopy(cat_vars)
239 | for cat in cat_vars_copy:
240 | if df_preds[cat].dtype==float:
241 | continuous_vars.append(cat)
242 | cat_vars.remove(cat)
243 | var_df.loc[var_df['index']==cat,'cat'] = 0
244 | var_df.loc[var_df['index']==cat,'numeric'] = 1
245 | elif len(df_preds[cat].value_counts()) == df_preds.shape[0]:
246 | id_vars.append(cat)
247 | cat_vars.remove(cat)
248 | var_df.loc[var_df['index']==cat,'cat'] = 0
249 | var_df.loc[var_df['index']==cat,'id_col'] = 1
250 | sum_all_cols['cat_vars'] = cat_vars
251 | sum_all_cols['continuous_vars'] = continuous_vars
252 | sum_all_cols['id_vars'] = id_vars
253 |     ###### This is where you consolidate the numbers ###########
254 | var_dict_sum = dict(zip(var_df.values[:,0], var_df.values[:,2:].sum(1)))
255 | for col, sumval in var_dict_sum.items():
256 | if sumval == 0:
257 | print('%s of type=%s is not classified' %(col,train[col].dtype))
258 | elif sumval > 1:
259 |             print('%s of type=%s is classified into more than one type' %(col,train[col].dtype))
260 | else:
261 | pass
262 | ##### If there are more than 1000 unique values, then add it to NLP vars ###
263 | copy_discretes = copy.deepcopy(discrete_string_vars)
264 | for each_discrete in copy_discretes:
265 | if train[each_discrete].nunique() >= 1000:
266 | nlp_vars.append(each_discrete)
267 | discrete_string_vars.remove(each_discrete)
268 | elif train[each_discrete].nunique() > 100 and train[each_discrete].nunique() < 1000:
269 | pass
270 | else:
271 | ### If it is less than 100 unique values, then make it categorical var
272 | cat_vars.append(each_discrete)
273 | discrete_string_vars.remove(each_discrete)
274 | sum_all_cols['discrete_string_vars'] = discrete_string_vars
275 | sum_all_cols['cat_vars'] = cat_vars
276 | sum_all_cols['nlp_vars'] = nlp_vars
277 | ############### This is where you print all the types of variables ##############
278 | ####### Returns 8 vars in the following order: continuous_vars,int_vars,cat_vars,
279 | ### string_bool_vars,discrete_string_vars,nlp_vars,date_or_id_vars,cols_delete
280 | if verbose == 1:
281 | print(" Number of Numeric Columns = ", len(continuous_vars))
282 | print(" Number of Integer-Categorical Columns = ", len(int_vars))
283 | print(" Number of String-Categorical Columns = ", len(cat_vars))
284 | print(" Number of Factor-Categorical Columns = ", len(factor_vars))
285 | print(" Number of String-Boolean Columns = ", len(string_bool_vars))
286 | print(" Number of Numeric-Boolean Columns = ", len(num_bool_vars))
287 | print(" Number of Discrete String Columns = ", len(discrete_string_vars))
288 | print(" Number of NLP String Columns = ", len(nlp_vars))
289 | print(" Number of Date Time Columns = ", len(date_vars))
290 | print(" Number of ID Columns = ", len(id_vars))
291 | print(" Number of Columns to Delete = ", len(cols_delete))
292 | if verbose == 2:
293 |         print('    Printing up to %d columns max in each category:' %max_cols_to_print)
294 | print(" Numeric Columns : %s" %continuous_vars[:max_cols_to_print])
295 | print(" Integer-Categorical Columns: %s" %int_vars[:max_cols_to_print])
296 | print(" String-Categorical Columns: %s" %cat_vars[:max_cols_to_print])
297 | print(" Factor-Categorical Columns: %s" %factor_vars[:max_cols_to_print])
298 | print(" String-Boolean Columns: %s" %string_bool_vars[:max_cols_to_print])
299 | print(" Numeric-Boolean Columns: %s" %num_bool_vars[:max_cols_to_print])
300 | print(" Discrete String Columns: %s" %discrete_string_vars[:max_cols_to_print])
301 | print(" NLP text Columns: %s" %nlp_vars[:max_cols_to_print])
302 | print(" Date Time Columns: %s" %date_vars[:max_cols_to_print])
303 | print(" ID Columns: %s" %id_vars[:max_cols_to_print])
304 | print(" Columns that will not be considered in modeling: %s" %cols_delete[:max_cols_to_print])
305 | ##### now collect all the column types and column names into a single dictionary to return!
306 |
307 | len_sum_all_cols = reduce(add,[len(v) for v in sum_all_cols.values()])
308 | if len_sum_all_cols == orig_cols_total:
309 | if verbose:
310 | print(' %d Predictors classified...' %orig_cols_total)
311 | #print(' This does not include the Target column(s)')
312 | else:
313 | print('No of columns classified %d does not match %d total cols. Continuing...' %(
314 | len_sum_all_cols, orig_cols_total))
315 | ls = sum_all_cols.values()
316 | flat_list = [item for sublist in ls for item in sublist]
317 | if len(left_subtract(list(train),flat_list)) > 0:
318 | print(' Error: some columns missing from classification are: %s' %left_subtract(list(train),flat_list))
319 | return sum_all_cols
320 | ####################################################################################
321 |
--------------------------------------------------------------------------------
/featurewiz/databunch.py:
--------------------------------------------------------------------------------
1 | ###############################################################################
2 | # MIT License
3 | #
4 | # Copyright (c) 2020 Alex Lekov
5 | #
6 | # Permission is hereby granted, free of charge, to any person obtaining a copy
7 | # of this software and associated documentation files (the "Software"), to deal
8 | # in the Software without restriction, including without limitation the rights
9 | # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
10 | # copies of the Software, and to permit persons to whom the Software is
11 | # furnished to do so, subject to the following conditions:
12 | #
13 | # The above copyright notice and this permission notice shall be included in all
14 | # copies or substantial portions of the Software.
15 | #
16 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
17 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
18 | # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
19 | # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
20 | # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
21 | # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
22 | # SOFTWARE.
23 | ###############################################################################
24 | ##### This amazing Library was created by Alex Lekov: Many Thanks to Alex! ###
25 | ##### https://github.com/Alex-Lekov/AutoML_Alex ###
26 | ###############################################################################
27 | import pandas as pd
28 | import numpy as np
29 | from itertools import combinations
30 | from sklearn.preprocessing import StandardScaler
31 | from category_encoders import HashingEncoder, SumEncoder, PolynomialEncoder, BackwardDifferenceEncoder
32 | from category_encoders import OneHotEncoder, HelmertEncoder, OrdinalEncoder, CountEncoder, BaseNEncoder
33 | from category_encoders import TargetEncoder, CatBoostEncoder, WOEEncoder, JamesSteinEncoder
34 | from category_encoders.glmm import GLMMEncoder
35 | from sklearn.preprocessing import LabelEncoder
36 | from category_encoders.wrapper import PolynomialWrapper
37 | from .encoders import FrequencyEncoder
38 | from . import settings
39 |
40 | import pdb
41 | # disable chained assignments
42 | pd.options.mode.chained_assignment = None
43 | import copy
44 | import dask
45 | import dask.dataframe as dd
46 |
47 | class DataBunch(object):
48 | """
49 |     Class for storing, cleaning and processing your dataset
50 | """
51 | def __init__(self,
52 | X_train=None,
53 | y_train=None,
54 | X_test=None,
55 | y_test=None,
56 | cat_features=None,
57 | clean_and_encod_data=True,
58 | cat_encoder_names=None,
59 | clean_nan=True,
60 | num_generator_features=True,
61 | group_generator_features=True,
62 | target_enc_cat_features=True,
63 | normalization=True,
64 | random_state=42,
65 | verbose=1):
66 | """
67 | Description of __init__
68 |
69 | Args:
70 | X_train=None (undefined): dataset
71 | y_train=None (undefined): y
72 | X_test=None (undefined): dataset
73 | y_test=None (undefined): y
74 | cat_features=None (list or None):
75 | clean_and_encod_data=True (undefined):
76 | cat_encoder_names=None (list or None):
77 | clean_nan=True (undefined):
78 | num_generator_features=True (undefined):
79 | group_generator_features=True (undefined):
80 | target_enc_cat_features=True (undefined)
81 | random_state=42 (undefined):
82 | verbose = 1 (undefined)
83 | """
84 | self.random_state = random_state
85 |
86 | self.X_train = None
87 | self.y_train = None
88 | self.X_test = None
89 | self.y_test = None
90 | self.X_train_predicts = None
91 | self.X_test_predicts = None
92 | self.cat_features = None
93 |
94 | # Encoders
95 | self.cat_encoders_names = settings.cat_encoders_names
96 | self.target_encoders_names = settings.target_encoders_names
97 |
98 |
99 | self.cat_encoder_names = cat_encoder_names
100 | self.cat_encoder_names_list = list(self.cat_encoders_names.keys()) + list(self.target_encoders_names.keys())
101 | self.target_encoders_names_list = list(self.target_encoders_names.keys())
102 |
103 | # check X_train, y_train, X_test
104 | if self.check_data_format(X_train):
105 | if type(X_train) == dask.dataframe.core.DataFrame:
106 | self.X_train_source = X_train.compute()
107 | else:
108 | self.X_train_source = pd.DataFrame(X_train)
109 | self.X_train_source = remove_duplicate_cols_in_dataset(self.X_train_source)
110 | if X_test is not None:
111 | if self.check_data_format(X_test):
112 | if type(X_test) == dask.dataframe.core.DataFrame:
113 | self.X_test_source = X_test.compute()
114 | else:
115 | self.X_test_source = pd.DataFrame(X_test)
116 | self.X_test_source = remove_duplicate_cols_in_dataset(self.X_test_source)
117 |
118 |
119 | ### There is a chance for an error in this - so worth watching!
120 | if y_train is not None:
121 | le = LabelEncoder()
122 | if self.check_data_format(y_train):
123 | if settings.multi_label:
124 |                     ### if the model is multi-label, don't transform it since it won't work
125 | self.y_train_source = y_train
126 | else:
127 | if not isinstance(y_train, pd.DataFrame):
128 | if y_train.dtype == 'object' or str(y_train.dtype) == 'category':
129 | self.y_train_source = le.fit_transform(y_train)
130 | else:
131 | if settings.modeltype == 'Multi_Classification':
132 | rare_class = find_rare_class(y_train)
133 | if rare_class != 0:
134 | ### if the rare class is not zero, then transform it using Label Encoder
135 | y_train = le.fit_transform(y_train)
136 | self.y_train_source = copy.deepcopy(y_train)
137 | else:
138 | print('Error: y_train should be a series. Skipping target encoding for dataset...')
139 | target_enc_cat_features = False
140 | else:
141 | if settings.multi_label:
142 | self.y_train_source = pd.DataFrame(y_train)
143 | else:
144 | if y_train.dtype == 'object' or str(y_train.dtype) == 'category':
145 | self.y_train_source = le.fit_transform(pd.DataFrame(y_train))
146 | else:
147 | self.y_train_source = copy.deepcopy(y_train)
148 | else:
149 | print("No target data found!")
150 | return
151 |
152 | if y_test is not None:
153 | self.y_test = y_test
154 |
155 | if verbose > 0:
156 | print('Source X_train shape: ', self.X_train_source.shape)
157 | if not X_test is None:
158 | print('| Source X_test shape: ', self.X_test_source.shape)
159 | print('#'*50)
160 |
161 | # add categorical features in DataBunch
162 | if cat_features is None:
163 | self.cat_features = self.auto_detect_cat_features(self.X_train_source)
164 | if verbose > 0:
165 | print('Auto detect cat features: ', len(self.cat_features))
166 |
167 | else:
168 | self.cat_features = list(cat_features)
169 |
170 | # preproc_data in DataBunch
171 | if clean_and_encod_data:
172 | if verbose > 0:
173 | print('> Start preprocessing with %d variables' %self.X_train_source.shape[1])
174 | self.X_train, self.X_test = self.preproc_data(self.X_train_source,
175 | self.X_test_source,
176 | self.y_train_source,
177 | cat_features=self.cat_features,
178 | cat_encoder_names=cat_encoder_names,
179 | clean_nan=clean_nan,
180 | num_generator_features=num_generator_features,
181 | group_generator_features=group_generator_features,
182 | target_enc_cat_features=target_enc_cat_features,
183 | normalization=normalization,
184 | verbose=verbose,)
185 | else:
186 | self.X_train, self.X_test = X_train, X_test
187 |
188 |
189 | def check_data_format(self, data):
190 | """
191 | Description of check_data_format:
192 |         Check that the data is a pandas DataFrame/Series, numpy array, or dask DataFrame and is not empty
193 |
194 | Args:
195 | data (undefined): dataset
196 | Return:
197 | True or Exception
198 | """
199 | data_tmp = pd.DataFrame(data)
200 | if data_tmp is None or data_tmp.empty:
201 |             raise Exception("data is empty or not a valid DataFrame")
202 | else:
203 | if isinstance(data, pd.Series) or isinstance(data, pd.DataFrame):
204 | return True
205 | elif isinstance(data, np.ndarray):
206 | return True
207 | elif type(data) == dask.dataframe.core.DataFrame:
208 | return True
209 | else:
210 |                 return False
211 |
212 | def clean_nans(self, data, cols=None):
213 | """
214 |         Fill NaNs and add an indicator column flagging which rows had NaNs in each affected column
215 |
216 | Args:
217 | data (pd.DataFrame, shape (n_samples, n_features)): the input data
218 |             cols (list): columns to fill and flag
219 | Return:
220 | Clean data (pd.DataFrame, shape (n_samples, n_features))
221 |
222 | """
223 | if cols is not None:
224 | nan_columns = list(data[cols].columns[data[cols].isnull().sum() > 0])
225 | if nan_columns:
226 | for nan_column in nan_columns:
227 | data[nan_column+'_isNAN'] = pd.isna(data[nan_column]).astype('uint8')
228 | data.fillna(data.median(), inplace=True)
229 | return(data)
230 |
231 |
232 | def auto_detect_cat_features(self, data):
233 | """
234 | Description of _auto_detect_cat_features:
235 |             Auto-detect categorical features by a simple rule:
236 |             a feature is treated as categorical if its number of unique values is below 1% of the row count (and above 2)
237 |
238 | Args:
239 | data (pd.DataFrame): dataset
240 |
241 | Returns:
242 |             cat_features (list): column names of the detected categorical features
243 |
244 | """
245 | #object_features = list(data.columns[data.dtypes == 'object'])
246 | cat_features = data.columns[(data.nunique(dropna=False) < len(data)//100) & \
247 | (data.nunique(dropna=False) >2)]
248 | #cat_features = list(set([*object_features, *cat_features]))
249 | return (cat_features)
250 |
251 |
252 | def gen_cat_encodet_features(self, data, cat_encoder_name):
253 | """
254 | Description of _encode_features:
255 |             Encode categorical features
256 |
257 | Args:
258 | data (pd.DataFrame):
259 | cat_encoder_name (str): cat Encoder name
260 |
261 | Returns:
262 | pd.DataFrame
263 |
264 | """
265 |
266 | if isinstance(cat_encoder_name, str):
267 | if cat_encoder_name in self.cat_encoder_names_list and cat_encoder_name not in self.target_encoders_names_list:
268 | if cat_encoder_name == 'HashingEncoder':
269 | encoder = self.cat_encoders_names[cat_encoder_name][0](cols=self.cat_features, n_components=int(np.log(len(data.columns))*1000),
270 | drop_invariant=True)
271 | else:
272 | encoder = self.cat_encoders_names[cat_encoder_name][0](cols=self.cat_features, drop_invariant=True)
273 | data_encodet = encoder.fit_transform(data)
274 | data_encodet = data_encodet.add_prefix(cat_encoder_name + '_')
275 | else:
276 | print(f"{cat_encoder_name} is not supported!")
277 | return ('', '')
278 | else:
279 | encoder = copy.deepcopy(cat_encoder_name)
280 | data_encodet = encoder.transform(data)
281 | data_encodet = data_encodet.add_prefix(str(cat_encoder_name).split("(")[0] + '_')
282 |
283 |
284 | return (data_encodet, encoder)
285 |
286 |
287 | def gen_target_encodet_features(self, x_data, y_data=None, cat_encoder_name=''):
288 | """
289 | Description of _encode_features:
290 |             Encode categorical features with a target-based encoder
291 |
292 | Args:
293 | data (pd.DataFrame):
294 | cat_encoder_name (str): cat Encoder name
295 |
296 | Returns:
297 | pd.DataFrame
298 |
299 | """
300 |
301 |
302 | if isinstance(cat_encoder_name, str):
303 | ### If it is the first time, it will perform fit_transform !
304 | if cat_encoder_name in self.target_encoders_names_list:
305 | encoder = self.target_encoders_names[cat_encoder_name][0](cols=self.cat_features, drop_invariant=True)
306 | if settings.modeltype == 'Multi_Classification':
307 | ### you must put a Polynomial Wrapper on the cat_encoder in case the model is multi-class
308 | if cat_encoder_name in ['WOEEncoder']:
309 | encoder = PolynomialWrapper(encoder)
310 | ### All other encoders TargetEncoder CatBoostEncoder GLMMEncoder don't need
311 | ### Polynomial Wrappers since they handle multi-class (label encoded) very well!
312 | cols = encoder.cols
313 | for each_col in cols:
314 | x_data[each_col] = encoder.fit_transform(x_data[each_col], y_data).values
315 | data_encodet = encoder.fit_transform(x_data, y_data)
316 | data_encodet = data_encodet.add_prefix(cat_encoder_name + '_')
317 | else:
318 | print(f"{cat_encoder_name} is not supported!")
319 | return ('', '')
320 | else:
321 | ### if it is already fit, then it will only do transform here !
322 | encoder = copy.deepcopy(cat_encoder_name)
323 | data_encodet = encoder.transform(x_data)
324 | data_encodet = data_encodet.add_prefix(str(cat_encoder_name).split("(")[0] + '_')
325 |
326 |
327 | return (data_encodet, encoder)
328 |
329 | def gen_numeric_interaction_features(self,
330 | df,
331 | columns,
332 | operations=['/','*','-','+'],):
333 | """
334 | Description of numeric_interaction_terms:
335 |             Numeric interaction feature generator: A/B, A*B, A-B, A+B (plus squared terms)
336 |
337 | Args:
338 | df (pd.DataFrame):
339 | columns (list): num columns names
340 | operations (list): operations type
341 |
342 | Returns:
343 | pd.DataFrame
344 |
345 | """
346 | copy_columns = copy.deepcopy(columns)
347 | fe_df = pd.DataFrame()
348 | for combo_col in combinations(columns,2):
349 | if '/' in operations:
350 | fe_df['{}_div_by_{}'.format(combo_col[0], combo_col[1]) ] = (df[combo_col[0]]*1.) / df[combo_col[1]]
351 | if '*' in operations:
352 | fe_df['{}_mult_by_{}'.format(combo_col[0], combo_col[1]) ] = df[combo_col[0]] * df[combo_col[1]]
353 | if '-' in operations:
354 | fe_df['{}_minus_{}'.format(combo_col[0], combo_col[1]) ] = df[combo_col[0]] - df[combo_col[1]]
355 | if '+' in operations:
356 | fe_df['{}_plus_{}'.format(combo_col[0], combo_col[1]) ] = df[combo_col[0]] + df[combo_col[1]]
357 |
358 | for each_col in copy_columns:
359 | fe_df['{}_squared'.format(each_col) ] = df[each_col].pow(2)
360 | return (fe_df)
361 |
362 |
363 | def gen_groupby_cat_encode_features(self, data, cat_columns, num_column,
364 | cat_encoder_name='JamesSteinEncoder'):
365 | """
366 | Description of group_encoder
367 |
368 | Args:
369 | data (pd.DataFrame): dataset
370 | cat_columns (list): cat columns names
371 | num_column (str): num column name
372 |
373 | Returns:
374 | pd.DataFrame
375 |
376 | """
377 |
378 | if isinstance(cat_encoder_name, str):
379 | if cat_encoder_name in self.cat_encoder_names_list:
380 | encoder = JamesSteinEncoder(cols=self.cat_features, model='beta', return_df = True, drop_invariant=True)
381 | encoder.fit(X=data[cat_columns], y=data[num_column].values)
382 | else:
383 | print(f"{cat_encoder_name} is not supported!")
384 | return ('', '')
385 | else:
386 | encoder = copy.deepcopy(cat_encoder_name)
387 |
388 | data_encodet = encoder.transform(X=data[cat_columns], y=data[num_column].values)
389 | data_encodet = data_encodet.add_prefix('GroupEncoded_' + num_column + '_')
390 |
391 | return (data_encodet, encoder)
392 |
393 | def preproc_data(self, X_train=None,
394 | X_test=None,
395 | y_train=None,
396 | cat_features=None,
397 | cat_encoder_names=None,
398 | clean_nan=True,
399 | num_generator_features=True,
400 | group_generator_features=True,
401 | target_enc_cat_features=True,
402 | normalization=True,
403 | verbose=1,):
404 | """
405 | Description of preproc_data:
406 | dataset preprocessing function
407 |
408 | Args:
409 | X_train=None (pd.DataFrame):
410 | X_test=None (pd.DataFrame):
411 | y_train=None (pd.DataFrame):
412 | cat_features=None (list):
413 | cat_encoder_names=None (list):
414 | clean_nan=True (Bool):
415 | num_generator_features=True (Bool):
416 | group_generator_features=True (Bool):
417 |
418 | Returns:
419 | X_train (pd.DataFrame)
420 | X_test (pd.DataFrame)
421 |
422 | """
423 |
424 | #### Sometimes there are duplicates in column names. You must remove them here. ###
425 | cat_features = find_remove_duplicates(cat_features)
426 |
427 |         # copy the datasets so the source frames are not modified during processing
428 | df_train = X_train.copy()
429 |
430 | if X_test is None:
431 | data = df_train
432 | test_data = None ### Set test_data to None if X_test is None
433 | else:
434 | test_data = X_test.copy()
435 | test_data = remove_duplicate_cols_in_dataset(test_data)
436 | data = copy.deepcopy(df_train)
437 |
438 | data = remove_duplicate_cols_in_dataset(data)
439 |
440 | # object & num features
441 | object_features = list(data.columns[(data.dtypes == 'object') | (data.dtypes == 'category')])
442 | num_features = list(set(data.columns) - set(cat_features) - set(object_features) - {'test'})
443 | encodet_features_names = list(set(object_features + list(cat_features)))
444 |
445 | original_number_features = len(encodet_features_names)
446 | count_number_features = df_train.shape[1]
447 |
448 | self.encodet_features_names = encodet_features_names
449 | self.num_features_names = num_features
450 | self.binary_features_names = []
451 |
452 | # LabelEncode all Binary Features - leave the rest alone
453 | cols = data.columns.tolist()
454 | #### This sometimes errors because there are duplicate columns in a dataset ###
455 | print('LabelEncode all Boolean Features. Leave the rest alone')
456 | for feature in cols:
457 | if data[feature].dtype == bool :
458 | print(' boolean feature = ',feature)
459 |
460 | for feature in cols:
461 | if (data[feature].dtype == bool):
462 | data[feature] = data[feature].astype('category').cat.codes
463 | if test_data is not None:
464 | test_data[feature] = test_data[feature].astype('category').cat.codes
465 | self.binary_features_names.append(feature)
466 |
467 |         # Convert all categorical features to "category" codes if no encoder is specified
468 | cat_only_encoders = [x for x in self.cat_encoder_names if x in self.cat_encoders_names]
469 | if len(cat_only_encoders) > 0:
470 | ### Just skip if this encoder is not in the list of category encoders ##
471 | if encodet_features_names:
472 | if cat_encoder_names is None:
473 | for feature in encodet_features_names:
474 | data[feature] = data[feature].fillna('missing')
475 | data[feature] = data[feature].astype('category').cat.codes
476 | if test_data is not None:
477 | test_data[feature] = test_data[feature].fillna('missing')
478 | test_data[feature] = test_data[feature].astype('category').cat.codes
479 | else:
480 | #### If an encoder is specified, then use that encoder to transform categorical variables
481 | if verbose > 0:
482 | print('> Generate Categorical Encoded features')
483 |
484 | copy_cat_encoder_names = copy.deepcopy(cat_encoder_names)
485 | for encoder_name in copy_cat_encoder_names:
486 | if verbose > 0:
487 | print(' + To know more, click: %s' %self.cat_encoders_names[encoder_name][1])
488 | data_encodet, train_encoder = self.gen_cat_encodet_features(data[encodet_features_names],
489 | encoder_name)
490 | if not isinstance(data_encodet, str):
491 | data = pd.concat([data, data_encodet], axis=1)
492 | if test_data is not None:
493 | test_encodet, _ = self.gen_cat_encodet_features(test_data[encodet_features_names],
494 | train_encoder)
495 | if not isinstance(test_encodet, str):
496 | test_data = pd.concat([test_data, test_encodet], axis=1)
497 |
498 | if verbose > 0:
499 | if not isinstance(data_encodet, str):
500 | addl_features = data_encodet.shape[1] - original_number_features
501 | count_number_features += addl_features
502 | print(' + added ', addl_features, ' additional Features using',encoder_name)
503 |
504 | # Generate Target related Encoder features for cat variables:
505 |
506 |
507 | target_encoders = [x for x in self.cat_encoder_names if x in self.target_encoders_names_list]
508 | if len(target_encoders) > 0:
509 | target_enc_cat_features = True
510 | if target_enc_cat_features:
511 | if encodet_features_names:
512 | if verbose > 0:
513 | print('> Generate Target Encoded categorical features')
514 |
515 | if len(target_encoders) == 0:
516 | target_encoders = ['TargetEncoder'] ### set the default as TargetEncoder if nothing is specified
517 | copy_target_encoders = copy.deepcopy(target_encoders)
518 | for encoder_name in copy_target_encoders:
519 | if verbose > 0:
520 | print(' + To know more, click: %s' %self.target_encoders_names[encoder_name][1])
521 | data_encodet, train_encoder = self.gen_target_encodet_features(data[encodet_features_names],
522 | self.y_train_source, encoder_name)
523 | if not isinstance(data_encodet, str):
524 | data = pd.concat([data, data_encodet], axis=1)
525 |
526 | if test_data is not None:
527 | test_encodet, _ = self.gen_target_encodet_features(test_data[encodet_features_names],'',
528 | train_encoder)
529 | if not isinstance(test_encodet, str):
530 | test_data = pd.concat([test_data, test_encodet], axis=1)
531 |
532 |
533 | if verbose > 0:
534 | if not isinstance(data_encodet, str):
535 | addl_features = data_encodet.shape[1] - original_number_features
536 | count_number_features += addl_features
537 | print(' + added ', len(encodet_features_names) , ' additional Features using ', encoder_name)
538 |
539 | # Clean NaNs in Numeric variables only
540 | if clean_nan:
541 | if verbose > 0:
542 | print('> Cleaned NaNs in numeric features')
543 | data = self.clean_nans(data, cols=num_features)
544 | if test_data is not None:
545 | test_data = self.clean_nans(test_data, cols=num_features)
546 | ### Sometimes, train has nulls while test doesn't and vice versa
547 | if test_data is not None:
548 | rem_cols = left_subtract(list(data),list(test_data))
549 | if len(rem_cols) > 0:
550 | for rem_col in rem_cols:
551 | test_data[rem_col] = 0
552 | elif len(left_subtract(list(test_data),list(data))) > 0:
553 | rem_cols = left_subtract(list(test_data),list(data))
554 | for rem_col in rem_cols:
555 | data[rem_col] = 0
556 | else:
557 | print(' + test and train have similar NaN columns')
558 |
559 | # Generate interaction features for Numeric variables
560 | if num_generator_features:
561 | if len(num_features) > 1:
562 | if verbose > 0:
563 | print('> Generate Interactions features among Numeric variables')
564 | fe_df = self.gen_numeric_interaction_features(data[num_features],
565 | num_features,
566 | operations=['/','*','-','+'],)
567 |
568 | if not isinstance(fe_df, str):
569 | data = pd.concat([data,fe_df],axis=1)
570 | if test_data is not None:
571 | fe_test = self.gen_numeric_interaction_features(test_data[num_features],
572 | num_features,
573 | operations=['/','*','-','+'],)
574 | if not isinstance(fe_test, str):
575 | test_data = pd.concat([test_data, fe_test], axis=1)
576 |
577 | if verbose > 0:
578 | if not isinstance(fe_df, str):
579 | addl_features = fe_df.shape[1]
580 | count_number_features += addl_features
581 | print(' + added ', addl_features, ' Interaction Features ',)
582 |
583 | # Generate Group Encoded Features for Numeric variables only using all Categorical variables
584 | if group_generator_features:
585 | if encodet_features_names and num_features:
586 | if verbose > 0:
587 | print('> Generate Group-by Encoded Features')
588 | print(' + To know more, click: %s' %self.target_encoders_names['JamesSteinEncoder'][1])
589 |
590 | for num_col in num_features:
591 | data_encodet, train_group_encoder = self.gen_groupby_cat_encode_features(
592 | data,
593 | encodet_features_names,
594 | num_col,)
595 | if not isinstance(data_encodet, str):
596 | data = pd.concat([data, data_encodet],axis=1)
597 | if test_data is not None:
598 | test_encodet, _ = self.gen_groupby_cat_encode_features(
599 | data,
600 | encodet_features_names,
601 | num_col,train_group_encoder)
602 | if not isinstance(test_encodet, str):
603 | test_data = pd.concat([test_data, test_encodet], axis=1)
604 |
605 | if verbose > 0:
606 | addl_features = data_encodet.shape[1]*len(num_features)
607 | count_number_features += addl_features
608 | print(' + added ', addl_features, ' Group-by Encoded Features using JamesSteinEncoder')
609 |
610 |
611 | # Drop source cat features
612 | if not len(cat_encoder_names) == 0:
613 | ### if there is no categorical encoding, then let the categorical_vars pass through.
614 | ### If they have been transformed into Cat Encoded variables, then you can drop them!
615 | data.drop(columns=encodet_features_names, inplace=True)
616 | # In this case, there may be some inf values, replace them ######
617 | data.replace([np.inf, -np.inf], np.nan, inplace=True)
618 | #data.fillna(0, inplace=True)
619 | if test_data is not None:
620 | if not len(cat_encoder_names) == 0:
621 | ### if there is no categorical encoding, then let the categorical_vars pass through.
622 | test_data.drop(columns=encodet_features_names, inplace=True)
623 | test_data.replace([np.inf, -np.inf], np.nan, inplace=True)
624 | #test_data.fillna(0, inplace=True)
625 |
626 | X_train = copy.deepcopy(data)
627 | X_test = copy.deepcopy(test_data)
628 |
629 | # Normalization Data
630 |         if normalization:
631 |             if verbose > 0:
632 |                 print('> Normalization Features')
633 |             columns_name = X_train.columns.values
634 |             scaler = StandardScaler().fit(X_train)
635 |             X_train = pd.DataFrame(scaler.transform(X_train), columns=columns_name)
636 |             if X_test is not None:
637 |                 ### X_test can be None when no test data was supplied
638 |                 X_test = pd.DataFrame(scaler.transform(X_test), columns=columns_name)
639 |
640 | if verbose > 0:
641 | print('#'*50)
642 | print('> Final Number of Features: ', (X_train.shape[1]))
643 | print('#'*50)
644 |             if X_test is not None:
645 |                 print('New X_train rows: %s, X_test rows: %s' %(X_train.shape[0], X_test.shape[0]))
646 |                 print('New X_train columns: %s, X_test columns: %s' %(X_train.shape[1], X_test.shape[1]))
647 |                 if len(left_subtract(X_test.columns, X_train.columns)) > 0:
648 |                     print('There are more columns in test than train due to columns missing from train. Continuing...')
649 |
650 | return X_train, X_test
651 | ################################################################################
652 | def find_rare_class(series, verbose=0):
653 | ######### Print the % count of each class in a Target variable #####
654 | """
655 | Works on Multi Class too. Prints class percentages count of target variable.
656 | It returns the name of the Rare class (the one with the minimum class member count).
657 | This can also be helpful in using it as pos_label in Binary and Multi Class problems.
658 | """
659 | return series.value_counts().index[-1]
660 | #################################################################################
661 | def left_subtract(l1,l2):
662 | lst = []
663 | for i in l1:
664 | if i not in l2:
665 | lst.append(i)
666 | return lst
667 | #################################################################################
668 | def remove_duplicate_cols_in_dataset(df):
669 | df = copy.deepcopy(df)
670 | cols = df.columns.tolist()
671 | number_duplicates = df.columns.duplicated().astype(int).sum()
672 | if number_duplicates > 0:
673 | print('Detected %d duplicate columns in dataset. Removing duplicates...' %number_duplicates)
674 | df = df.loc[:,~df.columns.duplicated()]
675 | return df
676 | ###########################################################################
677 | # Removes duplicates from a list to return unique values - USED ONLYONCE
678 | def find_remove_duplicates(values):
679 | output = []
680 | seen = set()
681 | for value in values:
682 | if value not in seen:
683 | output.append(value)
684 | seen.add(value)
685 | return output
686 | #################################################################################
687 |
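# -----------------------------------------------------------------------------
# Illustrative usage sketch (added for this write-up, not part of the module).
# DataBunch expects the global settings module to be initialized first; the
# dataframes, column names and modeltype below are placeholders for illustration.
#
#   from featurewiz import settings
#   settings.init()
#   settings.modeltype, settings.multi_label = 'Regression', False
#   bunch = DataBunch(X_train=train_df.drop(columns='target'),
#                     y_train=train_df['target'],
#                     X_test=test_df,
#                     cat_encoder_names=['OrdinalEncoder'],
#                     normalization=False, verbose=1)
#   X_tr, X_te = bunch.X_train, bunch.X_test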
--------------------------------------------------------------------------------
/featurewiz/encoders.py:
--------------------------------------------------------------------------------
1 | ###############################################################################
2 | # MIT License
3 | #
4 | # Copyright (c) 2020 Alex Lekov
5 | #
6 | # Permission is hereby granted, free of charge, to any person obtaining a copy
7 | # of this software and associated documentation files (the "Software"), to deal
8 | # in the Software without restriction, including without limitation the rights
9 | # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
10 | # copies of the Software, and to permit persons to whom the Software is
11 | # furnished to do so, subject to the following conditions:
12 | #
13 | # The above copyright notice and this permission notice shall be included in all
14 | # copies or substantial portions of the Software.
15 | #
16 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
17 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
18 | # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
19 | # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
20 | # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
21 | # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
22 | # SOFTWARE.
23 | ###############################################################################
24 | ##### This amazing Library was created by Alex Lekov: Many Thanks to Alex! ###
25 | ##### https://github.com/Alex-Lekov/AutoML_Alex ###
26 | ###############################################################################
27 | import pandas as pd
28 | import numpy as np
29 | import pdb  # needed by the debugging fallback in FrequencyEncoder.transform()
30 | ################################################################
31 | # Simple Encoders
32 | # (do not use information about target)
33 | ################################################################
34 |
35 | class FrequencyEncoder():
36 | """
37 | FrequencyEncoder
38 | Conversion of category into frequencies.
39 | Parameters
40 | ----------
41 | cols : list of categorical features.
42 | drop_invariant : not used
43 | """
44 | def __init__(self, cols=None, drop_invariant=None):
45 | """
46 | Description of __init__
47 |
48 | Args:
49 | cols=None (undefined): columns in dataset
50 | drop_invariant=None (undefined): not used
51 |
52 | """
53 | self.cols = cols
54 | self.counts_dict = None
55 |
56 | def fit(self, X: pd.DataFrame, y=None) -> pd.DataFrame:
57 | """
58 | Description of fit
59 |
60 | Args:
61 | X (pd.DataFrame): dataset
62 | y=None (not used): not used
63 |
64 | Returns:
65 | pd.DataFrame
66 |
67 | """
68 | counts_dict = {}
69 | if self.cols is None:
70 | self.cols = X.columns
71 | for col in self.cols:
72 | values = X[col].value_counts(dropna=False).index
73 |             n_obs = float(len(X))  # np.float alias was removed in newer NumPy versions
74 | counts = list(X[col].value_counts(dropna=False) / n_obs)
75 | counts_dict[col] = dict(zip(values, counts))
76 | self.counts_dict = counts_dict
77 |
78 | def transform(self, X: pd.DataFrame) -> pd.DataFrame:
79 | """
80 | Description of transform
81 |
82 | Args:
83 | X (pd.DataFrame): dataset
84 |
85 | Returns:
86 | pd.DataFrame
87 |
88 | """
89 | counts_dict_test = {}
90 | res = []
91 | for col in self.cols:
92 | values = X[col].value_counts(1,dropna=False).index.tolist()
93 | counts = X[col].value_counts(1,dropna=False).values.tolist()
94 | counts_dict_test[col] = dict(zip(values, counts))
95 |
96 | # if value is in "train" keys - replace "test" counts with "train" counts
97 | for k in [
98 | key
99 | for key in counts_dict_test[col].keys()
100 | if key in self.counts_dict[col].keys()
101 | ]:
102 | counts_dict_test[col][k] = self.counts_dict[col][k]
103 | res.append(X[col].map(counts_dict_test[col]).values.reshape(-1, 1))
104 | try:
105 | res = np.hstack(res)
106 | except:
107 | pdb.set_trace()
108 | X[self.cols] = res
109 | return X
110 |
111 | def fit_transform(self, X: pd.DataFrame, y=None) -> pd.DataFrame:
112 | """
113 | Description of fit_transform
114 |
115 | Args:
116 | X (pd.DataFrame): dataset
117 | y=None (undefined): not used
118 |
119 | Returns:
120 | pd.DataFrame
121 |
122 | """
123 | self.fit(X, y)
124 | X = self.transform(X)
125 | return X
126 |
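# -----------------------------------------------------------------------------
# Illustrative usage sketch (added for this write-up, not part of the module);
# the toy column names below are made up for demonstration only.
if __name__ == '__main__':
    toy = pd.DataFrame({'city': ['NY', 'NY', 'LA', 'SF'], 'sales': [10, 12, 7, 3]})
    enc = FrequencyEncoder(cols=['city'])
    # 'city' is replaced by each category's observed frequency: 0.5, 0.5, 0.25, 0.25
    print(enc.fit_transform(toy))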
--------------------------------------------------------------------------------
/featurewiz/settings.py:
--------------------------------------------------------------------------------
1 | ### this defines some of the global settings for encoder names in one place ####
2 | from category_encoders import HashingEncoder, SumEncoder, PolynomialEncoder, BackwardDifferenceEncoder
3 | from category_encoders import OneHotEncoder, HelmertEncoder, OrdinalEncoder, CountEncoder, BaseNEncoder
4 | from category_encoders import TargetEncoder, CatBoostEncoder, WOEEncoder, JamesSteinEncoder
5 | from category_encoders.glmm import GLMMEncoder
6 | from sklearn.preprocessing import LabelEncoder
7 | from category_encoders.wrapper import PolynomialWrapper
8 | from .encoders import FrequencyEncoder
9 | #################################################################################
10 | def init():
11 | global cat_encoders_names
12 | cat_encoders_names = {
13 | 'HashingEncoder': [HashingEncoder,'https://contrib.scikit-learn.org/category_encoders/hashing.html'],
14 | 'SumEncoder': [SumEncoder,'https://contrib.scikit-learn.org/category_encoders/sum.html'],
15 | 'PolynomialEncoder': [PolynomialEncoder,'https://contrib.scikit-learn.org/category_encoders/polynomial.html'],
16 | 'BackwardDifferenceEncoder': [BackwardDifferenceEncoder,'https://contrib.scikit-learn.org/category_encoders/backward_difference.html'],
17 | 'OneHotEncoder': [OneHotEncoder,'https://contrib.scikit-learn.org/category_encoders/onehot.html'],
18 | 'HelmertEncoder': [HelmertEncoder,'https://contrib.scikit-learn.org/category_encoders/helmert.html'],
19 | 'OrdinalEncoder': [OrdinalEncoder,'https://contrib.scikit-learn.org/category_encoders/ordinal.html'],
20 | 'BaseNEncoder': [BaseNEncoder,'https://contrib.scikit-learn.org/category_encoders/basen.html'],
21 | 'FrequencyEncoder': [FrequencyEncoder,'https://github.com/Alex-Lekov/AutoML_Alex/blob/master/automl_alex/encoders.py'],
22 | }
23 |
24 | global target_encoders_names
25 | target_encoders_names = {
26 | 'TargetEncoder': [TargetEncoder,'https://contrib.scikit-learn.org/category_encoders/targetencoder.html'],
27 | 'CatBoostEncoder': [CatBoostEncoder,'https://contrib.scikit-learn.org/category_encoders/catboost.html'],
28 | 'WOEEncoder': [WOEEncoder,'https://contrib.scikit-learn.org/category_encoders/woe.html'],
29 | 'JamesSteinEncoder': [JamesSteinEncoder,'https://contrib.scikit-learn.org/category_encoders/jamesstein.html'],
30 | 'GLMMEncoder': [GLMMEncoder,'https://contrib.scikit-learn.org/category_encoders/glmm.html'],
31 | }
32 |
33 |     global modeltype
34 |     modeltype = ''
35 | global multi_label
36 | multi_label = False
37 | #################################################################################
38 |
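# -----------------------------------------------------------------------------
# Illustrative note (added for this write-up): the two dictionaries above only
# exist after init() has been called, which is how the other modules use them:
#
#   from featurewiz import settings
#   settings.init()
#   print(list(settings.cat_encoders_names))    # HashingEncoder, SumEncoder, ...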
--------------------------------------------------------------------------------
/featurewiz/sulov_method.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import random
4 | np.random.seed(99)
5 | random.seed(42)
6 | from . import settings
7 | settings.init()
8 | ################################################################################
9 | #### The warnings from Sklearn are so annoying that I have to shut it off #######
10 | import warnings
11 | warnings.filterwarnings("ignore")
12 | from sklearn.exceptions import DataConversionWarning
13 | warnings.filterwarnings(action='ignore', category=DataConversionWarning)
14 | def warn(*args, **kwargs):
15 | pass
16 | warnings.warn = warn
17 | import logging
18 | ####################################################################################
19 | import pdb
20 | import copy
21 | import time
22 | from sklearn.feature_selection import chi2, mutual_info_regression, mutual_info_classif
23 | from sklearn.feature_selection import SelectKBest
24 | from itertools import combinations
25 | import matplotlib.patches as mpatches
26 | import matplotlib.pyplot as plt
27 | #################################################################################################
28 | from collections import defaultdict
29 | from collections import OrderedDict
30 | import time
31 | import networkx as nx # Import networkx for groupwise method
32 | #################################################################################
33 | def left_subtract(l1,l2):
34 | lst = []
35 | for i in l1:
36 | if i not in l2:
37 | lst.append(i)
38 | return lst
39 | #################################################################################
40 | def return_dictionary_list(lst_of_tuples):
41 | """ Returns a dictionary of lists if you send in a list of Tuples"""
42 | orDict = defaultdict(list)
43 | # iterating over list of tuples
44 | for key, val in lst_of_tuples:
45 | orDict[key].append(val)
46 | return orDict
47 | ################################################################################
48 | def find_remove_duplicates(list_of_values):
49 | """
50 | # Removes duplicates from a list to return unique values - USED ONLY ONCE
51 | """
52 | output = []
53 | seen = set()
54 | for value in list_of_values:
55 | if value not in seen:
56 | output.append(value)
57 | seen.add(value)
58 | return output
59 | ##################################################################################
60 | def remove_highly_correlated_vars_fast(df, corr_limit): # Keeping original function for fallback
61 | """Fast method to remove highly correlated vars using just linear correlation."""
62 | corr_matrix = df.corr().abs()
63 | upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
64 | to_drop = [column for column in upper.columns if any(upper[column] > corr_limit)]
65 | return to_drop
66 |
67 | def FE_remove_variables_using_SULOV_method(df, numvars, modeltype, target,
68 | corr_limit = 0.70, verbose=0, dask_xgboost_flag=False,
69 | correlation_types = ['pearson'], # New parameter for correlation types
70 | adaptive_threshold = False, # New parameter for adaptive threshold
71 | sulov_mode = 'pairwise'): # New parameter for SULOV mode (pairwise/groupwise)
72 | """
73 | FE stands for Feature Engineering - it means this function performs feature engineering
74 | ###########################################################################################
75 | ##### SULOV stands for Searching Uncorrelated List Of Variables #############
76 | ###########################################################################################
77 | SULOV method was created by Ram Seshadri in 2018. This highly efficient method removes
78 | variables that are highly correlated using a series of pair-wise correlation knockout
79 | rounds. It is extremely fast and hence can work on thousands of variables in less than
80 | a minute, even on a laptop. You need to send in a list of numeric variables and that's
81 | all! The method defines high Correlation as anything over 0.70 (absolute) but this can
82 | be changed. If two variables have absolute correlation higher than this, they will be
83 | marked, and using a process of elimination, one of them will get knocked out:
84 |     To decide the order of variables to keep, we use the mutual information score (MIS). MIS gives
85 |     a ranked list of these correlated variables: when we select one, we knock out the others that
86 |     are highly correlated to it. Then we select the next variable to inspect. This continues until
87 |     we have knocked out all highly correlated variables in each set of variables. Finally we are
88 |     left with only uncorrelated variables that also have high mutual information scores.
89 | ###########################################################################################
90 | ######## YOU MUST INCLUDE THE ABOVE MESSAGE IF YOU COPY SULOV method IN YOUR LIBRARY ##########
91 | ###########################################################################################
92 | """
93 | df = copy.deepcopy(df)
94 | df_target = df[target]
95 | df = df[numvars]
96 | ### for some reason, doing a mass fillna of vars doesn't work! Hence doing it individually!
97 | null_vars = np.array(numvars)[df.isnull().sum()>0]
98 | for each_num in null_vars:
99 | df[each_num] = df[each_num].fillna(0)
100 | target = copy.deepcopy(target)
101 | if verbose:
102 | print('#######################################################################################')
103 | print('##### Searching for Uncorrelated List Of Variables (SULOV) in %s features ############' %len(numvars))
104 | print('#######################################################################################')
105 | print('Starting SULOV with %d features...' %len(numvars))
106 |
107 | # 1. Calculate Correlation Matrices based on correlation_types parameter
108 | correlation_matrices = {}
109 | for corr_type in correlation_types:
110 | correlation_matrices[corr_type] = df.corr(method=corr_type).abs()
111 |
112 | # 2. Adaptive Threshold (if enabled)
113 | current_corr_threshold = corr_limit
114 | if adaptive_threshold:
115 |         combined_corr_matrix = pd.concat(correlation_matrices.values()).groupby(level=0).max() # Max across all corr types
116 | upper_triangle_corrs = combined_corr_matrix.where(np.triu(np.ones(combined_corr_matrix.shape),k=1).astype(bool)).stack().sort_values(ascending=False)
117 | correlation_values = upper_triangle_corrs.values
118 | current_corr_threshold = np.percentile(correlation_values, 75) # Example: 75th percentile
119 | print(f"Adaptive Correlation Threshold: {current_corr_threshold:.3f}")
120 |
121 | # 3. Find Correlated Pairs based on all selected correlation types
122 | correlated_pairs = []
123 | for i in range(len(df.columns)):
124 | for j in range(i + 1, len(df.columns)):
125 | col1 = df.columns[i]
126 | col2 = df.columns[j]
127 | is_correlated = False
128 | for corr_type, corr_matrix in correlation_matrices.items():
129 | if corr_matrix.loc[col1, col2] >= current_corr_threshold:
130 | is_correlated = True
131 | break # If correlated by any type, consider them correlated
132 | if is_correlated:
133 | correlated_pairs.append((col1, col2))
134 |
135 | # Deterministic sorting of correlated pairs (always applied)
136 | correlated_pairs.sort()
137 |
138 | if modeltype == 'Regression':
139 | sel_function = mutual_info_regression
140 | else:
141 | sel_function = mutual_info_classif
142 |
143 | if correlated_pairs: # Proceed only if correlated pairs are found
144 | if isinstance(target, list):
145 | target = target[0]
146 | max_feats = len(numvars) # Changed from len(corr_list) to numvars to be more robust
147 |
148 | ##### you must ensure there are no infinite nor null values in corr_list df ##
149 | df_fit_cols = find_remove_duplicates(sum(correlated_pairs,())) # Unique cols from correlated pairs
150 | df_fit = df[df_fit_cols]
151 |
152 | ### Now check if there are any NaN values in the dataset #####
153 | if df_fit.isnull().sum().sum() > 0:
154 | df_fit = df_fit.dropna()
155 | else:
156 | print(' there are no null values in dataset...')
157 |
158 | if df_target.isnull().sum().sum() > 0:
159 | print(' there are null values in target. Returning with all vars...')
160 | return numvars
161 | else:
162 | print(' there are no null values in target column...')
163 |
164 | ##### Ready to perform fit and find mutual information score ####
165 |
166 | try:
167 | if modeltype == 'Regression':
168 | fs = mutual_info_regression(df_fit, df_target, n_neighbors=5, discrete_features=False, random_state=42)
169 | else:
170 | fs = mutual_info_classif(df_fit, df_target, n_neighbors=5, discrete_features=False, random_state=42)
171 | except Exception as e:
172 |             print(f'    mutual information calculation is erroring with: {e}. Returning with all {len(numvars)} variables...')
173 | return numvars
174 |
175 | try:
176 | #################################################################################
177 | ####### This is the main section where we use mutual info score to select vars
178 | #################################################################################
179 | mutual_info = dict(zip(df_fit_cols,fs)) # Use df_fit_cols as keys
180 | #### The first variable in list has the highest correlation to the target variable ###
181 | sorted_by_mutual_info =[key for (key,val) in sorted(mutual_info.items(), key=lambda kv: kv[1],reverse=True)]
182 |
183 | if sulov_mode == 'pairwise':
184 | ##### Now we select the final list of correlated variables (Pairwise SULOV) ###########
185 | selected_corr_list = []
186 | copy_sorted = copy.deepcopy(sorted_by_mutual_info)
187 | analyzed_pairs = set() # Track analyzed pairs to avoid redundancy
188 |
189 | for col1_sorted in copy_sorted:
190 | selected_corr_list.append(col1_sorted)
191 | for col2_tup in correlated_pairs:
192 | col1_corr, col2_corr = col2_tup
193 | pair = tuple(sorted(col2_tup)) # Ensure consistent pair order
194 | if col1_sorted == col1_corr and pair not in analyzed_pairs: # Check if current sorted col is part of a correlated pair
195 | analyzed_pairs.add(pair)
196 | if col2_corr in copy_sorted:
197 | copy_sorted.remove(col2_corr)
198 | elif col1_sorted == col2_corr and pair not in analyzed_pairs: # Check if current sorted col is part of a correlated pair
199 | analyzed_pairs.add(pair)
200 | if col1_corr in copy_sorted:
201 | copy_sorted.remove(col1_corr)
202 |
203 | elif sulov_mode == 'groupwise':
204 | ##### Groupwise SULOV ###########
205 | G = nx.Graph()
206 | for col in df_fit_cols: # Use df_fit_cols for graph nodes
207 | G.add_node(col)
208 | for col1_g, col2_g in correlated_pairs:
209 | G.add_edge(col1_g, col2_g)
210 | correlated_feature_groups = list(nx.connected_components(G))
211 |
212 | selected_corr_list = []
213 | features_to_drop_in_group = set()
214 | for group in correlated_feature_groups:
215 | if len(group) > 1:
216 | group_mis_scores = {feature: mutual_info.get(feature, 0) for feature in group} # Get MIS for group, default 0 if not found
217 | best_feature = max(group_mis_scores, key=group_mis_scores.get) # Feature with max MIS
218 | selected_corr_list.append(best_feature) # Keep best feature
219 | for feature in group:
220 | if feature != best_feature:
221 | features_to_drop_in_group.add(feature)
222 | selected_corr_list = list(set(selected_corr_list)) # Ensure unique selected features
223 | removed_cols_sulov = list(features_to_drop_in_group) # Features removed in groupwise mode
224 | final_list_corr_part = selected_corr_list # Renamed for clarity
225 |
226 | else: # Default to original pairwise logic if mode is not recognized
227 | print(f"Warning: Unknown SULOV mode '{sulov_mode}'. Defaulting to pairwise mode.")
228 | ##### Original Pairwise SULOV logic (as fallback) ######
229 | selected_corr_list = []
230 | copy_sorted = copy.deepcopy(sorted_by_mutual_info)
231 | copy_pair = dict(return_dictionary_list(correlated_pairs)) # Recreate pair dict if needed
232 | for each_corr_name in copy_sorted:
233 | selected_corr_list.append(each_corr_name)
234 | if each_corr_name in copy_pair: # Check if key exists before accessing
235 | for each_remove in copy_pair[each_corr_name]:
236 | if each_remove in copy_sorted:
237 | copy_sorted.remove(each_remove)
238 | final_list_corr_part = selected_corr_list # Renamed for clarity
239 |
240 |
241 | if sulov_mode != 'groupwise': # For pairwise and default modes
242 | final_list_corr_part = selected_corr_list # Renamed for consistency
243 | removed_cols_sulov = left_subtract(df_fit_cols, final_list_corr_part) # Calculate removed cols
244 |
245 | ##### Now we combine the uncorrelated list to the selected correlated list above
246 | rem_col_list = left_subtract(numvars, df_fit_cols) # Uncorrelated columns are those not in df_fit_cols
247 | final_list = rem_col_list + final_list_corr_part
248 | removed_cols = left_subtract(numvars, final_list) + removed_cols_sulov # Combine all removed cols
249 |
250 | except Exception as e:
251 | print(f' SULOV Method crashing due to: {e}')
252 | #### Dropping highly correlated Features fast using simple linear correlation ###
253 | removed_cols = remove_highly_correlated_vars_fast(df,corr_limit)
254 | final_list = left_subtract(numvars, removed_cols)
255 |
256 | if len(removed_cols) > 0:
257 | if verbose:
258 | print(f' Removing ({len(removed_cols)}) highly correlated variables:')
259 | if len(removed_cols) <= 30:
260 | print(f' {removed_cols}')
261 | if len(final_list) <= 30:
262 | print(f' Following ({len(final_list)}) vars selected: {final_list}')
263 |
264 | ############## D R A W C O R R E L A T I O N N E T W O R K ##################
265 | selected = copy.deepcopy(final_list)
266 | if verbose and len(selected) <= 1000 and correlated_pairs: # Draw only if correlated pairs exist
267 | try:
268 | #### Now start building the graph ###################
269 | gf = nx.Graph()
270 | ### the mutual info score gives the size of the bubble ###
271 | multiplier = 2100
272 | for each in sorted_by_mutual_info: # Use sorted_by_mutual_info for node order
273 | if each in mutual_info: # Check if mutual_info exists for the node
274 | gf.add_node(each, size=int(max(1,mutual_info[each]*multiplier)))
275 |
276 | ######### This is where you calculate the size of each node to draw
277 | sizes = [mutual_info.get(x,0)*multiplier for x in list(gf.nodes())] # Use .get with default 0 for robustness
278 |
279 | corr = df[df_fit_cols].corr() # Use df_fit_cols for correlation calculation
280 | high_corr = corr[abs(corr)>current_corr_threshold] # Use adaptive/original threshold
281 | combos = combinations(df_fit_cols,2) # Use df_fit_cols for combinations
282 |
283 | ### this gives the strength of correlation between 2 nodes ##
284 | multiplier_edge = 20 # Renamed to avoid confusion
285 | for (var1, var2) in combos:
286 | if np.isnan(high_corr.loc[var1,var2]):
287 | pass
288 | else:
289 | gf.add_edge(var1, var2,weight=multiplier_edge*high_corr.loc[var1,var2])
290 |
291 | ######## Now start building the networkx graph ##########################
292 | widths = nx.get_edge_attributes(gf, 'weight')
293 | nodelist = gf.nodes()
294 | cols = 5
295 | height_size = 5
296 | width_size = 15
297 | rows = int(len(df_fit_cols)/cols) # Use df_fit_cols length
298 | if rows < 1:
299 | rows = 1
300 | plt.figure(figsize=(width_size,min(20,height_size*rows)))
301 | pos = nx.shell_layout(gf)
302 | nx.draw_networkx_nodes(gf,pos,
303 | nodelist=nodelist,
304 | node_size=sizes,
305 | node_color='blue',
306 | alpha=0.5)
307 | nx.draw_networkx_edges(gf,pos,
308 | edgelist = widths.keys(),
309 | width=list(widths.values()),
310 | edge_color='lightblue',
311 | alpha=0.6)
312 | pos_higher = {}
313 | x_off = 0.04 # offset on the x axis
314 | y_off = 0.04 # offset on the y axis
315 | for k, v in pos.items():
316 | pos_higher[k] = (v[0]+x_off, v[1]+y_off)
317 |
318 | labels_dict = {} # Create labels dictionary
319 | for x in nodelist:
320 | if x in selected:
321 | labels_dict[x] = x+' (selected)'
322 | else:
323 | labels_dict[x] = x+' (removed)'
324 |
325 | nx.draw_networkx_labels(gf, pos=pos_higher,
326 | labels = labels_dict,
327 | font_color='black')
328 |
329 | plt.box(True)
330 | plt.title("""In SULOV, we repeatedly remove features with lower mutual info scores among highly correlated pairs (see figure),
331 | SULOV selects the feature with higher mutual info score related to target when choosing between a pair. """, fontsize=10)
332 | plt.suptitle('How SULOV Method Works by Removing Highly Correlated Features', fontsize=20,y=1.03)
333 | red_patch = mpatches.Patch(color='blue', label='Bigger circle denotes higher mutual info score with target')
334 | blue_patch = mpatches.Patch(color='lightblue', label='Thicker line denotes higher correlation between two variables')
335 | plt.legend(handles=[red_patch, blue_patch],loc='best')
336 | plt.show();
337 | ##### N E T W O R K D I A G R A M C O M P L E T E #################
338 | return final_list
339 | except Exception as e:
340 | print(f' Networkx library visualization crashing due to {e}')
341 | print(f'Completed SULOV. {len(final_list)} features selected')
342 | return final_list
343 | else:
344 | print(f'Completed SULOV. {len(final_list)} features selected')
345 | return final_list
346 | print(f'Completed SULOV. All {len(numvars)} features selected')
347 | return numvars
348 | ###################################################################################
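# -----------------------------------------------------------------------------
# Illustrative usage sketch (added for this write-up, not part of the module);
# the file name, column names and modeltype below are placeholders.
#
#   import pandas as pd
#   df = pd.read_csv('train.csv')        # numeric features plus a 'target' column
#   numvars = df.select_dtypes('number').columns.drop('target').tolist()
#   keep = FE_remove_variables_using_SULOV_method(df, numvars, 'Regression', 'target',
#                                                 corr_limit=0.70, verbose=1)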
--------------------------------------------------------------------------------
/images/MRMR.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/MRMR.png
--------------------------------------------------------------------------------
/images/SULOV.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/SULOV.jpg
--------------------------------------------------------------------------------
/images/feather_example.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/feather_example.jpg
--------------------------------------------------------------------------------
/images/feature_engg.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/feature_engg.png
--------------------------------------------------------------------------------
/images/feature_engg_old.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/feature_engg_old.jpg
--------------------------------------------------------------------------------
/images/featurewiz_background.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/featurewiz_background.jpg
--------------------------------------------------------------------------------
/images/featurewiz_logo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/featurewiz_logo.jpg
--------------------------------------------------------------------------------
/images/featurewiz_logos.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/featurewiz_logos.png
--------------------------------------------------------------------------------
/images/featurewiz_logos_old.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/featurewiz_logos_old.png
--------------------------------------------------------------------------------
/images/featurewiz_mrmr.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/featurewiz_mrmr.png
--------------------------------------------------------------------------------
/images/xgboost.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AutoViML/featurewiz/1fe1341a4272957a32d42f63a2446f5f8c4011c2/images/xgboost.jpg
--------------------------------------------------------------------------------
/old_README.md:
--------------------------------------------------------------------------------
1 | # featurewiz
2 | `featurewiz` is the best feature selection library for boosting your machine learning performance with minimal effort and maximum relevance using the famous MRMR algorithm.
3 |
4 | 
5 |
6 | # Table of Contents
7 |
23 |
24 | ## Latest
25 | `featurewiz` 5.0 version is out! It contains brand new Deep Learning Auto Encoders to enrich your data for the toughest imbalanced and multi-class datasets. If you are looking for the latest and greatest updates about our library, check out our updates page.
26 |
27 |
28 | ## Citation
29 | If you use featurewiz in your research project or paper, please use the following format for citations:
30 |
31 | "Seshadri, Ram (2020). GitHub - AutoViML/featurewiz: Use advanced feature engineering strategies and select the best features from your data set fast with a single line of code. source code: https://github.com/AutoViML/featurewiz "
32 |
33 | Current citations for featurewiz
34 |
35 | [Google Scholar](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C31&q=featurewiz&btnG=)
36 |
37 | ## Highlights
38 | `featurewiz` stands out as a versatile and powerful tool for feature selection and engineering, capable of significantly enhancing model performance through intelligent feature transformation and selection techniques. Its unique methods like SULOV and recursive XGBoost, combined with advanced feature engineering options, make it a valuable addition to any data scientist's toolkit:
39 | ### Best Feature Selection Algorithm
40 | - It provides one of the best automatic feature selection algorithms, the Minimum Redundancy Maximum Relevance (MRMR) algorithm, described by Wikipedia as follows: "The MRMR selection has been found to be more powerful than the maximum relevance feature selection" offered by methods such as Boruta.
41 |
42 | ### Advanced Feature Engineering Options
43 | featurewiz extends beyond traditional feature selection by including powerful feature engineering capabilities such as:
44 | - Auto Encoders, including Denoising Auto Encoders (DAEs) and Variational Auto Encoders (VAEs), for improved model performance, especially on imbalanced datasets.
45 | - A variety of category encoders like HashingEncoder, SumEncoder, PolynomialEncoder, BackwardDifferenceEncoder, OneHotEncoder, HelmertEncoder, OrdinalEncoder, and BaseNEncoder.
46 | - The ability to add interaction features (e.g., x1x2, x2x3, x1^2), group by features, and target encoding.
47 |
48 | ### SULOV Method for Feature Selection
49 | - SULOV stands for "Searching for Uncorrelated List Of Variables". It selects features that are uncorrelated with each other but have high correlation with the target variable, based on the Minimum Redundancy Maximum Relevance (mRMR) principle. This method effectively reduces redundancy in features while retaining those with high relevance to the target.
50 |
51 | ### Recursive XGBoost Method
52 | - After applying the SULOV method, featurewiz employs a recursive approach using XGBoost's feature importance. This process is repeated multiple times on subsets of data, combining and deduplicating selected features to identify the most impactful ones.
53 |
54 | ### Comprehensive Encoding and Transformation
55 | - featurewiz allows for extensive customization in how features are encoded and transformed, making it highly adaptable to various types of data.
56 | - The ability to combine multiple encoding and transformation methods enhances its flexibility and effectiveness in feature engineering.
57 |
58 | ### Used by PhDs and researchers, and actively maintained
59 | - featurewiz is used by researchers and PhD data scientists around the world: there are 64 citations for featurewiz since its release:
60 |
61 | [Google Scholar](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C31&q=featurewiz&btnG=)
62 | - It's efficient in handling large datasets, making it suitable for a wide range of applications from small to big data scenarios.
63 | - It is well-documented, and it comes with a number of examples.
64 | - It is actively maintained, and it is regularly updated with new features and bug fixes.
65 |
66 | ## Internals
67 | `featurewiz` has two major internal modules. They are explained below.
68 | ### 1. Feature Engineering module
69 | The first step is not absolutely necessary but it can be used to create new features that may or may not be helpful (be careful with automated feature engineering tools!).
70 | One of the gaps in open-source AutoML tools, and especially in Auto_ViML, has been the lack of the feature engineering capabilities that high-powered competitions such as Kaggle require. Creating "interaction" variables, adding "group-by" features, or "target-encoding" categorical variables was hard, and sifting through the resulting hundreds of new features to find the best ones was a task left only to "experts" or "professionals". featurewiz was created to help you in this endeavor.
71 | featurewiz now enables you to add hundreds of such features with a single line of code. Set the "feature_engg" flag to "interactions", "groupby" or "target" and featurewiz will select the best encoders for each of those options and create hundreds (perhaps thousands) of features in one go. Not only that: in the next step, featurewiz will sift through those variables and keep only the least correlated and most relevant features for your model. All in one step!
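As a quick illustration (using the new scikit-learn style syntax described under Usage below, and assuming `X_train` and `y_train` are your own pandas objects):

```
from featurewiz import FeatureWiz

# 'interactions' can be swapped for 'groupby' or 'target', or a list of them
fwiz = FeatureWiz(feature_engg='interactions', verbose=0)
X_train_selected, y_train_transformed = fwiz.fit_transform(X_train, y_train)
print(fwiz.features)   # the engineered + selected feature names
```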
72 |
73 | 
74 |
75 | ### 2. Feature Selection module
76 | The second step is Feature Selection. `featurewiz` uses the MRMR (Minimum Redundancy Maximum Relevance) algorithm as the basis for its feature selection.
77 | Why perform Feature Selection? Once you have created hundreds of new features, you still have three questions left to answer:
78 | 1. How do we interpret those newly created features?
79 | 2. Which of these features is important and which is useless? How many of them are highly correlated to each other causing redundancy?
80 | 3. Does the model overfit now on these new features and perform better or worse than before?
81 |
82 | All are very important questions and featurewiz answers them by using the SULOV method and Recursive XGBoost to reduce features in your dataset to the best "minimum optimal" features for the model.
83 | SULOV: SULOV stands for `Searching for Uncorrelated List of Variables`. The SULOV algorithm is based on the Minimum-Redundancy-Maximum-Relevance (MRMR) algorithm, which is widely described as one of the best feature selection methods. To understand how MRMR works and how it differs from `Boruta` and other feature selection methods, see the chart below: "Minimal Optimal" refers to MRMR (featurewiz), while "all-relevant" refers to Boruta.
84 |
85 | 
86 |
87 | ## Working
88 | `featurewiz` performs feature selection in 2 steps. Each step is explained below.
89 | The working of the `SULOV` algorithm is as follows:
90 |
91 | - Find all pairs of highly correlated variables exceeding a correlation threshold (say, an absolute correlation of 0.7).
92 | - Then find each variable's MIS (Mutual Information Score) with respect to the target variable. MIS is a non-parametric scoring method, so it is suitable for all kinds of variables and targets.
93 | - Now take each pair of correlated variables and knock off the one with the lower MIS score.
94 | - What's left are the variables with the highest mutual information scores and the least correlation with each other.
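Below is a minimal sketch of that procedure, purely for illustration (it is not featurewiz's internal code). It assumes an all-numeric `X` dataframe and a classification target `y`:

```
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def sulov_sketch(X: pd.DataFrame, y, corr_limit=0.70):
    """Keep the member of each highly correlated pair with the higher MIS score."""
    corr = X.corr().abs()
    mis = pd.Series(mutual_info_classif(X, y), index=X.columns)
    removed = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > corr_limit:
                # drop whichever member of the correlated pair has the lower MIS
                removed.add(a if mis[a] < mis[b] else b)
    return [c for c in cols if c not in removed]
```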
95 |
96 |
97 | 
98 |
99 | The working of Recursive XGBoost is as follows:
100 | Once SULOV has selected variables that have high mutual information scores and the least correlation among them, featurewiz uses XGBoost to repeatedly find the best features among those remaining variables.
101 |
102 | - Take all the variables remaining after SULOV and split the full data into train and valid sets.
103 | - Find the top X features (say, 10) on the train set, using the valid set for early stopping (to prevent over-fitting).
104 | - Then take the next set of variables and find the top X among them.
105 | - Do this 5 times. Combine all selected features and de-duplicate them.
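The following is a rough sketch of that loop for a classification problem, again only to illustrate the idea; the chunking, `top_x` and model parameters shown here are assumptions, not featurewiz's actual defaults:

```
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def recursive_xgboost_sketch(X, y, top_x=10, n_chunks=5):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    selected = []
    # walk through the columns in chunks, picking the top features of each chunk
    for chunk in np.array_split(np.array(X.columns), n_chunks):
        cols = list(chunk)
        model = XGBClassifier(n_estimators=100, eval_metric="logloss",
                              early_stopping_rounds=10, verbosity=0)
        model.fit(X_tr[cols], y_tr, eval_set=[(X_val[cols], y_val)], verbose=False)
        importances = pd.Series(model.feature_importances_, index=cols)
        selected.extend(importances.sort_values(ascending=False).head(top_x).index)
    return list(dict.fromkeys(selected))   # combine and de-duplicate, keeping order
```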
106 |
107 |
108 | 
109 |
110 | ## Tips
111 | Here are some additional tips for ML engineers and data scientists when using featurewiz:
112 |
113 | - How to cross-validate your results: when you use featurewiz, it automatically performs multiple rounds of feature selection using permutations of columns. However, you can also perform feature selection over permutations of rows, as in the [cross_validate example][page]; a rough sketch is shown after this list.
114 | - Use multiple feature selection tools: it is a good idea to use multiple feature selection tools and compare the results. This will help you get a better understanding of which features are most important for your data.
115 | - Don't forget to engineer new features: Feature selection is only one part of the process of building a good machine learning model. You should also spend time engineering your features to make them as informative as possible. This can involve things like creating new features, transforming existing features, and removing irrelevant features.
116 | - Don't overfit your model: It is important to avoid overfitting your model to the training data. Overfitting occurs when your model learns the noise in the training data, rather than the underlying signal. To avoid overfitting, you can use regularization techniques, such as lasso or elasticnet.
117 | - Start with a small number of features: When you are first starting out, it is a good idea to start with a small number of features. This will help you to avoid overfitting your model. As you become more experienced, you can experiment with adding more features.
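Here is a rough sketch of the row-permutation idea from the first tip above. It is an illustration only (not the `examples/cross_validate.py` script itself) and assumes `X_train` and `y_train` are your pandas training data:

```
from sklearn.model_selection import KFold
from featurewiz import FeatureWiz

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, _) in enumerate(kfold.split(X_train)):
    # rerun feature selection on each row-level fold and compare what gets picked
    fwiz = FeatureWiz(verbose=0)
    fwiz.fit_transform(X_train.iloc[train_idx], y_train.iloc[train_idx])
    print(f"Fold {fold}: {fwiz.features}")
```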
118 |
119 |
120 | ## Install
121 |
122 | **Prerequisites:**
123 |
124 | - featurewiz is built using xgboost, dask, numpy, pandas and matplotlib. It should run on most Python 3 Anaconda installations. You won't have to install any special libraries other than "dask", "xgboost" and "networkx". Optionally, it uses LightGBM for fast modeling, which it installs automatically.
125 | - We use the "networkx" library for charts and interpretability. But if you don't have these libraries, featurewiz will install them for you automatically.
126 |
127 | To install from source:
128 |
129 | ```
130 | cd <any_directory>
131 | git clone git@github.com:AutoViML/featurewiz.git
132 | # or download and unzip https://github.com/AutoViML/featurewiz/archive/master.zip
133 | conda create -n <env_name> python=3.7 anaconda
134 | conda activate <env_name>  # on older conda: `source activate <env_name>` (Linux/macOS) or `activate <env_name>` (Windows)
135 | cd featurewiz
136 | pip install -r requirements.txt
137 | ```
138 |
139 | ## Good News: You can install featurewiz on Colab and Kaggle easily in 2 steps!
140 | As of June 2022, thanks to [arturdaraujo](https://github.com/arturdaraujo), featurewiz is now available on conda-forge. You can try:
141 |
142 | ```
143 | conda install -c conda-forge featurewiz
144 | ```
145 |
146 | ### If the above conda install fails, you can try installing featurewiz this way:
147 | #### Install featurewiz using git+
148 |
149 | ```
150 | !pip install git+https://github.com/AutoViML/featurewiz.git
151 | ```
152 |
153 | ## Usage
154 |
155 | There are two ways to use featurewiz.
156 |
157 | - The first way is the new way, where you use scikit-learn's `fit` and `transform` syntax. It also uses the `lazytransform` library (which I created) to transform datetime, NLP and categorical variables into numeric variables automatically. We recommend this as the main syntax for all your future needs.
158 |
159 | ```
160 | from featurewiz import FeatureWiz
161 | fwiz = FeatureWiz(feature_engg = '', nrows=None, transform_target=True, scalers="std",
162 | category_encoders="auto", add_missing=False, verbose=0)
163 | X_train_selected, y_train = fwiz.fit_transform(X_train, y_train)
164 | X_test_selected = fwiz.transform(X_test)
165 | ### get list of selected features ###
166 | fwiz.features
167 | ```
168 |
169 | - The second way is the old way, which was the original syntax of featurewiz. It is still used by thousands of researchers in the field, so it will continue to be maintained; however, it may be discontinued at any time without notice. You can use it if you prefer that style.
170 |
171 | ```
172 | import featurewiz as fwiz
173 | outputs = fwiz.featurewiz(dataname=train, target=target, corr_limit=0.70, verbose=2, sep=',',
174 | header=0, test_data='',feature_engg='', category_encoders='',
175 | dask_xgboost_flag=False, nrows=None, skip_sulov=False, skip_xgboost=False)
176 | ```
177 |
178 | `outputs` is a tuple: there will always be two objects in the output, but what they are depends on your input:
179 | - If you sent in only train data, the tuple is `features` and `trainm`: `features` is the list of selected features and `trainm` is the transformed train dataframe.
180 | - If you sent in both train and test data, the tuple is `trainm` and `testm`: the two transformed dataframes, restricted to the selected features.
181 |
182 | In both cases, the features and dataframes are ready for you to do further modeling.
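For instance, using the old syntax shown above and assuming the remaining arguments keep their documented defaults (with `train`, `test` and `target` being your own objects), you would unpack the tuple like this:

```
# Case 1: you passed only train data
features, trainm = fwiz.featurewiz(dataname=train, target=target)

# Case 2: you passed both train and test data
trainm, testm = fwiz.featurewiz(dataname=train, target=target, test_data=test)
```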
183 |
184 | Featurewiz works on any multi-class or multi-label dataset, so you can have as many target labels as you want.
185 | You don't have to tell featurewiz whether it is a regression or classification problem; it will decide that automatically.
186 |
187 | ## API
188 |
189 | **Input Arguments for NEW syntax**
190 |
191 | Parameters
192 | ----------
193 | corr_limit : float, default=0.90
194 | The correlation limit to consider for feature selection. Features with correlations
195 | above this limit may be excluded.
196 |
197 | verbose : int, default=0
198 | Level of verbosity in output messages.
199 |
200 | feature_engg : str or list, default=''
201 | Specifies the feature engineering methods to apply, such as 'interactions', 'groupby',
202 | and 'target'.
203 |
204 | category_encoders : str or list, default=''
205 | Encoders for handling categorical variables. Supported encoders include 'onehot',
206 | 'ordinal', 'hashing', 'count', 'catboost', 'target', 'glm', 'sum', 'woe', 'bdc',
207 | 'loo', 'base', 'james', 'helmert', 'label', 'auto', etc.
208 |
209 | add_missing : bool, default=False
210 | If True, adds indicators for missing values in the dataset.
211 |
212 | dask_xgboost_flag : bool, default=False
213 | If set to True, enables the use of Dask for parallel computing with XGBoost.
214 |
215 | nrows : int or None, default=None
216 | Limits the number of rows to process.
217 |
218 | skip_sulov : bool, default=False
219 |     If True, skips the application of the SULOV (Searching for Uncorrelated List Of
220 |     Variables) method in feature selection.
221 |
222 | skip_xgboost : bool, default=False
223 | If True, bypasses the recursive XGBoost feature selection.
224 |
225 | transform_target : bool, default=False
226 | When True, transforms the target variable(s) into numeric format if they are not
227 | already.
228 |
229 | scalers : str or None, default=None
230 | Specifies the scaler to use for feature scaling. Available options include
231 | 'std', 'standard', 'minmax', 'max', 'robust', 'maxabs'.
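Putting several of these arguments together, a typical call with the new syntax might look like this (the values shown are only illustrative):

```
from featurewiz import FeatureWiz

fwiz = FeatureWiz(corr_limit=0.90, feature_engg='', category_encoders='auto',
                  add_missing=False, dask_xgboost_flag=False, nrows=None,
                  skip_sulov=False, skip_xgboost=False,
                  transform_target=True, scalers='std', verbose=1)
X_train_selected, y_train = fwiz.fit_transform(X_train, y_train)
X_test_selected = fwiz.transform(X_test)
```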
232 |
233 | **Input Arguments for old syntax**
234 |
235 | - `dataname`: could be a datapath+filename or a dataframe. It will detect whether your input is a filename or a dataframe and load it automatically.
236 | - `target`: name of the target variable in the data set.
237 | - `corr_limit`: if you want to set your own threshold for removing highly correlated variables, give it here. The default is 0.9, which means variables with a Pearson correlation below -0.9 or above 0.9 will be candidates for removal.
238 | - `verbose`: This has 3 possible states:
239 | - `0` - limited output. Great for running this silently and getting fast results.
240 |     - `1` - verbose. Great for seeing how the results were obtained and for deciding which input flags to change.
241 |     - `2` - more charts, such as the SULOV plot, and more output. Great for finding out what happens under the hood of the SULOV method.
242 | - `test_data`: This is only applicable to the old syntax if you want to transform both train and test data at the same time in the same way. `test_data` could be the name of a datapath+filename or a dataframe. featurewiz will detect whether your input is a filename or a dataframe and load it automatically. Default is empty string.
243 | - `dask_xgboost_flag`: default False. If you want to use dask with your data, then set this to True.
244 | - `feature_engg`: You can let featurewiz select its best encoders for your data set by setting this flag
245 | for adding feature engineering. There are three choices. You can choose one, two, or all three.
246 | - `interactions`: This will add interaction features to your data such as x1*x2, x2*x3, x1**2, x2**2, etc.
247 |     - `groupby`: This will generate group-by features for your numeric variables by grouping on all categorical variables.
248 | - `target`: This will encode and transform all your categorical features using certain target encoders.
249 | Default is empty string (which means no additional features)
250 | - `add_missing`: default is False. This is a new flag: the `add_missing` flag will add a new column for missing values for all your variables in your dataset. This will help you catch missing values as an added signal.
251 | - `category_encoders`: default is "auto". Instead, you can choose your own category encoders from the list below.
252 | We recommend you do not use more than two of these; featurewiz will automatically select only two if you have more than two in your list. You can set "auto" for our own choice, or the empty string "" (which means no encoding of your categorical features). These descriptions are derived from the excellent category_encoders Python library. Please check it out!
253 | - `HashingEncoder`: HashingEncoder is a multivariate hashing implementation with configurable dimensionality/precision. The advantage of this encoder is that it does not maintain a dictionary of observed categories. Consequently, the encoder does not grow in size and accepts new values during data scoring by design.
254 | - `SumEncoder`: SumEncoder is a Sum contrast coding for the encoding of categorical features.
255 | - `PolynomialEncoder`: PolynomialEncoder is a Polynomial contrast coding for the encoding of categorical features.
256 | - `BackwardDifferenceEncoder`: BackwardDifferenceEncoder is a Backward difference contrast coding for encoding categorical variables.
257 | - `OneHotEncoder`: OneHotEncoder is the traditional Onehot (or dummy) coding for categorical features. It produces one feature per category, each being a binary.
258 | - `HelmertEncoder`: HelmertEncoder uses the Helmert contrast coding for encoding categorical features.
259 |     - `OrdinalEncoder`: OrdinalEncoder uses ordinal encoding to designate a single column of integers to represent the categories in your data. The integers are assigned in the order in which the categories are found in your dataset. If you want to change the order, just sort the column and send it in for encoding.
260 | - `FrequencyEncoder`: FrequencyEncoder is a count encoding technique for categorical features. For a given categorical feature, it replaces the names of the categories with the group counts of each category.
261 | - `BaseNEncoder`: BaseNEncoder encodes the categories into arrays of their base-N representation. A base of 1 is equivalent to one-hot encoding (not really base-1, but useful), and a base of 2 is equivalent to binary encoding. N=number of actual categories is equivalent to vanilla ordinal encoding.
262 | - `TargetEncoder`: TargetEncoder performs Target encoding for categorical features. It supports the following kinds of targets: binary and continuous. For multi-class targets, it uses a PolynomialWrapper.
263 | - `CatBoostEncoder`: CatBoostEncoder performs CatBoost coding for categorical features. It supports the following kinds of targets: binary and continuous. For polynomial target support, it uses a PolynomialWrapper. This is very similar to leave-one-out encoding, but calculates the values “on-the-fly”. Consequently, the values naturally vary during the training phase and it is not necessary to add random noise.
264 | - `WOEEncoder`: WOEEncoder uses the Weight of Evidence technique for categorical features. It supports only one kind of target: binary. For polynomial target support, it uses a PolynomialWrapper. It cannot be used for Regression.
265 | - `JamesSteinEncoder`: JamesSteinEncoder uses the James-Stein estimator. It supports 2 kinds of targets: binary and continuous. For polynomial target support, it uses PolynomialWrapper.
266 | For feature value i, James-Stein estimator returns a weighted average of:
267 |         1. The mean target value for the observed feature value i.
268 |         2. The mean target value (regardless of the feature value).
269 | - `nrows`: default `None`. You can set the number of rows to read from your datafile if it is too large to fit into either dask or pandas. But you won't have to if you use dask.
270 | - `skip_sulov`: default `False`. You can set the flag to skip the SULOV method if you want.
271 | - `skip_xgboost`: default `False`. You can set the flag to skip the Recursive XGBoost method if you want.
272 |
273 | **Output values for old syntax** This applies only to the old syntax.
274 | - `outputs`: Output is always a tuple. We can call our outputs in that tuple as `out1` and `out2` below.
275 | - `out1` and `out2`: If you sent in just one dataframe or filename as input, you will get:
276 | - 1. `features`: It will be a list (of selected features) and
277 | - 2. `trainm`: It will be a dataframe (if you sent in a file or dataname as input)
278 | - `out1` and `out2`: If you sent in two files or dataframes (train and test), you will get:
279 | - 1. `trainm`: a modified train dataframe with engineered and selected features from dataname and
280 | - 2. `testm`: a modified test dataframe with engineered and selected features from test_data.
281 |
282 | ## Additional
283 | To learn more about how featurewiz works under the hood, watch this [video](https://www.youtube.com/embed/ZiNutwPcAU0)
284 |
285 | 
286 |
287 | featurewiz was designed for selecting high-performance variables with the fewest steps.
288 | In most cases, featurewiz builds models with 20%-99% fewer features than your original data set, with nearly the same or only slightly lower performance (this is based on my trials; your experience may vary).
289 |
290 | featurewiz is every Data Scientist's feature wizard that will:
291 | - Automatically pre-process data: you can send in your entire dataframe "as is" and featurewiz will classify and label-encode categorical variables to help XGBoost processing. It automatically classifies variables as numeric, categorical, NLP or date-time variables so it can use them correctly in modeling.
292 | - Perform feature engineering automatically: creating "interaction" variables, adding "group-by" features or "target-encoding" categorical variables is difficult, and sifting through those hundreds of new features is painstaking and usually left only to "experts". With featurewiz you can create hundreds or even thousands of new features with the click of a mouse. This is very helpful when you have a small number of features to start with. However, be careful: you can very easily create a monster with this option.
293 | - Perform feature reduction automatically. When you have small data sets and you know your domain well, it is easy to do EDA and identify which variables are important. But when you have a very large data set with hundreds if not thousands of variables, selecting the best features can mean the difference between a bloated, highly complex model and a simple model with the fewest and most information-rich features. featurewiz uses XGBoost repeatedly to perform feature selection. You must try it on your large data sets and compare!
294 | - Explain the SULOV method graphically using the networkx library, so you can see which variables are highly correlated with which others, and which of those have high or low mutual information scores. Just set verbose = 2 to see the graph.
295 | - Build a fast XGBoost or LightGBM model using the features selected by featurewiz. There is a function called `simple_lightgbm_model` which you can use to build a fast model. It is a new module, so check it out.
296 |
297 |
298 | *** Special thanks to fellow open source Contributors ***:
299 |
300 | - Alex Lekov (https://github.com/Alex-Lekov/AutoML_Alex/tree/master/automl_alex) for his DataBunch and encoders modules which are used by the tool (although with some modifications).
301 | - Category Encoders library in Python : This is an amazing library. Make sure you read all about the encoders that featurewiz uses here: https://contrib.scikit-learn.org/category_encoders/index.html
302 |
303 |
304 | ## Maintainers
305 |
306 | * [@AutoViML](https://github.com/AutoViML)
307 |
308 | ## Contributing
309 |
310 | See [the contributing file](CONTRIBUTING.md)!
311 |
312 | PRs accepted.
313 |
314 | ## License
315 |
316 | Apache License 2.0 © 2020 Ram Seshadri
317 |
318 | ## DISCLAIMER
319 | This project is not an official Google project. It is not supported by Google and Google specifically disclaims all warranties as to its quality, merchantability, or fitness for a particular purpose.
320 |
321 |
322 | [page]: examples/cross_validate.py
323 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy<2.0
2 | ipython
3 | jupyter
4 | xgboost>=1.6.2,<=1.7.6
5 | pandas>=2.0
6 | matplotlib
7 | seaborn
8 | scipy
9 | scikit-learn>=1.2.2,<=1.5.2
10 | networkx
11 | category_encoders==2.6.3
12 | xlrd>=2.0.0
13 | dask>=2021.11.0
14 | lightgbm>=3.2.1
15 | distributed>=2021.11.0
16 | feather-format>=0.4.1
17 | pyarrow>=7.0.0
18 | fsspec>=0.3.3
19 | Pillow>=9.0.0
20 | tqdm>=4.61.1
21 | numexpr>=2.7.3
22 | tensorflow>=2.5.2
23 | lazytransform>=1.17
24 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 |
3 | import setuptools
4 |
5 | with open("README.md", "r", encoding="utf-8") as fh:
6 | long_description = fh.read()
7 |
8 | setuptools.setup(
9 | name="featurewiz",
10 | version="0.6.1",
11 | author="Ram Seshadri",
12 | author_email="rsesha2001@yahoo.com",
13 | description="Select Best Features from your data set - any size - now with XGBoost!",
14 | long_description=long_description,
15 | long_description_content_type="text/markdown",
16 | license='Apache License 2.0',
17 | url="https://github.com/AutoViML/featurewiz",
18 | packages=setuptools.find_packages(exclude=("tests",)),
19 | install_requires=[
20 | "numpy<2.0",
21 | "ipython",
22 | "jupyter",
23 | "xgboost>=1.6.2,<=1.7.6",
24 | "pandas>=2.0",
25 | "matplotlib",
26 | "seaborn",
27 | "scipy",
28 | "scikit-learn>=1.2.2,<=1.5.2",
29 | "networkx",
30 | "category_encoders==2.6.3",
31 | "xlrd>=2.0.0",
32 | "dask>=2021.11.0",
33 | "lightgbm>=3.2.1",
34 | "distributed>=2021.11.0",
35 | "feather-format>=0.4.1",
36 | "pyarrow>=7.0.0",
37 | "fsspec>=0.3.3",
38 | "Pillow>=9.0.0",
39 | "tqdm>=4.61.1",
40 | "numexpr>=2.7.3",
41 | "tensorflow>=2.5.2",
42 | "lazytransform>=1.17",
43 | ],
44 | classifiers=[
45 | "Programming Language :: Python :: 3",
46 | "Operating System :: OS Independent",
47 | ],
48 | )
49 |
--------------------------------------------------------------------------------
/updates.md:
--------------------------------------------------------------------------------
1 | # featurewiz latest updates page
2 | This is the main page where we will post the latest updates to the featurewiz library. Make sure you bookmark this page and upgrade your featurewiz library before you run it. There are new updates almost every week to featurewiz!
3 |
4 | ### Update (Jan 2024): Introducing BlaggingClassifier, built to handle imbalanced distributions!
5 | Featurewiz now introduces the BlaggingClassifier, one of the best classifiers ever, originally authored by Gilles Louppe (BSD 3-clause license) with adaptations by Tom E. Fawcett. This is an amazing classifier that everyone must try for their toughest imbalanced and multi-class problems. Don't just take our word for it, try it out yourself and see!
6 |
7 | ## Jan 2024 update
8 | `featurewiz` 5.0 version is out! It contains brand new Deep Learning Auto Encoders to enrich your data for the toughest imbalanced and multi-class datasets. In addition, it has multiple brand-new Classifiers built for imbalanced and multi-class problems such as the `IterativeDoubleClassifier` and the `BlaggingClassifier`. If you are looking for the latest and greatest updates about our library, check out our updates page.
9 |
10 |
11 | 
12 |
13 | #### Update (December 2023): FeatureWiz 0.5 is here! Includes powerful deep learning autoencoders!
14 | The FeatureWiz transformer now includes powerful deep learning auto encoders via the new `auto_encoders` argument. They transform your features into a lower-dimensional space while capturing important patterns inherent in your dataset. This is known as "feature extraction" and is a powerful tool for tackling very difficult classification problems. You can set the `auto_encoders` option to `VAE` for a Variational Auto Encoder, `DAE` for a Denoising Auto Encoder, `CNN` for CNNs, and `GAN` for GAN-based data augmentation that generates synthetic data. These options completely replace your existing features. What if you want to add them as additional features instead? You can do that by setting the `auto_encoders` option to `VAE_ADD`, `DAE_ADD`, or `CNN_ADD`, and featurewiz will automatically add these features to your existing dataset. In addition, it will perform feature selection among your old and new features. Isn't that awesome? I have uploaded a sample notebook for you to test and improve your classifier performance using these options on imbalanced datasets. Please send me comments via the email displayed on my GitHub main page.
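As a quick, illustrative sketch of the `auto_encoders` argument (assuming `X_train` and `y_train` are your own pandas objects):

```
from featurewiz import FeatureWiz

# 'DAE' replaces your features; 'DAE_ADD' keeps them and adds the encoded ones
fwiz = FeatureWiz(auto_encoders='DAE', verbose=0)
X_train_enriched, y_train = fwiz.fit_transform(X_train, y_train)
X_test_enriched = fwiz.transform(X_test)
```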
15 |
16 | 
17 |
18 | #### Update (November 2023): The FeatureWiz transformer (version 0.4.3 on) includes an "add_missing" flag
19 | The FeatureWiz transformer now includes an `add_missing` flag which will add a new column for missing values for all your variables in your dataset. This will help you catch missing values as an added signal when you use FeatureWiz library. Try it out and let us know in your comments via email.
20 |
21 |
22 | #### Update (October 2023): FeatureWiz transformer (version 0.4.0 on) now has lazytransformer library
23 | The new FeatureWiz transformer includes a categorical encoder + date-time + NLP transformer that transforms all your string, text and date-time columns into numeric variables in one step. You will see a fully transformed `(all-numeric)` dataset when you use the FeatureWiz transformer. Try it out and let us know in your comments via email.
24 |
25 | #### Update (June 2023): featurewiz now has skip_sulov and skip_xgboost flags
26 |
27 | There are two flags that are available to skip the recursive xgboost and/or SULOV methods. They are the `skip_xgboost` and `skip_sulov` flags. They are by default set to `False`. But you can change them to `True` if you want to skip them.
28 |
29 | #### Update (May 2023): featurewiz 3.0 is here with better accuracy and speed
30 |
31 | The latest version of featurewiz is here! The new 3.0 version of featurewiz provides slightly better performance by about 1-2% in diverse datasets (your experience may vary). Install it and check it out!
32 |
33 |
34 | #### Update (March 2023): XGBoost 1.7 and higher versions have issues with featurewiz
35 |
36 | The latest version of XGBoost 1.7+ does not work with featurewiz. They have made massive changes to their API. So please switch to xgboost 1.5 if you want to run featurewiz.
37 |
38 | #### Update (October 2022): FeatureWiz 2.0 is here.
39 | featurewiz 2.0 is here. You have two small performance improvements:
40 | 1. The SULOV method now has a higher default correlation limit of 0.90. This means fewer variables are removed and hence more variables are selected. You can always set it back to the old limit with `corr_limit=0.70` if you want.
41 |
42 | 2. Recursive XGBoost algorithm is tighter in that it selects fewer features in each iteration. To see how many it selects, set `verbose` flag to 1.
43 | The net effect is that the same number of features are selected but they are better at producing more accurate models. Try it out and let us know.
44 |
45 | #### Update (September 2022): You can now skip SULOV method using skip_sulov flag
46 | featurewiz now has a new input: `skip_sulov` flag is here. You can set it to `True` to skip the SULOV method if needed.
47 |
48 | #### Update (August 2022): Silent mode with verbose=0
49 | featurewiz now has a "silent" mode which you can set using the "verbose=0" option. It will run silently with no charts or graphs and very minimal verbose output. Hope this helps!
50 |
51 | #### Update (May 2022): New high performance modules based on XGBoost and LightGBM
52 | featurewiz as of version 0.1.50 or higher has multiple high performance models that you can use to build highly performant models once you have completed feature selection. These models are based on LightGBM and XGBoost and have even Stacking and Blending ensembles. You can find them as functions starting with "simple_" and "complex_" under featurewiz. All the best!
53 |
54 | #### Update (March 2022): Ability to read feather format files
55 | featurewiz as of version 0.1.04 or higher can read `feather-format` files at blazing speeds. See the example below on how to convert your CSV files to feather. Then you can feed those '.ftr' files to featurewiz and it will read them 10-100X faster!
56 |
57 |
58 | 
59 |
60 | featurewiz now runs at blazing speeds thanks to using GPUs by default. So if you are running a large data set on Colab and/or Kaggle, make sure you turn on the GPU kernel. featurewiz will automatically detect that a GPU is turned on and will run XGBoost with the `gpu_hist` tree method. That will ensure it crunches your datasets even faster. I have tested it with a very large data set and it reduced the running time from 52 minutes to 1 minute! That's a 98% reduction in running time using GPU compared to CPU!
61 |
62 | ### Update (Jan 2022): FeatureWiz is now a sklearn-compatible transformer that you can use in data pipelines
63 | FeatureWiz as of version 0.0.90 or higher is a scikit-learn compatible feature selection transformer. You can perform fit and transform as follows. You will get a transformer that selects the top variables from your dataset. You can also use it in sklearn pipelines as a transformer.
64 |
65 | ```
66 | from featurewiz import FeatureWiz
67 | features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='',
68 | dask_xgboost_flag=False, nrows=None, verbose=2)
69 | X_train_selected, y_train = features.fit_transform(X_train, y_train)
70 | X_test_selected = features.transform(X_test)
71 | features.features ### provides the list of selected features ###
72 | ```
73 |
74 | ### Featurewiz is now upgraded with XGBOOST 1.5.1 for DASK for blazing fast performance even for very large data sets! Set `dask_xgboost_flag = True` to run dask + xgboost.
75 | featurewiz now runs with a default setting of `nrows=None`. This means it will run using all rows. But if you want it to run faster, then you can change `nrows` to 1000 or whatever, so it will sample that many rows and run.
76 |
77 | ### Featurewiz has lots of new fast model training functions that you can use to train highly performant models with the features selected by featurewiz. They are:
78 | 1. simple_LightGBM_model() - simple regression and classification with one target label.
79 | 2. simple_XGBoost_model() - simple regression and classification with one target label.
80 | 3. complex_LightGBM_model() - more complex multi-label and multi-class models.
81 | 4. complex_XGBoost_model() - more complex multi-label and multi-class models.
82 | 5. Stacking_Classifier(): Stacking model that can handle multi-label, multi-class problems.
83 | 6. Stacking_Regressor(): Stacking model that can handle multi-label, regression problems.
84 | 7. Blending_Regressor(): Blending model that can handle multi-label, regression problems.
85 |
86 |
--------------------------------------------------------------------------------