├── .github ├── ISSUE_TEMPLATE │ ├── bug_report.md │ ├── feature_request.md │ └── pull_request.md └── workflows │ └── build.yml ├── .gitignore ├── .pylintrc ├── AUTHORS.rst ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── MANIFEST.in ├── README.md ├── autoimpute ├── __init__.py ├── __version__.py ├── analysis │ ├── __init__.py │ ├── base_regressor.py │ ├── linear_regressor.py │ ├── logistic_regressor.py │ └── metrics.py ├── imputations │ ├── __init__.py │ ├── dataframe │ │ ├── __init__.py │ │ ├── base_imputer.py │ │ ├── mice_imputer.py │ │ ├── multiple_imputer.py │ │ └── single_imputer.py │ ├── deletion.py │ ├── errors.py │ ├── helpers.py │ ├── method_names.py │ ├── mis_classifier.py │ └── series │ │ ├── __init__.py │ │ ├── base.py │ │ ├── bayesian_regression.py │ │ ├── categorical.py │ │ ├── default.py │ │ ├── ffill.py │ │ ├── interpolation.py │ │ ├── linear_regression.py │ │ ├── logistic_regression.py │ │ ├── lrd.py │ │ ├── mean.py │ │ ├── median.py │ │ ├── mode.py │ │ ├── norm.py │ │ ├── norm_unit_variance.py │ │ ├── pmm.py │ │ └── random.py ├── utils │ ├── __init__.py │ ├── checks.py │ ├── dataframes.py │ ├── helpers.py │ └── patterns.py └── visuals │ ├── __init__.py │ ├── helpers.py │ ├── imputations.py │ └── utils.py ├── docs ├── Makefile └── source │ ├── conf.py │ ├── index.rst │ └── user_guide │ ├── analysis.rst │ ├── getting_started.rst │ ├── imputers.rst │ ├── missingness.rst │ ├── strategies.rst │ ├── utils.rst │ └── visuals.rst ├── pytest.ini ├── requirements.readthedocs.txt ├── requirements.txt ├── setup.cfg ├── setup.py └── tests ├── __init__.py ├── test_imputations ├── __init__.py ├── test_mice_imputer.py ├── test_missingness_classifier.py └── test_single_imputer.py ├── test_utils ├── __init__.py ├── test_checks.py └── test_patterns.py └── test_visuals └── __init__.py /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | Name: **Bug Report** 2 | About: **Create a bug report to help us fix what's broken in Autoimpute!** 3 | Labels: **bug** 4 | 5 | **Before Reporting** 6 | - Make sure you're trying to impute a `pandas DataFrame`. Other data structures not yet supported. 7 | - Make sure the `DataFrame` doesn't have any columns fully missing. Remove them before imputing. 8 | - Check active and resolved issues to ensure that someone has not already come across the same issue. 9 | 10 | **Describe the bug** 11 | - A clear and concise description of what the bug is. 12 | 13 | **To Reproduce** 14 | Discuss steps to reproduce the behavior. Minimum relevant information includes: 15 | 1. Operating system & Python version 16 | 2. Method or class used 17 | 3. Description of the data (# features, feature data types, etc.) 18 | 4. Arguments passed to method or class in question 19 | 5. Name of the error 20 | 21 | **Expected behavior** 22 | - A clear and concise description of what you expected to happen. 23 | 24 | **Code Samples & Screenshots** 25 | - If possible, provide the actual code that generated the bug 26 | - If applicable, add screenshots to help explain your problem 27 | - Screenshots of the traceback are especially helpful 28 | - Information about the DataFrame used is very helpful as well. 
Consider sharing `df.info()` 29 | 30 | **Additional Information to Provide:** 31 | - Operating System 32 | - Python Version 33 | - Anaconda Distribution & Package Versions -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | Name: **Feature Request** 2 | About: **Suggest an idea for what would make Autoimpute even better!** 3 | Labels: **enhancement** 4 | 5 | **Is your feature request related to a problem? Please describe.** 6 | - A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 7 | 8 | **Is your feature request something completely new? Please describe.** 9 | - A clear and concise description of what the new feature is. 10 | - This description should include where it fits within `autoimpute` 11 | - (i.e. an imputation method, a visualization, a utility, etc.). 12 | - In addition, describe why the new feature is necessary. 13 | - What does it bring to the package? 14 | - What will it improve or add? 15 | 16 | **Describe the solution you'd like** 17 | - A clear and concise description of what you want to happen. 18 | 19 | **Describe alternatives you've considered** 20 | - A clear and concise description of any alternative solutions or features you've considered. 21 | 22 | **Additional context** 23 | - Add any other context or screenshots about the feature request here. 24 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/pull_request.md: -------------------------------------------------------------------------------- 1 | Name: **Pull Request** 2 | About: **Request to merge your code into Autoimpute!** 3 | 4 | To make sure the pull request gets approved, you must follow the steps below: 5 | 6 | 1. Raise a bug report, create an issue, or request a new feature. Follow the templates for those to get started. 7 | 2. Once the authors review, develop the agreed-upon solution. Do so by creating a feature branch. 8 | 3. When finished coding, ensure that you've written a unit test. 9 | - Place the test in the tests/ directory. Choose the appropriate file, or ask us if you're unsure where to test. 10 | - Note that we use pytest, so you must prefix your function name with test_ to ensure it will run. 11 | 4. Run the tests, ensuring that your test is successful and no other tests have broken. 12 | 5. Issue a pull request to merge your branch into the **dev branch**. 13 | - We use GitHub Actions for CI, so pull requests to the dev branch run all appropriate integration testing. 14 | - If your tests succeed, the authors will rebase and merge your pull request. 15 | 6. 
Once merged to dev, we'll take care of merging to master (via a version bump or hotfix, depending on what you merged). 16 | -------------------------------------------------------------------------------- /.github/workflows/build.yml: -------------------------------------------------------------------------------- 1 | name: Tests for autoimpute 2 | on: 3 | push: 4 | branches: 5 | - dev 6 | - master 7 | 8 | env: 9 | BRANCH: ${{ github.ref_name }} 10 | 11 | jobs: 12 | 13 | pybuild: 14 | name: Python setup and Unit Tests 15 | runs-on: ubuntu-latest 16 | steps: 17 | 18 | # step 1 - echo branch 19 | - name: Print vars 20 | run: | 21 | echo "Branch: $BRANCH" 22 | 23 | # step 2 - checkout repo 24 | - name: Checkout 25 | uses: actions/checkout@v2 26 | 27 | # step 3 - set up python 28 | - name: Set up Python 29 | uses: actions/setup-python@v1 30 | with: 31 | python-version: 3.8 32 | 33 | # step 4 - check cache & install dependencies 34 | - name: Check cache 35 | uses: actions/cache@v1 36 | id: cache 37 | with: 38 | path: ~/.cache/pip 39 | key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }} 40 | restore-keys: | 41 | ${{ runner.os }}-pip- 42 | 43 | - name: Install dependencies 44 | run: | 45 | pip install -r requirements.txt 46 | 47 | # step 5 - run unit tests 48 | - name: Test with pytest 49 | run: | 50 | pytest -v 51 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | *.swp 6 | .gitignore 7 | 8 | # C extensions 9 | *.so 10 | 11 | # vscode 12 | .vscode/ 13 | 14 | # Distribution / packaging 15 | .Python 16 | build/ 17 | develop-eggs/ 18 | dist/ 19 | downloads/ 20 | eggs/ 21 | .eggs/ 22 | lib/ 23 | lib64/ 24 | parts/ 25 | sdist/ 26 | var/ 27 | wheels/ 28 | *.egg-info/ 29 | .installed.cfg 30 | *.egg 31 | MANIFEST 32 | 33 | # Todo lists 34 | .todo.txt 35 | 36 | # PyInstaller 37 | # Usually these files are written by a python script from a template 38 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
39 | *.manifest 40 | *.spec 41 | 42 | # Installer logs 43 | pip-log.txt 44 | pip-delete-this-directory.txt 45 | 46 | # Unit test / coverage reports 47 | htmlcov/ 48 | .tox/ 49 | .coverage 50 | .coverage.* 51 | .cache 52 | nosetests.xml 53 | coverage.xml 54 | *.cover 55 | .hypothesis/ 56 | .pytest_cache/ 57 | 58 | # Translations 59 | *.mo 60 | *.pot 61 | 62 | # Django stuff: 63 | *.log 64 | local_settings.py 65 | db.sqlite3 66 | 67 | # Flask stuff: 68 | instance/ 69 | .webassets-cache 70 | 71 | # Scrapy stuff: 72 | .scrapy 73 | 74 | # Sphinx documentation 75 | docs/_build/ 76 | 77 | # PyBuilder 78 | target/ 79 | 80 | # Jupyter Notebook 81 | .ipynb_checkpoints 82 | 83 | # pyenv 84 | .python-version 85 | 86 | # celery beat schedule file 87 | celerybeat-schedule 88 | 89 | # SageMath parsed files 90 | *.sage.py 91 | 92 | # Environments 93 | .env 94 | .venv 95 | env/ 96 | venv/ 97 | ENV/ 98 | env.bak/ 99 | venv.bak/ 100 | 101 | # Spyder project settings 102 | .spyderproject 103 | .spyproject 104 | 105 | # Rope project settings 106 | .ropeproject 107 | 108 | # mkdocs documentation 109 | /site 110 | 111 | # mypy 112 | .mypy_cache/ 113 | -------------------------------------------------------------------------------- /AUTHORS.rst: -------------------------------------------------------------------------------- 1 | AutoImpute is developed and maintained by Joseph Kearney and Shahid Barkat. 2 | 3 | Contacting the Authors 4 | ``````````````````````` 5 | - Joseph Kearney `@kearnz <https://github.com/kearnz>`_ 6 | - Shahid Barkat `@shabarka <https://github.com/shabarka>`_ 7 | - Arnab Bose `@bosearnab <https://github.com/bosearnab>`_ 8 | 9 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Code of Conduct 2 | 3 | As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities. 4 | 5 | We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion. 6 | 7 | Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct. 8 | 9 | Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team. 10 | 11 | Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers. 12 | 13 | This Code of Conduct is adapted from the Contributor Covenant, version 1.0.0, available from http://contributor-covenant.org/version/1/0/0/. -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing to Autoimpute 2 | Welcome to `autoimpute`! We appreciate your desire to contribute, and we're excited to see how you can add to or improve our project! 
We've outlined some guidelines to make contributing easy and ensure the contribution process is smooth. 3 | 4 | Jump to specific sections: 5 | * [Getting Started](#getting-started) 6 | * [Code of Conduct](#code-of-conduct) 7 | * [Reporting Bugs](#reporting-bugs) 8 | * [New Features](#new-features) 9 | * [Pull Requests](#pull-requests) 10 | 11 | ## Getting Started 12 | If you're completely new to `autoimpute`, read through our [README](https://github.com/kearnz/autoimpute/blob/master/README.md). To learn more about using the package and review sample use cases, check out our [tutorials](https://kearnz.github.io/autoimpute-tutorials/). 13 | 14 | ## Code of Conduct 15 | Before continuing, please read our [Code of Conduct](https://github.com/kearnz/autoimpute/blob/master/CODE_OF_CONDUCT.md). Our code of conduct extends to questions, issues raised, pull requests, and code itself. 16 | 17 | ## Reporting Bugs 18 | Because `autoimpute` is in rapid development, bugs are inevitable. We do our best to avoid them through testing and squash them quickly if they arise. That being said, we rely on users to help us find bugs that we missed. As noted in the [Pull Requests](#pull-requests) section, always start with a bug report before you actually submit a pull request. A bug report gives us a chance to discuss the bug in more detail and ensure a pull request develops smoothly. Here are some additional pointers for reporting bugs. 19 | 20 | #### Before Reporting Bugs 21 | * Make sure the bug occurs when using the master branch. 22 | * Check other active branches to ensure the bug is not being worked on or already fixed and slated for a future release. 23 | * Comb through the [Issues Board](https://github.com/kearnz/autoimpute/issues) to see if the bug has already been identified. 24 | * For errors, make sure that `autoimpute` is not raising the error as expected behavior. 25 | * For unexpected behavior, determine if any implicit assumptions from `autoimpute` logic produce the output. 26 | 27 | #### When Reporting Bugs 28 | * Follow the guidelines specified in our [Bug Report Issue Template](https://github.com/kearnz/autoimpute/blob/master/.github/ISSUE_TEMPLATE/bug_report.md) 29 | * If you're unsure who should act on this bug, assign it to `kearnz` by default. 30 | 31 | #### After Reporting Bugs 32 | * Stay involved! Monitor the issue thread. Be prepared to respond to comments & questions we may have. 33 | * Keep us updated! If you have more info to provide after the fact, follow up with us in a comment. 34 | * After acknowledgement and discussion, submit a pull request! Follow the pull request guidelines to do so. 35 | 36 | ## New Features 37 | Missing data imputation is a vast field. While we hope to cover as much as we possibly can, it'd be impossible for us to do it all. As noted in the [Pull Requests](#pull-requests) section, always start with a new feature request before you actually submit a pull request. A new feature request gives us a chance to discuss the feature in more detail and ensure a pull request develops smoothly. Here are some additional pointers for submitting new features. 38 | 39 | #### Before Suggesting New Features 40 | * Make sure the feature does not exist within any active branch. 41 | * If the feature exists in a stale branch, get in contact with us first. 42 | * Comb through the [Issues Board](https://github.com/kearnz/autoimpute/issues) to see if the feature has already been requested. 
43 | 44 | #### When Suggesting New Features 45 | * Follow the guidelines specified in our [New Feature Issue Template](https://github.com/kearnz/autoimpute/blob/master/.github/ISSUE_TEMPLATE/feature_request.md) 46 | * If you're unsure who should act on this feature request, assign it to `kearnz` by default. 47 | 48 | #### After Suggesting New Features 49 | * Stay involved! Monitor the issue thread. Be prepared to respond to comments & questions we may have. 50 | * Keep us updated! If you have more info to provide after the fact, follow up with us in a comment. 51 | * After acknowledgement and discussion, submit a pull request! Follow the pull request guidelines to do so. 52 | 53 | ## Pull Requests 54 | We've open sourced `autoimpute` early on so colleagues and students can use the package. We have completed what we feel is the first phase of `autoimpute`, and we are ready to accept pull requests from those who want to contribute. As noted in the previous sections, never issue a pull request to fix a bug or add a new feature until you've submitted a bug report or new feature request. This intermediary step helps us understand your issue, discuss it, and acknowledge the work to be done. We believe it makes the pull request process more successful! Below are some additional pointers for pull requests. 55 | 56 | #### Before Submitting a Pull Request 57 | * Raise an issue, create a new feature request, or alert us via a bug report BEFORE you submit a pull request. 58 | * Comb through existing [pull requests](https://github.com/kearnz/autoimpute/pulls) to ensure that a similar or identical pull request is not in the queue. 59 | 60 | #### When Submitting a Pull Request 61 | * Follow the guidelines specified in our [Pull Request Template](https://github.com/kearnz/autoimpute/blob/master/.github/ISSUE_TEMPLATE/pull_request.md) 62 | * If you're unsure who should act on this pull request or review the code, assign it to `kearnz` by default. 63 | 64 | #### After Submitting a Pull Request 65 | * Let us know if you'd like your name and profile added to an acknowledgement board! 66 | * Get in touch if you'd like to collaborate further on the project - we have plenty of ideas and need bandwidth! 67 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 kearnz, shabarka 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include README.md AUTHORS.rst LICENSE 2 | global-exclude *.py[cod] __pycache__ *.so 3 | prune tests -------------------------------------------------------------------------------- /autoimpute/__init__.py: -------------------------------------------------------------------------------- 1 | """Manage the autoimpute package. 2 | 3 | This module handles imports that should be accessible from the top level. 4 | Because this package contains specific directories for imputation, diagnostics, 5 | exploratory data analysis, and visualizations, no classes or methods should be 6 | accessible from autoimpute itself. Users of this package should import specific 7 | classes or functions they need from the appropriate folder. 8 | 9 | Examples of correctly specified imports: 10 | - import autoimpute.imputations as ai 11 | - from autoimpute.imputations import MissingnessClassifier 12 | - from autoimpute.utils import md_pairs, md_pattern, flux 13 | 14 | Examples of incorrectly specified imports: 15 | - import autoimpute as ai (gives folder access only) 16 | - from autoimpute import * (wildcard imports discouraged and overridden) 17 | 18 | This module handles `from autoimpute import *` with the __all__ variable below. 19 | This command imports the major directories from autoimpute. 20 | """ 21 | 22 | from .__version__ import __version__ 23 | 24 | __all__ = [ 25 | "utils", 26 | "visuals", 27 | "imputations", 28 | "analysis" 29 | ] 30 | -------------------------------------------------------------------------------- /autoimpute/__version__.py: -------------------------------------------------------------------------------- 1 | """Version specification.""" 2 | 3 | VERSION = (0, 14, 1) 4 | 5 | __version__ = ".".join(map(str, VERSION)) 6 | -------------------------------------------------------------------------------- /autoimpute/analysis/__init__.py: -------------------------------------------------------------------------------- 1 | """Manage the analysis folder from the autoimpute package. 2 | 3 | This module handles imports from the analysis folder that should be accessible 4 | whenever someone imports autoimpute.analysis. The list below specifies the 5 | methods and classes that are available on import. 6 | 7 | This module handles `from autoimpute.analysis import *` with the __all__ 8 | variable below. This command imports the public classes and methods from 9 | autoimpute.analysis. 
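A minimal usage sketch (hypothetical names: assumes a pandas DataFrame `X`
with missing values among the predictors and a complete response `y`):

    from autoimpute.analysis import MiLinearRegression
    lm = MiLinearRegression()   # multiply imputes internally before fitting
    lm.fit(X, y)                # fits m models, pools params via Rubin's rules
    print(lm.summary())         # pooled coefficients and variance ratios
    preds = lm.predict(X)       # predictions from the pooled parameters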
10 | """ 11 | 12 | from .base_regressor import MiBaseRegressor 13 | from .linear_regressor import MiLinearRegression 14 | from .logistic_regressor import MiLogisticRegression 15 | from .metrics import raw_bias, percent_bias 16 | 17 | __all__ = [ 18 | "MiBaseRegressor", 19 | "MiLinearRegression", 20 | "MiLogisticRegression", 21 | "raw_bias", 22 | "percent_bias" 23 | ] 24 | -------------------------------------------------------------------------------- /autoimpute/analysis/linear_regressor.py: -------------------------------------------------------------------------------- 1 | """Module containing linear regression for multiply imputed datasets.""" 2 | 3 | import pandas as pd 4 | from sklearn.base import BaseEstimator 5 | from sklearn.linear_model import LinearRegression 6 | from sklearn.utils.validation import check_is_fitted 7 | from statsmodels.api import OLS 8 | from autoimpute.utils import check_nan_columns 9 | from .base_regressor import MiBaseRegressor 10 | 11 | # pylint:disable=attribute-defined-outside-init 12 | # pylint:disable=too-many-locals 13 | 14 | class MiLinearRegression(MiBaseRegressor, BaseEstimator): 15 | """Linear Regression wrapper for multiply imputed datasets. 16 | 17 | The MiLinearRegression class wraps the sklearn and statsmodels libraries 18 | to extend linear regression to multiply imputed datasets. The class wraps 19 | statsmodels as well as sklearn because sklearn alone does not provide 20 | sufficient functionality to pool estimates under Rubin's rules. sklearn is 21 | for machine learning; therefore, important inference capabilities are 22 | lacking, such as easily calculating std. error estimates for parameters. 23 | If users want inference from regression analysis of multiply imputed 24 | data, utilze the statsmodels implementation in this class instead. 25 | 26 | Attributes: 27 | linear_models (dict): linear models used by supported python libs. 28 | """ 29 | 30 | linear_models = { 31 | "type": "linear", 32 | "statsmodels": OLS, 33 | "sklearn": LinearRegression 34 | } 35 | 36 | def __init__(self, mi=None, model_lib="statsmodels", mi_kwgs=None, 37 | model_kwgs=None): 38 | """Create an instance of the Autoimpute MiLinearRegression class. 39 | 40 | Args: 41 | mi (MiceImputer, Optional): An instance of a MiceImputer. 42 | Default is none. Can create one through `mi_kwgs` instead. 43 | model_lib (str, Optional): library the regressor will use to 44 | implement regression. Options are sklearn and statsmodels. 45 | Default is statsmodels. 46 | mi_kwgs (dict, Optional): keyword args to instantiate 47 | MiceImputer. Default is None. If valid MiceImputer 48 | passed as mi argument, then mi_kwgs ignored. 49 | model_kwgs (dict, Optional): keyword args to instantiate 50 | regressor. Default is None. 51 | 52 | Returns: 53 | self. Instance of the class. 54 | """ 55 | MiBaseRegressor.__init__( 56 | self, 57 | mi=mi, 58 | model_lib=model_lib, 59 | mi_kwgs=mi_kwgs, 60 | model_kwgs=model_kwgs 61 | ) 62 | 63 | @check_nan_columns 64 | def fit(self, X, y): 65 | """Fit model specified to multiply imputed dataset. 66 | 67 | Fit a linear regression on multiply imputed datasets. The method first 68 | creates multiply imputed data using the MiceImputer instantiated 69 | when creating an instance of the class. It then runs a linear model on 70 | each m datasets. The linear model comes from sklearn or statsmodels. 71 | Finally, the fit method calculates pooled parameters from the m linear 72 | models. 
Note that variance for pooled parameters using Rubin's rules 73 | is available for statsmodels only. sklearn does not implement 74 | parameter inference out of the box. Autoimpute sklearn pooling TBD. 75 | 76 | Args: 77 | X (pd.DataFrame): predictors to use. can contain missingness. 78 | y (pd.Series, pd.DataFrame): response. can contain missingness. 79 | 80 | Returns: 81 | self. Instance of the class 82 | """ 83 | 84 | # retain columns in case encoding occurs 85 | self.fit_X_columns = X.columns.tolist() 86 | 87 | # generate the imputation datasets from multiple imputation 88 | # then fit the analysis models on each of the imputed datasets 89 | self.models_ = self._apply_models_to_mi_data( 90 | self.linear_models, X, y 91 | ) 92 | 93 | # generate the fit statistics from each of the m models 94 | self.statistics_ = self._get_stats_from_models(self.models_) 95 | 96 | # still return an instance of the class 97 | return self 98 | 99 | @check_nan_columns 100 | def predict(self, X): 101 | """Make predictions using statistics generated from fit. 102 | 103 | The regression uses the pooled parameters from each of the imputed 104 | datasets to generate a set of single predictions. The pooled params 105 | come from multiply imputed datasets, but the predictions themselves 106 | follow the same rules as an ordinary linear regression. 107 | 108 | Args: 109 | X (pd.DataFrame): data to make predictions using pooled params. 110 | 111 | Returns: 112 | np.array: predictions. 113 | """ 114 | 115 | # validation before prediction 116 | X = self._predict_strategy_validator(self, X) 117 | 118 | # get the alpha and betas, then create linear equation for predictions 119 | alpha = self.statistics_["coefs"].values[0] 120 | betas = self.statistics_["coefs"].values[1:] 121 | preds = alpha + betas.dot(X.T) 122 | return preds 123 | 124 | def summary(self): 125 | """Provide a summary for model parameters, variance, and metrics. 126 | 127 | The summary method brings together the statistics generated from fit 128 | as well as the variance ratios, if available. The statistics are far 129 | more valuable when using statsmodels than sklearn. 130 | 131 | Returns: 132 | pd.DataFrame: summary statistics 133 | """ 134 | 135 | # only possible once we've fit a model with statsmodels 136 | check_is_fitted(self, "statistics_") 137 | sdf = pd.DataFrame(self.statistics_) 138 | sdf.rename(columns={"lambda_": "lambda"}, inplace=True) 139 | return sdf 140 | -------------------------------------------------------------------------------- /autoimpute/analysis/logistic_regressor.py: -------------------------------------------------------------------------------- 1 | """Module containing logistic regression for multiply imputed datasets.""" 2 | 3 | import numpy as np 4 | import pandas as pd 5 | from sklearn.base import BaseEstimator 6 | from sklearn.linear_model import LogisticRegression 7 | from sklearn.utils.validation import check_is_fitted 8 | from statsmodels.discrete.discrete_model import Logit 9 | from autoimpute.utils import check_nan_columns 10 | from .base_regressor import MiBaseRegressor 11 | 12 | # pylint:disable=attribute-defined-outside-init 13 | # pylint:disable=too-many-locals 14 | 15 | class MiLogisticRegression(MiBaseRegressor, BaseEstimator): 16 | """Logistic Regression wrapper for multiply imputed datasets. 17 | 18 | The MiLogisticRegression class wraps the sklearn and statsmodels libraries 19 | to extend logistic regression to multiply imputed datasets. 
The class wraps 20 | statsmodels as well as sklearn because sklearn alone does not provide 21 | sufficient functionality to pool estimates under Rubin's rules. sklearn is 22 | for machine learning; therefore, important inference capabilities are 23 | lacking, such as easily calculating std. error estimates for parameters. 24 | If users want inference from regression analysis of multiply imputed 25 | data, utilize the statsmodels implementation in this class instead. 26 | 27 | Attributes: 28 | logistic_models (dict): logistic models used by supported python libs. 29 | """ 30 | 31 | logistic_models = { 32 | "type": "logistic", 33 | "statsmodels": Logit, 34 | "sklearn": LogisticRegression 35 | } 36 | 37 | def __init__(self, mi=None, model_lib="statsmodels", mi_kwgs=None, 38 | model_kwgs=None): 39 | """Create an instance of the Autoimpute MiLogisticRegression class. 40 | 41 | Args: 42 | mi (MiceImputer, Optional): An instance of a MiceImputer. 43 | Default is None. Can create one through `mi_kwgs` instead. 44 | model_lib (str, Optional): library the regressor will use to 45 | implement regression. Options are sklearn and statsmodels. 46 | Default is statsmodels. 47 | mi_kwgs (dict, Optional): keyword args to instantiate 48 | MiceImputer. Default is None. If valid MiceImputer 49 | passed as `mi` argument, then `mi_kwgs` ignored. 50 | model_kwgs (dict, Optional): keyword args to instantiate 51 | regressor. Default is None. 52 | 53 | Returns: 54 | self. Instance of the class. 55 | """ 56 | MiBaseRegressor.__init__( 57 | self, 58 | mi=mi, 59 | model_lib=model_lib, 60 | mi_kwgs=mi_kwgs, 61 | model_kwgs=model_kwgs 62 | ) 63 | 64 | @check_nan_columns 65 | def fit(self, X, y): 66 | """Fit model specified to multiply imputed dataset. 67 | 68 | Fit a logistic regression on multiply imputed datasets. The method 69 | creates multiply imputed data using the MiceImputer instantiated 70 | when creating an instance of the class. It then runs a logistic model 71 | on each of the m datasets. The logistic model comes from sklearn or statsmodels. 72 | Finally, the fit method calculates pooled parameters from m logistic 73 | models. Note that variance for pooled parameters using Rubin's rules 74 | is available for statsmodels only. sklearn does not implement 75 | parameter inference out of the box. 76 | 77 | Args: 78 | X (pd.DataFrame): predictors to use. can contain missingness. 79 | y (pd.Series, pd.DataFrame): response. can contain missingness. 80 | 81 | Returns: 82 | self. Instance of the class 83 | """ 84 | 85 | # retain columns in case encoding occurs 86 | self.fit_X_columns = X.columns.tolist() 87 | 88 | # generate the imputation datasets from multiple imputation 89 | # then fit the analysis models on each of the imputed datasets 90 | self.models_ = self._apply_models_to_mi_data( 91 | self.logistic_models, X, y 92 | ) 93 | 94 | # generate the fit statistics from each of the m models 95 | self.statistics_ = self._get_stats_from_models(self.models_) 96 | 97 | # still return an instance of the class 98 | return self 99 | 100 | def _sigmoid(self, z): 101 | """Private method that applies sigmoid function to input.""" 102 | return 1 / (1 + np.exp(-z)) 103 | 104 | @check_nan_columns 105 | def predict_proba(self, X): 106 | """Predict probabilities of class membership for logistic regression. 107 | 108 | The regression uses the pooled parameters from each of the imputed 109 | datasets to generate a set of single predictions. 
The pooled params 110 | come from multiply imputed datasets, but the predictions themselves 111 | follow the same rules as an ordinary logistic regression. Because this is 112 | logistic regression, the sigmoid function is applied to the result 113 | of the normal equation, giving us probabilities between 0 and 1 for 114 | each prediction. This method returns those probabilities. 115 | 116 | Args: 117 | X (pd.Dataframe): predictors to predict response 118 | 119 | Returns: 120 | np.array: prob of class membership for predicted observations. 121 | """ 122 | 123 | # run validation first 124 | X = self._predict_strategy_validator(self, X) 125 | 126 | # get the alpha and betas, then create linear equation for predictions 127 | alpha = self.statistics_["coefs"].values[0] 128 | betas = self.statistics_["coefs"].values[1:] 129 | return self._sigmoid(alpha + np.dot(X, betas)) 130 | 131 | @check_nan_columns 132 | def predict(self, X, threshold=0.5): 133 | """Make predictions using statistics generated from fit. 134 | 135 | The predict method calls on the predict_proba method, which returns 136 | the probability of class membership for each prediction. These 137 | probabilities range from 0 to 1. Therefore, anything below the set 138 | threshold is assigned to class 0, while anything above the threshold 139 | is assigned to class 1. The default threshold is 0.5, which indicates 140 | a balanced dataset. 141 | 142 | Args: 143 | X (pd.DataFrame): data to make predictions using pooled params. 144 | threshold (float, Optional): boundary for class membership. 145 | Default is 0.5. Values can range from 0 to 1. 146 | 147 | Returns: 148 | np.array: predictions. 149 | """ 150 | pred_probs = self.predict_proba(X) 151 | pred_array = (pred_probs >= threshold).astype(int) 152 | responses = self._response_categories.values 153 | return responses[pred_array] 154 | 155 | def summary(self): 156 | """Provide a summary for model parameters, variance, and metrics. 157 | 158 | The summary method brings together the statistics generated from fit 159 | as well as the variance ratios, if available. The statistics are far 160 | more valuable when using statsmodels than sklearn. 161 | 162 | Returns: 163 | pd.DataFrame: summary statistics 164 | """ 165 | 166 | # only possible once we've fit a model with statsmodels 167 | check_is_fitted(self, "statistics_") 168 | sdf = pd.DataFrame(self.statistics_) 169 | sdf.rename(columns={"lambda_": "lambda"}, inplace=True) 170 | return sdf 171 | -------------------------------------------------------------------------------- /autoimpute/analysis/metrics.py: -------------------------------------------------------------------------------- 1 | """This module devises metrics to compare estimates from analysis models.""" 2 | 3 | import numpy as np 4 | import pandas as pd 5 | 6 | def raw_bias(Q_bar, Q): 7 | """Calculate raw bias between estimated coefficients Q_bar and actual Q. 8 | 9 | Q_bar can be one estimate (scalar) or a vector of estimates. This equation 10 | subtracts the true Q from the expected Q_bar, element-wise. The result is the bias 11 | of each coefficient from its true value. 12 | 13 | Args: 14 | Q_bar (number, array): single estimate or array of estimates. 15 | Q (number, array): single truth or array of truths. 16 | 17 | Returns: 18 | scalar, array: element-wise difference between estimates and truths. 
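        Example (illustrative, with hypothetical numbers):
            raw_bias([0.55, 1.2], [0.5, 1.0]) returns approximately
            array([0.05, 0.2]), i.e. each estimate minus its true value.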
19 | 20 | Raises: 21 | ValueError: Shape mismatch 22 | ValueError: Q_bar and Q not the same length 23 | """ 24 | 25 | # handle errors first 26 | shape_err = "Q_bar & Q must be scalars or vectors of same length." 27 | if isinstance(Q_bar, pd.DataFrame): 28 | s = len(Q_bar.shape) 29 | if s != 1: 30 | raise ValueError(shape_err) 31 | 32 | if isinstance(Q, pd.DataFrame): 33 | s = len(Q.shape) 34 | if s != 1: 35 | raise ValueError(shape_err) 36 | 37 | if len(Q_bar) != len(Q): 38 | raise ValueError(shape_err) 39 | 40 | # convert any lists to ensure element-wise performed 41 | if isinstance(Q_bar, (tuple, list)): 42 | Q_bar = np.array(Q_bar) 43 | 44 | if isinstance(Q, (tuple, list)): 45 | Q = np.array(Q) 46 | 47 | # perform element-wise subtraction 48 | rb = Q_bar - Q 49 | return rb 50 | 51 | def percent_bias(Q_bar, Q): 52 | """Calculate percent bias between estimated coefficients Q_bar and actual Q. 53 | 54 | Q_bar can be one estimate (scalar) or a vector of estimates. This equation 55 | subtracts the true Q from the expected Q_bar, element-wise. The result is the bias 56 | of each coefficient from its true value. We then divide this number by 57 | Q itself, again in element-wise fashion, to produce % bias. 58 | 59 | Args: 60 | Q_bar (number, array): single estimate or array of estimates. 61 | Q (number, array): single truth or array of truths. 62 | 63 | Returns: 64 | scalar, array: element-wise percent bias of estimates relative to truths. 65 | 66 | Raises: 67 | ValueError: Shape mismatch 68 | ValueError: Q_bar and Q not the same length 69 | """ 70 | # calling this method will validate Q_bar and Q 71 | rb = raw_bias(Q_bar, Q) 72 | 73 | # convert Q if necessary. must re-perform operation 74 | if isinstance(Q, (tuple, list)): 75 | Q = np.array(Q) 76 | 77 | pct_bias = 100 * (abs(rb)/Q) 78 | return pct_bias 79 | -------------------------------------------------------------------------------- /autoimpute/imputations/__init__.py: -------------------------------------------------------------------------------- 1 | """Manage the imputations folder from the autoimpute package. 2 | 3 | This module handles imports from the imputations folder that should be 4 | accessible whenever someone imports autoimpute.imputations. The list below 5 | specifies the methods and classes that are currently available on import. 6 | 7 | This module handles `from autoimpute.imputations import *` with the __all__ 8 | variable below. This command imports the main public classes and methods 9 | from autoimpute.imputations. 10 | """ 11 | 12 | from .dataframe import BaseImputer 13 | from .mis_classifier import MissingnessClassifier 14 | from .dataframe import SingleImputer 15 | from .dataframe import MultipleImputer 16 | from .dataframe import MiceImputer 17 | from .deletion import listwise_delete 18 | 19 | __all__ = [ 20 | "BaseImputer", 21 | "MissingnessClassifier", 22 | "SingleImputer", 23 | "MultipleImputer", 24 | "MiceImputer", 25 | "listwise_delete" 26 | ] 27 | -------------------------------------------------------------------------------- /autoimpute/imputations/dataframe/__init__.py: -------------------------------------------------------------------------------- 1 | """Manage the dataframe imputations folder from the autoimpute package. 2 | 3 | This module handles imports from the dataframe imputations folder that should 4 | be accessible whenever someone imports autoimpute.imputations.dataframe. 5 | The list below specifies the methods and classes that are available on import. 
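A minimal usage sketch (assumes a hypothetical pandas DataFrame `df` with
missing values; the SingleImputer is shown, and the other imputers follow
the same sklearn-style API):

    from autoimpute.imputations.dataframe import SingleImputer
    si = SingleImputer()               # `predictive default` strategy
    df_imputed = si.fit_transform(df)  # one completed copy of df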
6 | 7 | This module handles `from autoimpute.imputations.dataframe import *` with the 8 | __all__ variable below. This command imports the public classes and methods 9 | from autoimpute.imputations.dataframe. 10 | """ 11 | 12 | from .base_imputer import BaseImputer 13 | from .single_imputer import SingleImputer 14 | from .multiple_imputer import MultipleImputer 15 | from .mice_imputer import MiceImputer 16 | 17 | __all__ = [ 18 | "BaseImputer", 19 | "SingleImputer", 20 | "MultipleImputer", 21 | "MiceImputer" 22 | ] 23 | -------------------------------------------------------------------------------- /autoimpute/imputations/dataframe/base_imputer.py: -------------------------------------------------------------------------------- 1 | """Module for BaseImputer - a base class for DataFrame imputers. 2 | 3 | This module contains the `BaseImputer`, which is used to abstract away 4 | functionality in both DataFrame imputers. The `BaseImputer` also holds the 5 | methods available for imputation analysis. 6 | """ 7 | 8 | import warnings 9 | from autoimpute.utils import check_strategy_allowed 10 | from autoimpute.imputations import method_names 11 | from ..series import DefaultUnivarImputer, DefaultPredictiveImputer 12 | from ..series import DefaultTimeSeriesImputer 13 | from ..series import MeanImputer, MedianImputer, ModeImputer 14 | from ..series import NormImputer, CategoricalImputer 15 | from ..series import RandomImputer, InterpolateImputer 16 | from ..series import LOCFImputer, NOCBImputer 17 | from ..series import LeastSquaresImputer, StochasticImputer 18 | from ..series import PMMImputer, LRDImputer 19 | from ..series import BinaryLogisticImputer, MultinomialLogisticImputer 20 | from ..series import BayesianLeastSquaresImputer 21 | from ..series import BayesianBinaryLogisticImputer 22 | from ..series import NormUnitVarianceImputer 23 | methods = method_names 24 | 25 | # pylint:disable=attribute-defined-outside-init 26 | # pylint:disable=too-many-arguments 27 | # pylint:disable=too-many-instance-attributes 28 | # pylint:disable=inconsistent-return-statements 29 | 30 | class BaseImputer: 31 | """Building blocks for more advanced DataFrame imputers. 32 | 33 | The BaseImputer is not a stand-alone class and thus serves no purpose 34 | other than as a parent to Imputers. Therefore, the BaseImputer should not 35 | be used directly unless creating an Imputer. That being said, all 36 | DataFrame Imputers should inherit from BaseImputer. It contains base 37 | functionality for any new DataFrame Imputer, and it holds the set of 38 | strategies that make up this imputation library. 39 | 40 | Attributes: 41 | univariate_strategies (dict): univariate imputation methods. 42 | | Key = imputation name; Value = function to perform imputation. 43 | | `univariate default` mean for numerical, mode for categorical. 44 | | `time default` interpolate for numerical, mode for categorical. 45 | | `mean` imputes missing values with the average of the series. 46 | | `median` imputes missing values with the median of the series. 47 | | `mode` imputes missing values with the mode of the series. 48 | | Method handles more than one mode (see ModeImputer for info). 49 | | `random` imputes random choice from set of series unique vals. 50 | | `norm` imputes series w/ random draws from normal distribution. 51 | | Mean and std calculated from observed values of the series. 52 | | `categorical` imputes series using random draws from pmf. 53 | | Proportions calculated from non-missing category instances. 
54 | | `interpolate` imputes series using chosen interpolation method. 55 | | Default is linear. See InterpolateImputer for more info. 56 | | `locf` imputes series carrying last observation moving forward. 57 | | `nocb` imputes series carrying next observation moving backward. 58 | | `normal unit variance` imputes using unit variance w/ norm. 59 | predictive_strategies (dict): predictive imputation methods. 60 | | Key = imputation name; Value = function to perform imputation. 61 | | `predictive default` pmm for numerical, logistic for categorical. 62 | | `least squares` predict missing values from linear regression. 63 | | `binary logistic` predict missing values with 2 classes. 64 | | `multinomial logistic` predict missing values with multiclass. 65 | | `stochastic` linear regression + random draw from norm w/ mse std. 66 | | `bayesian least squares` draw from the posterior predictive 67 | | distribution for each missing value, using OLS model. 68 | | `bayesian binary logistic` draw from the posterior predictive 69 | | distribution for each missing value, using logistic model. 70 | | `pmm` imputes series using predictive mean matching. PMM is a 71 | | semi-supervised method using bayesian & hot-deck imputation. 72 | | `lrd` imputes series using local residual draws. LRD is a 73 | | semi-supervised method using bayesian & hot-deck imputation. 74 | """ 75 | univariate_strategies = { 76 | methods.DEFAULT_UNIVAR: DefaultUnivarImputer, 77 | methods.DEFAULT_TIME: DefaultTimeSeriesImputer, 78 | methods.MEAN: MeanImputer, 79 | methods.MEDIAN: MedianImputer, 80 | methods.MODE: ModeImputer, 81 | methods.RANDOM: RandomImputer, 82 | methods.NORM: NormImputer, 83 | methods.CATEGORICAL: CategoricalImputer, 84 | methods.INTERPOLATE: InterpolateImputer, 85 | methods.LOCF: LOCFImputer, 86 | methods.NOCB: NOCBImputer, 87 | methods.NORM_UNIT_VARIANCE: NormUnitVarianceImputer, 88 | } 89 | 90 | predictive_strategies = { 91 | methods.DEFAULT_PRED: DefaultPredictiveImputer, 92 | methods.LS: LeastSquaresImputer, 93 | methods.STOCHASTIC: StochasticImputer, 94 | methods.BINARY_LOGISTIC: BinaryLogisticImputer, 95 | methods.MULTI_LOGISTIC: MultinomialLogisticImputer, 96 | methods.BAYESIAN_LS: BayesianLeastSquaresImputer, 97 | methods.BAYESIAN_BINARY_LOGISTIC: BayesianBinaryLogisticImputer, 98 | methods.PMM: PMMImputer, 99 | methods.LRD: LRDImputer 100 | } 101 | 102 | strategies = {**predictive_strategies, **univariate_strategies} 103 | 104 | visit_sequences = ( 105 | "default", 106 | "left-to-right" 107 | ) 108 | 109 | def __init__(self, strategy, imp_kwgs, visit): 110 | """Initialize the BaseImputer. 111 | 112 | Args: 113 | strategy (str, iter, dict; optional): strategies for imputation. 114 | Default value is str -> `predictive default`. 115 | If str, single strategy broadcast to all series in DataFrame. 116 | If iter, must provide 1 strategy per column. Each method w/in 117 | iterator applies to column with same index value in DataFrame. 118 | If dict, must provide key = column name, value = imputer. 119 | Dict the most flexible and PREFERRED way to create custom 120 | imputation strategies if not using the default. Dict does not 121 | require method for every column; just those specified as keys. 122 | imp_kwgs (dict, optional): keyword arguments for each imputer. 123 | Default is None, which means default imputer created to match 124 | specific strategy. imp_kwgs keys can be either columns or 125 | strategies. If strategies, each column given that strategy is 126 | instantiated with same arguments. 
127 | visit (str, None): order to visit columns for imputation. 128 | Default is `default`, which implements `left-to-right`. 129 | More strategies (random, monotone, etc.) TBD. 130 | """ 131 | self.strategy = strategy 132 | self.imp_kwgs = imp_kwgs 133 | self.visit = visit 134 | 135 | @property 136 | def strategy(self): 137 | """Property getter to return the value of the strategy property.""" 138 | return self._strategy 139 | 140 | @strategy.setter 141 | def strategy(self, s): 142 | """Validate the strategy property to ensure its type and value. 143 | 144 | Class instance only possible if strategy is proper type, as outlined 145 | in the init method. Passes supported strategies and user arg to 146 | helper method, which performs strategy checks. 147 | 148 | Args: 149 | s (str, iter, dict): Strategy passed as arg to class instance. 150 | 151 | Raises: 152 | ValueError: Strategies not valid (not in allowed strategies). 153 | TypeError: Strategy must be a string, tuple, list, or dict. 154 | Both errors raised through helper method `check_strategy_allowed`. 155 | """ 156 | strat_names = self.strategies.keys() 157 | self._strategy = check_strategy_allowed(strat_names, s) 158 | 159 | @property 160 | def imp_kwgs(self): 161 | """Property getter to return the value of imp_kwgs.""" 162 | return self._imp_kwgs 163 | 164 | @imp_kwgs.setter 165 | def imp_kwgs(self, kwgs): 166 | """Validate the imp_kwgs and set default properties. 167 | 168 | The BaseImputer validates the `imp_kwgs` argument. `imp_kwgs` contain 169 | optional keyword arguments for an imputer's strategies or columns. The 170 | argument is optional, and its default is None. 171 | 172 | Args: 173 | kwgs (dict, None): None or dictionary of keywords. 174 | 175 | Raises: 176 | ValueError: imp_kwgs not correctly specified as argument. 177 | """ 178 | if not isinstance(kwgs, (type(None), dict)): 179 | err = "imp_kwgs must be dict of args used to instantiate Imputer." 180 | raise ValueError(err) 181 | self._imp_kwgs = kwgs 182 | 183 | @property 184 | def visit(self): 185 | """Property getter to return the value of the visit property.""" 186 | return self._visit 187 | 188 | @visit.setter 189 | def visit(self, v): 190 | """Validate the visit property to ensure its type and value. 191 | 192 | Class instance only possible if visit is proper type, as outlined in 193 | the init method. Visit property must be one of valid sequences in the 194 | `visit_sequences` variable. 195 | 196 | Args: 197 | v (str): Visit sequence passed as arg to class instance. 198 | 199 | Raises: 200 | TypeError: visit sequence must be a string. 201 | ValueError: visit sequence not in `visit_sequences`. 202 | """ 203 | 204 | # deal with type first 205 | if not isinstance(v, str): 206 | err = "visit must be a string specifying visit sequence to use." 207 | raise TypeError(err) 208 | 209 | # deal with value next 210 | if v not in self.visit_sequences: 211 | err = f"visit not valid. 
Must be one of {self.visit_sequences}" 212 | raise ValueError(err) 213 | 214 | # otherwise, set property for visit 215 | self._visit = v 216 | 217 | def _fit_init_params(self, column, method, kwgs): 218 | """Private method to supply imputation model fit params if any.""" 219 | 220 | # first, handle easy case when no kwargs given 221 | if kwgs is None: 222 | final_params = kwgs 223 | 224 | # next, check if any kwargs for a given Imputer method type 225 | # then, override those parameters if specific column kwargs supplied 226 | if isinstance(kwgs, dict): 227 | initial_params = kwgs.get(method, None) 228 | final_params = kwgs.get(column, initial_params) 229 | 230 | # final params must be None or a dictionary of kwargs 231 | # this additional validation step is crucial to dictionary unpacking 232 | if not isinstance(final_params, (type(None), dict)): 233 | err = "Additional params must be dict of args used to init model." 234 | raise ValueError(err) 235 | return final_params 236 | 237 | def _check_if_single_dummy(self, col, X): 238 | """Private method to check if encoding results in single cat.""" 239 | cats = X.columns.tolist() 240 | if len(cats) == 1: 241 | c = cats[0] 242 | msg = f"{c} only category for feature {col}." 243 | cons = f"Consider removing {col} from dataset." 244 | warnings.warn(f"{msg} {cons}") 245 | -------------------------------------------------------------------------------- /autoimpute/imputations/dataframe/mice_imputer.py: -------------------------------------------------------------------------------- 1 | """This module performs a series of multiple imputations of missing features 2 | in a dataset. 3 | 4 | This module contains one class - the MiceImputer. Use this class to 5 | impute each Series within a DataFrame multiple times using an iteration of fits 6 | and transformations to reach a stable state of imputation each time. This 7 | extension of MultipleImputer makes the same imputation methods available as its 8 | parent class - both univariate and multivariate. Each method runs `n` times on 9 | its specified column. Once all `n` passes through the columns are complete, the 10 | MiceImputer returns the `n` imputed datasets. For each of these `n` 11 | imputations, the method (re)fits and applies imputation to the dataset `k` 12 | times. Typically `k` should be at least 3 to reach a stable state. 13 | Its functioning is based upon the R package MICE 14 | (https://cran.r-project.org/web/packages/mice/). 15 | """ 16 | 17 | from autoimpute.utils import check_nan_columns 18 | from .multiple_imputer import MultipleImputer 19 | 20 | 21 | class MiceImputer(MultipleImputer): 22 | """Techniques to impute Series with missing values multiple times using 23 | repeated fits and applications to reach a stable imputation. 24 | 25 | The MiceImputer class implements multiple imputation, i.e., a series 26 | or repetition of applications of imputation to reach a stable imputation, 27 | similar to the functioning of the R package MICE. 28 | It leverages the methods found in the BaseImputer. This imputer passes 29 | all the work for each imputation to the SingleImputer, but it controls 30 | the arguments each imputer receives. The args are flexible depending on 31 | what the user specifies for each imputation. 32 | 33 | Note that the Imputer allows for one imputation method per column only. 34 | Therefore, the behavior of `strategy` is the same as the SingleImputer, 35 | but the predictors are allowed to change for each imputation. 
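    A minimal usage sketch (assumes a hypothetical pandas DataFrame `df`
    with missing values):

        from autoimpute.imputations import MiceImputer
        mice = MiceImputer(k=3, n=5, return_list=True)
        imputations = mice.fit_transform(df)  # list of (m, imputed DataFrame)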
36 | """ 37 | 38 | def __init__(self, k=3, n=5, 39 | strategy="default predictive", predictors="all", 40 | imp_kwgs=None, seed=None, visit="default", 41 | return_list=False): 42 | """Create an instance of the SeriesImputer class. 43 | 44 | As with sklearn classes, all arguments take default values. Therefore, 45 | SeriesImputer() creates a valid class instance. The instance is 46 | used to set up an imputer and perform checks on arguments. 47 | 48 | Args: 49 | k (int, optional): number of repeated fits and transformations to 50 | apply to reach a stable impution. Default is 3. 51 | Value must be greater than or equal to 1. 52 | n (int, optional): number of imputations to perform. Default is 5. 53 | Value must be greater than or equal to 1. 54 | strategy (str, iter, dict; optional): strategy for single imputer. 55 | Default value is str --> `predictive default`. 56 | See BaseImputer for all available strategies. 57 | If str, single strategy broadcast to all series in DataFrame. 58 | If iter, must provide 1 strategy per column. Each method w/in 59 | iterator applies to column with same index value in DataFrame. 60 | If dict, must provide key = column name, value = imputer. 61 | Dict the most flexible and PREFERRED way to create custom 62 | imputation strategies if not using the default. Dict does not 63 | require method for every column; just those specified as keys. 64 | predictors (str, iter, dict, optional): defaults to all, i.e. 65 | use all predictors. If all, every column will be used for 66 | every class prediction. If a list, subset of columns used for 67 | all predictions. If a dict, specify which columns to use as 68 | predictors for each imputation. Columns not specified in dict 69 | but present in `strategy` receive `all` other cols as preds. 70 | imp_kwgs (dict, optional): keyword arguments for each imputer. 71 | Default is None, which means default imputer created to match 72 | specific strategy. imp_kwgs keys can be either columns or 73 | strategies. If strategies, each column given that strategy is 74 | instantiated with same arguments. When strategy is `default`, 75 | imp_kwgs is ignored. 76 | seed (int, optional): seed setting for reproducible results. 77 | Defualt is None. No validation, but values should be integer. 78 | return_list (bool, optional): return m as list or generator. 79 | Default is False. m imputations returned as generator. More 80 | memory efficient. return as list if return_list=True 81 | """ 82 | self.k = k 83 | MultipleImputer.__init__( 84 | self, 85 | n=n, 86 | strategy=strategy, 87 | predictors=predictors, 88 | imp_kwgs=imp_kwgs, 89 | seed=seed, 90 | visit=visit, 91 | return_list=return_list 92 | ) 93 | 94 | @property 95 | def k(self): 96 | """Property getter to return the value of the k property.""" 97 | return self._k 98 | 99 | @k.setter 100 | def k(self, k_): 101 | """Validate the k property to ensure it's Type and Value. 102 | 103 | Args: 104 | k_ (int): k passed as arg to class instance. 105 | 106 | Raises: 107 | TypeError: k must be an integer. 108 | ValueError: k must be greater than zero. 109 | """ 110 | 111 | # deal with type first 112 | if not isinstance(k_, int): 113 | err = """ 114 | k must be an int specifying number of repeated fits 115 | and transformations in a series of imputations. 116 | """ 117 | raise TypeError(err) 118 | 119 | # then check the value is greater than zero 120 | if k_ < 1: 121 | err = "k > 0. Cannot perform fewer than 1 imputation." 
122 | raise ValueError(err) 123 | 124 | # otherwise set the property value for k 125 | self._k = k_ 126 | 127 | def _iterate_imputation(self, X, imp): 128 | """Helper function that iterates self.k times to create a stable imputation 129 | by repeated application and retraining of the imputation models. 130 | Used by transform(). 131 | 132 | Args: 133 | X (pd.DataFrame): fit DataFrame to impute 134 | imp (Imputer): Imputer to apply to X 135 | """ 136 | X2 = imp.transform(X, imp_ixs=self.imputed_) 137 | for k in range(self.k - 1): 138 | imp.fit(X2, imp_ixs=self.imputed_) 139 | X2 = imp.transform(X, imp_ixs=self.imputed_, k=k) 140 | return X2 141 | 142 | @check_nan_columns 143 | def transform(self, X): 144 | """Impute each column within a DataFrame using fit imputation methods. 145 | 146 | The transform step performs the actual imputations. Given a dataset 147 | previously fit, `transform` imputes each column with its respective 148 | imputed values from fit (in the case of inductive) or performs new fit 149 | and transform in one sweep (in the case of transductive). 150 | The transformations and fits are repeatedly applied and refitted self.k 151 | times to reach a stable imputation. 152 | 153 | Args: 154 | X (pd.DataFrame): fit DataFrame to impute. 155 | 156 | Returns: 157 | X (pd.DataFrame): imputed in place or copy of original. 158 | 159 | Raises: 160 | ValueError: same columns must appear in fit and transform. 161 | """ 162 | 163 | # call transform strategy validator before applying transform 164 | self._transform_strategy_validator() 165 | 166 | # make it easy to access the location of the imputed values 167 | self.imputed_ = {} 168 | for column in self._strats.keys(): 169 | imp_ix = X[column][X[column].isnull()].index 170 | self.imputed_[column] = imp_ix.tolist() 171 | 172 | # right now, return a generator by default 173 | # sequential only for now 174 | imputed = ((i[0], self._iterate_imputation(X, i[1])) 175 | for i in self.statistics_.items()) 176 | 177 | if self.return_list: 178 | imputed = list(imputed) 179 | return imputed 180 | -------------------------------------------------------------------------------- /autoimpute/imputations/dataframe/multiple_imputer.py: -------------------------------------------------------------------------------- 1 | """This module performs multiple imputations of missing features in a dataset. 2 | 3 | This module contains one class - the MultipleImputer. Use this class to 4 | impute each Series within a DataFrame multiple times. This class makes 5 | numerous imputation methods available - both univariate and multivariate. Each 6 | method runs `n` times on its specified column. Once all `n` passes through the 7 | columns are complete, the MultipleImputer returns the `n` imputed datasets. 
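A minimal usage sketch (assumes a hypothetical pandas DataFrame `df` with
missing values). Each returned element pairs an imputation number with one
completed copy of the data:

    from autoimpute.imputations import MultipleImputer
    mi = MultipleImputer(n=5, return_list=True)
    imputations = mi.fit_transform(df)  # roughly [(1, df_1), ..., (5, df_5)]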
8 | """ 9 | 10 | from sklearn.base import BaseEstimator, TransformerMixin 11 | from sklearn.utils.validation import check_is_fitted 12 | from autoimpute.imputations import method_names 13 | from autoimpute.utils import check_nan_columns, check_predictors_fit 14 | from autoimpute.utils import check_strategy_fit 15 | from .base_imputer import BaseImputer 16 | from .single_imputer import SingleImputer 17 | methods = method_names 18 | 19 | # pylint:disable=attribute-defined-outside-init 20 | # pylint:disable=protected-access 21 | # pylint:disable=too-many-arguments 22 | # pylint:disable=unused-argument 23 | # pylint:disable=too-many-instance-attributes 24 | # pylint:disable=arguments-differ 25 | 26 | 27 | class MultipleImputer(BaseImputer, BaseEstimator, TransformerMixin): 28 | """Techniques to impute Series with missing values multiple times. 29 | 30 | The MultipleImputer class applies imputation multiple times. It leverages the 31 | methods found in the BaseImputer. This imputer passes all the work for 32 | each imputation to the SingleImputer, but it controls the arguments 33 | each imputer receives. The args are flexible depending on what the user 34 | specifies for each imputation. 35 | 36 | Note that the Imputer allows for one imputation method per column only. 37 | Therefore, the behavior of `strategy` is the same as the SingleImputer, 38 | but the predictors are allowed to change for each imputation. 39 | """ 40 | 41 | def __init__(self, n=5, strategy="default predictive", predictors="all", 42 | imp_kwgs=None, seed=None, visit="default", return_list=False): 43 | """Create an instance of the MultipleImputer class. 44 | 45 | As with sklearn classes, all arguments take default values. Therefore, 46 | MultipleImputer() creates a valid class instance. The instance is 47 | used to set up an imputer and perform checks on arguments. 48 | 49 | Args: 50 | n (int, optional): number of imputations to perform. Default is 5. 51 | Value must be greater than or equal to 1. 52 | strategy (str, iter, dict; optional): strategy for single imputer. 53 | Default value is str --> `predictive default`. 54 | See BaseImputer for all available strategies. 55 | If str, single strategy broadcast to all series in DataFrame. 56 | If iter, must provide 1 strategy per column. Each method w/in 57 | iterator applies to column with same index value in DataFrame. 58 | If dict, must provide key = column name, value = imputer. 59 | Dict the most flexible and PREFERRED way to create custom 60 | imputation strategies if not using the default. Dict does not 61 | require method for every column; just those specified as keys. 62 | predictors (str, iter, dict, optional): defaults to all, i.e. 63 | use all predictors. If all, every column will be used for 64 | every class prediction. If a list, subset of columns used for 65 | all predictions. If a dict, specify which columns to use as 66 | predictors for each imputation. Columns not specified in dict 67 | but present in `strategy` receive `all` other cols as preds. 68 | imp_kwgs (dict, optional): keyword arguments for each imputer. 69 | Default is None, which means default imputer created to match 70 | specific strategy. imp_kwgs keys can be either columns or 71 | strategies. If strategies, each column given that strategy is 72 | instantiated with same arguments. When strategy is `default`, 73 | imp_kwgs is ignored. 74 | seed (int, optional): seed setting for reproducible results. 75 | Defualt is None. No validation, but values should be integer. 
76 | return_list (bool, optional): return m as list or generator. 77 | Default is False. m imputations returned as generator. More 78 | memory efficient. Return as list if return_list=True. 79 | """ 80 | BaseImputer.__init__( 81 | self, 82 | strategy=strategy, 83 | imp_kwgs=imp_kwgs, 84 | visit=visit 85 | ) 86 | self.n = n 87 | self.predictors = predictors 88 | self.seed = seed 89 | self.return_list = return_list 90 | self.copy = True 91 | 92 | @property 93 | def n(self): 94 | """Property getter to return the value of the n property.""" 95 | return self._n 96 | 97 | @n.setter 98 | def n(self, n_): 99 | """Validate the n property to ensure its type and value. 100 | 101 | Args: 102 | n_ (int): n passed as arg to class instance. 103 | 104 | Raises: 105 | TypeError: n must be an integer. 106 | ValueError: n must be greater than zero. 107 | """ 108 | 109 | # deal with type first 110 | if not isinstance(n_, int): 111 | err = "n must be an integer specifying number of imputations." 112 | raise TypeError(err) 113 | 114 | # then check the value is greater than zero 115 | if n_ < 1: 116 | err = "n > 0. Cannot perform fewer than 1 imputation." 117 | raise ValueError(err) 118 | 119 | # otherwise set the property value for n 120 | self._n = n_ 121 | 122 | def _fit_strategy_validator(self, X): 123 | """Internal helper method to validate strategies appropriate for fit. 124 | 125 | Checks whether strategies match with type of column they are applied 126 | to. If not, error is raised through `check_strategy_fit` method. 127 | """ 128 | 129 | # remove nan columns and store colnames 130 | cols = X.columns.tolist() 131 | self._strats = check_strategy_fit(self.strategy, cols) 132 | 133 | # if predictors is a list... 134 | if isinstance(self.predictors, (tuple, list)): 135 | # and it is not the same list of predictors for every iteration... 136 | if not all([isinstance(x, str) for x in self.predictors]): 137 | len_pred = len(self.predictors) 138 | # raise error if not the correct length 139 | if len_pred != self.n: 140 | err = f"Predictors has {len_pred} items. Need {self.n}" 141 | raise ValueError(err) 142 | # check predictors for each in list 143 | self._preds = [ 144 | check_predictors_fit(p, cols) 145 | for p in self.predictors 146 | ] 147 | # if it is a list, but not a list of objects... 148 | else: 149 | # broadcast predictors 150 | self._preds = check_predictors_fit(self.predictors, cols) 151 | self._preds = [self._preds]*self.n 152 | # if string or dictionary... 153 | else: 154 | # broadcast predictors 155 | self._preds = check_predictors_fit(self.predictors, cols) 156 | self._preds = [self._preds]*self.n 157 | 158 | def _transform_strategy_validator(self): 159 | """Private method to prep for prediction.""" 160 | check_is_fitted(self, "statistics_") 161 | 162 | @check_nan_columns 163 | def fit(self, X, y=None): 164 | """Fit imputation methods to each column within a DataFrame. 165 | 166 | The fit method calculates the `statistics` necessary to later 167 | transform a dataset (i.e. perform actual imputations). Inductive 168 | methods calculate statistics on the fit data, then impute new missing 169 | data with that value. All currently supported methods are inductive. 170 | 171 | Args: 172 | X (pd.DataFrame): pandas DataFrame on which imputer is fit. 173 | 174 | Returns: 175 | self: instance of the MultipleImputer class.
176 | """ 177 | 178 | # first, run the fit strategy validator, then create statistics 179 | self._fit_strategy_validator(X) 180 | self.statistics_ = {} 181 | 182 | # deal with potentially setting seed for each individual predictor 183 | if self.seed is not None: 184 | self._seeds = [self.seed + i for i in range(1, self.n*13, 13)] 185 | else: 186 | self._seeds = [None]*self.n 187 | 188 | # create PredictiveImputers. sequentially only right now 189 | for i in range(1, self.n+1): 190 | imputer = SingleImputer( 191 | strategy=self.strategy, 192 | predictors=self._preds[i-1], 193 | imp_kwgs=self.imp_kwgs, 194 | copy=self.copy, 195 | seed=self._seeds[i-1], 196 | visit=self.visit 197 | ) 198 | imputer.fit(X) 199 | self.statistics_[i] = imputer 200 | 201 | return self 202 | 203 | @check_nan_columns 204 | def transform(self, X, **trans_kwargs): 205 | """Impute each column within a DataFrame using fit imputation methods. 206 | 207 | The transform step performs the actual imputations. Given a dataset 208 | previously fit, `transform` imputes each column with it's respective 209 | imputed values from fit (in the case of inductive) or performs new fit 210 | and transform in one sweep (in the case of transductive). 211 | 212 | Args: 213 | X (pd.DataFrame): fit DataFrame to impute. 214 | **trans_kwargs: dict, optional args for bayesian. 215 | 216 | Returns: 217 | X (pd.DataFrame): imputed in place or copy of original. 218 | 219 | Raises: 220 | ValueError: same columns must appear in fit and transform. 221 | """ 222 | 223 | # call transform strategy validator before applying transform 224 | self._transform_strategy_validator() 225 | 226 | # make it easy to access the location of the imputed values 227 | self.imputed_ = {} 228 | for column in self._strats.keys(): 229 | imp_ix = X[column][X[column].isnull()].index 230 | self.imputed_[column] = imp_ix.tolist() 231 | 232 | # right now, return a generator by default 233 | # sequential only for now 234 | imputed = ((i[0], i[1].transform(X, **trans_kwargs)) 235 | for i in self.statistics_.items()) 236 | if self.return_list: 237 | imputed = list(imputed) 238 | return imputed 239 | 240 | def fit_transform(self, X, y=None, **trans_kwargs): 241 | """Convenience method to fit then transform the same dataset.""" 242 | return self.fit(X, y).transform(X, **trans_kwargs) 243 | -------------------------------------------------------------------------------- /autoimpute/imputations/deletion.py: -------------------------------------------------------------------------------- 1 | """Deletion strategies to handle the missing data in pandas DataFrame.""" 2 | 3 | from autoimpute.utils import check_nan_columns 4 | 5 | @check_nan_columns 6 | def listwise_delete(data, inplace=False, verbose=False): 7 | """Delete all rows from a DataFrame where any missing values exist. 8 | 9 | Deletion is one way to handle missing values. This method removes any 10 | records that have a missing value in any of the features. This package 11 | focuses on imputation, not deletion. That being said, listwise deletion 12 | is a necessary component of any imputation package, as its the default 13 | method most people (and software) use to handle missing data. 14 | 15 | Args: 16 | data (pd.DataFrame): DataFrame used to delete missing rows. 17 | inplace (boolean, optional): perform operation inplace. 18 | Defaults to False. 19 | verbose (boolean, optional): print information to console. 20 | Defaults to False. 21 | 22 | Returns: 23 | pd.DataFrame: rows with missing values removed. 
-------------------------------------------------------------------------------- /autoimpute/imputations/deletion.py: -------------------------------------------------------------------------------- 1 | """Deletion strategies to handle the missing data in pandas DataFrame.""" 2 | 3 | from autoimpute.utils import check_nan_columns 4 | 5 | @check_nan_columns 6 | def listwise_delete(data, inplace=False, verbose=False): 7 | """Delete all rows from a DataFrame where any missing values exist. 8 | 9 | Deletion is one way to handle missing values. This method removes any 10 | records that have a missing value in any of the features. This package 11 | focuses on imputation, not deletion. That being said, listwise deletion 12 | is a necessary component of any imputation package, as it's the default 13 | method most people (and software) use to handle missing data. 14 | 15 | Args: 16 | data (pd.DataFrame): DataFrame used to delete missing rows. 17 | inplace (boolean, optional): perform operation inplace. 18 | Defaults to False. 19 | verbose (boolean, optional): print information to console. 20 | Defaults to False. 21 | 22 | Returns: 23 | pd.DataFrame: rows with missing values removed. 24 | 25 | Raises: 26 | ValueError: columns with all data missing. Raised through decorator. 27 | """ 28 | num_records_before = len(data.index) 29 | if inplace: 30 | data.dropna(inplace=True) 31 | else: 32 | data = data.dropna(inplace=False) 33 | num_records_after = len(data.index) 34 | if verbose: 35 | print(f"Number of records before delete: {num_records_before}") 36 | print(f"Number of records after delete: {num_records_after}") 37 | return data 38 | -------------------------------------------------------------------------------- /autoimpute/imputations/errors.py: -------------------------------------------------------------------------------- 1 | """Private methods for handling errors throughout imputation analysis.""" 2 | 3 | from pandas.api.types import is_string_dtype 4 | from pandas.api.types import is_numeric_dtype 5 | 6 | # ERROR HANDLING 7 | # -------------- 8 | 9 | def _not_num_series(m, s): 10 | """Private method to detect Series that are not numerical.""" 11 | if not is_numeric_dtype(s): 12 | t = s.dtype 13 | err = f"{m} not appropriate for Series {s.name} of type {t}." 14 | raise TypeError(err) 15 | 16 | def _not_num_matrix(m, mat): 17 | """Private method to detect columns of Matrix that are not numerical.""" 18 | try: 19 | for each_col in mat: 20 | c = mat[each_col] 21 | _not_num_series(m, c) 22 | except TypeError as te: 23 | err = f"{m} not appropriate for Matrix with non-numerical columns." 24 | raise TypeError(err) from te 25 | 26 | def _not_cat_series(m, s): 27 | """Private method to detect Series that are not categorical.""" 28 | t = s.dtype 29 | if not is_string_dtype(s) and t != "object": 30 | err = f"{m} not appropriate for Series {s.name} of type {t}." 31 | raise TypeError(err) 32 | 33 | def _not_cat_matrix(m, mat): 34 | """Private method to detect columns of Matrix that are not categorical.""" 35 | try: 36 | for each_col in mat: 37 | c = mat[each_col] 38 | _not_cat_series(m, c) 39 | except TypeError as te: 40 | err = f"{m} not appropriate for Matrix with non-categorical columns." 41 | raise TypeError(err) from te 42 | -------------------------------------------------------------------------------- /autoimpute/imputations/helpers.py: -------------------------------------------------------------------------------- 1 | """Private helper methods for the imputations folder.""" 2 | 3 | import logging 4 | import numpy as np 5 | import pandas as pd 6 | from autoimpute.imputations.deletion import listwise_delete 7 | 8 | def _get_observed(predictors, series, verbose=False): 9 | """Private method to test datasets and get observed data.""" 10 | conc = pd.concat([predictors, series], axis=1) 11 | 12 | # perform listwise delete on predictors and series 13 | # resulting data serves as the `observed` data for fit modeling 14 | predictors = listwise_delete(conc, verbose=verbose) 15 | series = predictors.pop(series.name) 16 | return predictors, series 17 | 18 | def _neighbors(x, n, df, choose):  # choose among observed y of n nearest preds 19 | al = len(df.index) 20 | if n > al: 21 | err = "# neighbors greater than # predictions. Reduce neighbor count." 22 | raise ValueError(err) 23 | indexarr = np.argpartition(abs(df["y_pred"] - x), n)[:n] 24 | neighbs = df.loc[indexarr, "y"].values 25 | return choose(neighbs) 26 | 27 | def _local_residuals(x, n, df, choose):  # neighbors' y shifted by local residuals 28 | al = len(df.index) 29 | if n > al: 30 | err = "# neighbors greater than # predictions. Reduce neighbor count."
31 | raise ValueError(err) 32 | indexarr = np.argpartition(abs(df["y_pred"] - x), n)[:n] 33 | neighbs = df.loc[indexarr, "y"].values 34 | distances = df.loc[indexarr, "y_pred"].values - x 35 | resids = neighbs + distances 36 | return choose(resids) 37 | 38 | def _pymc_logger(verbose=False): 39 | """Private method to handle pymc logging.""" 40 | progress = 1 41 | if not verbose: 42 | progress = 0 43 | logger = logging.getLogger('pymc') 44 | logger.setLevel(logging.ERROR) 45 | return progress 46 | -------------------------------------------------------------------------------- /autoimpute/imputations/method_names.py: -------------------------------------------------------------------------------- 1 | """Module contains names of methods used in Imputer Classes.""" 2 | 3 | DEFAULT = "default" 4 | DEFAULT_PRED = "default predictive" 5 | DEFAULT_UNIVAR = "default univariate" 6 | DEFAULT_TIME = "default time" 7 | MEAN = "mean" 8 | MEDIAN = "median" 9 | MODE = "mode" 10 | RANDOM = "random" 11 | NORM = "norm" 12 | CATEGORICAL = "categorical" 13 | INTERPOLATE = "interpolate" 14 | LOCF = "locf" 15 | NOCB = "nocb" 16 | LS = "least squares" 17 | BINARY_LOGISTIC = "binary logistic" 18 | MULTI_LOGISTIC = "multinomial logistic" 19 | STOCHASTIC = "stochastic" 20 | BAYESIAN_LS = "bayesian least squares" 21 | BAYESIAN_BINARY_LOGISTIC = "bayesian binary logistic" 22 | PMM = "pmm" 23 | LRD = "lrd" 24 | NONE = "none" 25 | NORM_UNIT_VARIANCE = 'normal unit variance' 26 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/__init__.py: -------------------------------------------------------------------------------- 1 | """Manage the series imputations folder from the autoimpute package. 2 | 3 | This module handles imports from the series imputations folder that should be 4 | accessible whenever someone imports autoimpute.imputations.series. Although 5 | these imputers are stand-alone classes, their direct use is discouraged. 6 | More robust imputers from the dataframe folder delegate work to these imputers 7 | whenever their respective strategies are requested. 8 | 9 | This module handles `from autoimpute.imputations.series import *` with the 10 | __all__ variable below. This command imports the main public classes and 11 | methods from autoimpute.imputations.series. 
12 | """ 13 | 14 | from .default import DefaultUnivarImputer 15 | from .default import DefaultTimeSeriesImputer 16 | from .default import DefaultPredictiveImputer 17 | from .mean import MeanImputer 18 | from .median import MedianImputer 19 | from .mode import ModeImputer 20 | from .random import RandomImputer 21 | from .norm import NormImputer 22 | from .categorical import CategoricalImputer 23 | from .ffill import NOCBImputer, LOCFImputer 24 | from .interpolation import InterpolateImputer 25 | from .linear_regression import LeastSquaresImputer, StochasticImputer 26 | from .logistic_regression import BinaryLogisticImputer 27 | from .logistic_regression import MultinomialLogisticImputer 28 | from .bayesian_regression import BayesianLeastSquaresImputer 29 | from .bayesian_regression import BayesianBinaryLogisticImputer 30 | from .pmm import PMMImputer 31 | from .lrd import LRDImputer 32 | from .norm_unit_variance import NormUnitVarianceImputer 33 | 34 | __all__ = [ 35 | "DefaultUnivarImputer", 36 | "DefaultTimeSeriesImputer", 37 | "DefaultPredictiveImputer", 38 | "MeanImputer", 39 | "MedianImputer", 40 | "ModeImputer", 41 | "RandomImputer", 42 | "NormImputer", 43 | "CategoricalImputer", 44 | "NOCBImputer", 45 | "LOCFImputer", 46 | "InterpolateImputer", 47 | "LeastSquaresImputer", 48 | "StochasticImputer", 49 | "BinaryLogisticImputer", 50 | "MultinomialLogisticImputer", 51 | "BayesianLeastSquaresImputer", 52 | "BayesianBinaryLogisticImputer", 53 | "PMMImputer", 54 | "LRDImputer", 55 | "NormUnitVarianceImputer", 56 | ] 57 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/base.py: -------------------------------------------------------------------------------- 1 | """This module implements the abstract base class used for series-imputers. 2 | 3 | The base class is quite simple but important. It specifies the contract for 4 | how series-imputers should behave. All series-imputers should inherit from 5 | this base class to be considered valid Imputers. 6 | """ 7 | 8 | import abc 9 | from sklearn.base import BaseEstimator 10 | 11 | class ISeriesImputer(BaseEstimator, metaclass=abc.ABCMeta): 12 | """ISeriesImputer implements the abstract base class for series-imputers. 13 | 14 | All series imputers should have a fit, impute, and fit_impute method to be 15 | considered valid to build imputation models. The ISeriesImputer is the 16 | contract series-imputers must adhere to.""" 17 | 18 | @abc.abstractmethod 19 | def fit(self, X, y): 20 | """Contract to fit an imputation model. 21 | 22 | Args: 23 | X (pd.Series, pd.Dataframe): data used to build imputation 24 | model. pd.Series if univariate, pd.DataFrame if multivariate. 25 | y (pd.Series, None): column to impute. None if univariate, 26 | pd.Series if multivariate. 27 | 28 | Returns: 29 | self: instance of a class 30 | """ 31 | 32 | @abc.abstractmethod 33 | def impute(self, X): 34 | """Contract to impute using a fit imputation model. 35 | 36 | Args: 37 | X (pd.Series, pd.DataFrame): data to use to generate imputations. 38 | pd.Series if univariate, pd.DataFrame if multivariate. 39 | 40 | Returns: 41 | imputations, scalar or array depending on imputation model. 42 | """ 43 | 44 | @abc.abstractmethod 45 | def fit_impute(self, X, y): 46 | """Convenience method that implements fit & impute in one go. 47 | 48 | Args: 49 | X (pd.Series, pd.Dataframe): data used to build imputation 50 | model. pd.Series if univariate, pd.DataFrame if multivariate. 51 | y (pd.Series, None): column to impute. 
-------------------------------------------------------------------------------- /autoimpute/imputations/series/base.py: -------------------------------------------------------------------------------- 1 | """This module implements the abstract base class used for series-imputers. 2 | 3 | The base class is quite simple but important. It specifies the contract for 4 | how series-imputers should behave. All series-imputers should inherit from 5 | this base class to be considered valid Imputers. 6 | """ 7 | 8 | import abc 9 | from sklearn.base import BaseEstimator 10 | 11 | class ISeriesImputer(BaseEstimator, metaclass=abc.ABCMeta): 12 | """ISeriesImputer implements the abstract base class for series-imputers. 13 | 14 | All series imputers should have a fit, impute, and fit_impute method to be 15 | considered valid to build imputation models. The ISeriesImputer is the 16 | contract series-imputers must adhere to.""" 17 | 18 | @abc.abstractmethod 19 | def fit(self, X, y): 20 | """Contract to fit an imputation model. 21 | 22 | Args: 23 | X (pd.Series, pd.DataFrame): data used to build imputation 24 | model. pd.Series if univariate, pd.DataFrame if multivariate. 25 | y (pd.Series, None): column to impute. None if univariate, 26 | pd.Series if multivariate. 27 | 28 | Returns: 29 | self: instance of a class 30 | """ 31 | 32 | @abc.abstractmethod 33 | def impute(self, X): 34 | """Contract to impute using a fit imputation model. 35 | 36 | Args: 37 | X (pd.Series, pd.DataFrame): data to use to generate imputations. 38 | pd.Series if univariate, pd.DataFrame if multivariate. 39 | 40 | Returns: 41 | imputations, scalar or array depending on imputation model. 42 | """ 43 | 44 | @abc.abstractmethod 45 | def fit_impute(self, X, y): 46 | """Convenience method that implements fit & impute in one go. 47 | 48 | Args: 49 | X (pd.Series, pd.DataFrame): data used to build imputation 50 | model. pd.Series if univariate, pd.DataFrame if multivariate. 51 | y (pd.Series, None): column to impute. None if univariate, 52 | pd.Series if multivariate. 53 | 54 | Returns: 55 | imputations, scalar or array depending on imputation model. 56 | """ 57 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/categorical.py: -------------------------------------------------------------------------------- 1 | """This module implements categorical imputation via the CategoricalImputer. 2 | 3 | The CategoricalImputer determines the proportions of discrete features within 4 | observed data. It then samples this distribution to impute missing values. 5 | Dataframe imputers utilize this class when its strategy is requested. Use 6 | SingleImputer or MultipleImputer with strategy = `categorical` to broadcast 7 | the strategy across all the columns in a dataframe, or specify this strategy 8 | for a given column. 9 | """ 10 | 11 | import numpy as np 12 | from sklearn.utils.validation import check_is_fitted 13 | from autoimpute.imputations import method_names 14 | from autoimpute.imputations.errors import _not_cat_series 15 | from .base import ISeriesImputer 16 | methods = method_names 17 | # pylint:disable=attribute-defined-outside-init 18 | # pylint:disable=unnecessary-pass 19 | 20 | class CategoricalImputer(ISeriesImputer): 21 | """Impute missing data w/ draw from dataset's categorical distribution. 22 | 23 | The categorical imputer computes the proportion of observed values for 24 | each category within a discrete dataset. The imputer then samples the 25 | distribution to impute missing values with a respective random draw. The 26 | imputer can be used directly, but such behavior is discouraged. 27 | CategoricalImputer does not have the flexibility / robustness of dataframe 28 | imputers, nor is its behavior identical. Preferred use is 29 | MultipleImputer(strategy="categorical"). 30 | """ 31 | # class variables 32 | strategy = methods.CATEGORICAL 33 | 34 | def __init__(self): 35 | """Create an instance of the CategoricalImputer class.""" 36 | pass 37 | 38 | def fit(self, X, y=None): 39 | """Fit the Imputer to the dataset and calculate proportions. 40 | 41 | Args: 42 | X (pd.Series): Dataset to fit the imputer. 43 | y (None): ignored, None to meet requirements of base class 44 | 45 | Returns: 46 | self. Instance of the class. 47 | """ 48 | _not_cat_series(self.strategy, X) 49 | # get proportions of discrete observed values to sample from 50 | proportions = X.value_counts() / np.sum(~X.isnull()) 51 | self.statistics_ = {"param": proportions, "strategy": self.strategy} 52 | return self 53 | 54 | def impute(self, X): 55 | """Perform imputations using the statistics generated from fit. 56 | 57 | The impute method handles the actual imputation. It 58 | constructs a categorical distribution for each feature using the 59 | proportions of observed values from fit. It then imputes missing 60 | values with a random draw from the respective distribution. 61 | 62 | Args: 63 | X (pd.Series): Dataset to impute missing data from fit. 64 | 65 | Returns: 66 | np.array -- imputed dataset.
67 | """ 68 | # check if fitted and identify location of missingness 69 | check_is_fitted(self, "statistics_") 70 | _not_cat_series(self.strategy, X) 71 | ind = X[X.isnull()].index 72 | 73 | # get observed weighted by count of total and sample 74 | param = self.statistics_["param"] 75 | cats = param.index 76 | proportions = param.tolist() 77 | imp = np.random.choice(cats, size=len(ind), p=proportions) 78 | return imp 79 | 80 | def fit_impute(self, X, y=None): 81 | """Convenience method to perform fit and imputation in one go.""" 82 | return self.fit(X, y).impute(X) 83 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/ffill.py: -------------------------------------------------------------------------------- 1 | """This module implements forward & backward imputation via two Imputers. 2 | 3 | The LOCFImputer carries the last observation forward (locf) to impute missing 4 | data in a time series. NOCBImputer carries the next observation backward (nocb) 5 | to impute missing data in a time series. Dataframe imputers utilize these 6 | classes when each's strategy is requested. Use SingleImputer or MultipleImputer 7 | with strategy = `locf` or `nocb` to broadcast either strategy across all the 8 | columns in a dataframe, or specify either strategy for a given column. 9 | """ 10 | 11 | import pandas as pd 12 | from sklearn.utils.validation import check_is_fitted 13 | from autoimpute.imputations import method_names 14 | from .base import ISeriesImputer 15 | methods = method_names 16 | # pylint:disable=attribute-defined-outside-init 17 | # pylint:disable=unnecessary-pass 18 | # pylint:disable=unused-argument 19 | 20 | class LOCFImputer(ISeriesImputer): 21 | """Impute missing values by carrying the last observation forward. 22 | 23 | LOCFImputer carries the last observation forward to impute missing data. 24 | The imputer can be used directly, but such behavior is discouraged. 25 | LOCFImputer does not have the flexibility / robustness of dataframe 26 | imputers, nor is its behavior identical. Preferred use is 27 | MultipleImputer(strategy="locf"). 28 | """ 29 | # class variables 30 | strategy = methods.LOCF 31 | 32 | def __init__(self, start=None): 33 | """Create an instance of the LOCFImputer class. 34 | 35 | Args: 36 | start (any, optional): can be any value to impute first if first 37 | is missing. Default is None, which ends up taking first 38 | observed value found. Can also use "mean" to start with mean 39 | of the series. 40 | 41 | Returns: 42 | self. Instance of class. 43 | """ 44 | self.start = start 45 | 46 | def _handle_start(self, v, X): 47 | "private method to handle start values." 48 | if v is None: 49 | v = X.loc[X.first_valid_index()] 50 | if v == "mean": 51 | v = X.mean() 52 | return v 53 | 54 | def fit(self, X, y=None): 55 | """Fit the Imputer to the dataset. 56 | 57 | Args: 58 | X (pd.Series): Dataset to fit the imputer. 59 | y (None): ignored, None to meet requirements of base class 60 | 61 | 62 | Returns: 63 | self. Instance of the class. 64 | """ 65 | self.statistics_ = {"param": None, "strategy": self.strategy} 66 | return self 67 | 68 | def impute(self, X): 69 | """Perform imputations using the statistics generated from fit. 70 | 71 | The impute method handles the actual imputation. Missing values 72 | in a given dataset are replaced with the last observation carried 73 | forward. 74 | 75 | Args: 76 | X (pd.Series): Dataset to impute missing data from fit. 77 | 78 | Returns: 79 | np.array -- imputed dataset. 
80 | """ 81 | # check if fitted then impute with mean if first value 82 | # or impute with observation carried forward otherwise 83 | check_is_fitted(self, "statistics_") 84 | 85 | # handle start... 86 | if pd.isnull(X.iloc[0]): 87 | ix = X.head(1).index[0] 88 | X.fillna( 89 | {ix: self._handle_start(self.start, X)}, inplace=True 90 | ) 91 | return X.fillna(method="ffill", inplace=False) 92 | 93 | def fit_impute(self, X, y=None): 94 | """Convenience method to perform fit and imputation in one go.""" 95 | return self.fit(X, y).impute(X) 96 | 97 | class NOCBImputer(ISeriesImputer): 98 | """Impute missing data by carrying the next observation backward. 99 | 100 | NOCBImputer carries the next observation backward to impute missing data. 101 | The imputer can be used directly, but such behavior is discouraged. 102 | NOCBImputer does not have the flexibility / robustness of dataframe 103 | imputers, nor is its behavior identical. Preferred use is 104 | MultipleImputer(strategy="nocb"). 105 | """ 106 | # class variables 107 | strategy = methods.NOCB 108 | 109 | def __init__(self, end=None): 110 | """Create an instance of the NOCBImputer class. 111 | 112 | Args: 113 | end (any, optional): can be any value to impute end if end 114 | is missing. Default is None, which ends up taking last 115 | observed value found. Can also use "mean" to end with 116 | mean of the series. 117 | 118 | Returns: 119 | self. Instance of class. 120 | """ 121 | self.end = end 122 | 123 | def _handle_end(self, v, X): 124 | "private method to handle end values." 125 | if v is None: 126 | v = X.loc[X.last_valid_index()] 127 | if v == "mean": 128 | v = X.mean() 129 | return v 130 | 131 | def fit(self, X, y=None): 132 | """Fit the Imputer to the dataset and calculate the mean. 133 | 134 | Args: 135 | X (pd.Series): Dataset to fit the imputer 136 | y (None): ignored, None to meet requirements of base class 137 | 138 | 139 | Returns: 140 | self. Instance of the class. 141 | """ 142 | self.statistics_ = {"param": None, "strategy": self.strategy} 143 | return self 144 | 145 | def impute(self, X): 146 | """Perform imputations using the statistics generated from fit. 147 | 148 | The impute method handles the actual imputation. Missing values 149 | in a given dataset are replaced with the next observation carried 150 | backward. 151 | 152 | Args: 153 | X (pd.Series): Dataset to impute missing data from fit. 154 | 155 | Returns: 156 | np.array -- imputed dataset. 157 | """ 158 | # check if fitted then impute with mean if first value 159 | # or impute with observation carried backward otherwise 160 | check_is_fitted(self, "statistics_") 161 | 162 | # handle end... 163 | if pd.isnull(X.iloc[-1]): 164 | ix = X.tail(1).index[0] 165 | X.fillna( 166 | {ix: self._handle_end(self.end, X)}, inplace=True 167 | ) 168 | return X.fillna(method="bfill", inplace=False) 169 | 170 | def fit_impute(self, X, y=None): 171 | """Convenience method to perform fit and imputation in one go.""" 172 | return self.fit(X, y).impute(X) 173 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/interpolation.py: -------------------------------------------------------------------------------- 1 | """This module implements interpolation methods via the InterpolateImputer. 2 | 3 | InterpolateImputer imputes missing data using some interpolation strategies 4 | suppoted by pd.Series.interpolate. Linear is the default strategy, although a 5 | number of additional strategies exist. 
Dataframe imputers utilize this class 6 | when its strategy is requested. Use SingleImputer or MultipleImputer with 7 | strategy = `interpolate` to broadcast the strategy across all the columns in a 8 | dataframe, or specify this strategy for a given column. 9 | """ 10 | 11 | import pandas as pd 12 | from sklearn.utils.validation import check_is_fitted 13 | from autoimpute.imputations import method_names 14 | from .base import ISeriesImputer 15 | methods = method_names 16 | # pylint:disable=attribute-defined-outside-init 17 | # pylint:disable=unnecessary-pass 18 | # pylint:disable=unused-argument 19 | 20 | class InterpolateImputer(ISeriesImputer): 21 | """Impute missing values using interpolation techniques. 22 | 23 | The InterpolateImputer imputes missing values using a valid pd.Series 24 | interpolation strategy. See __init__ method docs for supported strategies. 25 | The imputer can be used directly, but such behavior is discouraged. 26 | InterpolateImputer does not have the flexibility / robustness of dataframe 27 | imputers, nor is its behavior identical. Preferred use is 28 | MultipleImputer(strategy="interpolate"). 29 | """ 30 | # class variables 31 | strategy = methods.INTERPOLATE 32 | fill_strategies = ( 33 | "linear", "time", "quadratic", "cubic", 34 | "spline", "barycentric", "polynomial" 35 | ) 36 | 37 | def __init__(self, fill_strategy="linear", 38 | start=None, end=None, order=None): 39 | """Create an instance of the InterpolateImputer class. 40 | 41 | Args: 42 | fill_strategy (str, Optional): type of interpolation to perform. 43 | Default is linear. Other strategies supported include: 44 | `time`, `quadratic`, `cubic`, `spline`, `barycentric`, 45 | `polynomial`. 46 | start (int, Optional): value to impute if first number in 47 | Series is missing. Default is None, but first valid used 48 | when required for quadratic, cubic, polynomial. 49 | end (int, Optional): value to impute if last number in 50 | Series is missing. Default is None, but last valid used 51 | when required for quadratic, cubic, polynomial. 52 | order (int, Optional): if strategy is spline or polynomial, 53 | order must be number. Otherwise not considered. 54 | 55 | Returns: 56 | self. Instance of the class. 57 | """ 58 | self.fill_strategy = fill_strategy 59 | self.start = start 60 | self.end = end 61 | self.order = order 62 | 63 | @property 64 | def fill_strategy(self): 65 | """Property getter to return the value of fill_strategy property.""" 66 | return self._fill_strategy 67 | 68 | @fill_strategy.setter 69 | def fill_strategy(self, fs): 70 | """Validate the fill_strategy property and set default parameters. 71 | 72 | Args: 73 | fs (str, Optional): if None, use linear. 74 | 75 | Raises: 76 | ValueError: not a valid fill strategy for InterpolateImputer 77 | """ 78 | if fs not in self.fill_strategies: 79 | err = f"{fs} not a valid fill strategy for InterpolateImputer" 80 | raise ValueError(err) 81 | self._fill_strategy = fs 82 | 83 | def _handle_start(self, v, X): 84 | "private method to handle start values." 85 | if v is None: 86 | v = X.loc[X.first_valid_index()] 87 | if v == "mean": 88 | v = X.mean() 89 | return v 90 | 91 | def _handle_end(self, v, X): 92 | "private method to handle end values." 93 | if v is None: 94 | v = X.loc[X.last_valid_index()] 95 | if v == "mean": 96 | v = X.mean() 97 | return v 98 | 99 | def fit(self, X, y=None): 100 | """Fit the Imputer to the dataset. Nothing to calculate. 101 | 102 | Args: 103 | X (pd.Series): Dataset to fit the imputer.
104 | y (None): ignored, None to meet requirements of base class 105 | 106 | Returns: 107 | self. Instance of the class. 108 | """ 109 | self.statistics_ = {"param": self.fill_strategy, 110 | "strategy": self.strategy} 111 | return self 112 | 113 | def impute(self, X): 114 | """Perform imputations using the statistics generated from fit. 115 | 116 | The impute method handles the actual imputation. Missing values 117 | in a given dataset are replaced with results from interpolation. 118 | 119 | Args: 120 | X (pd.Series): Dataset to impute missing data from fit. 121 | 122 | Returns: 123 | np.array -- imputed dataset. 124 | """ 125 | # check if fitted then impute with interpolation strategy 126 | check_is_fitted(self, "statistics_") 127 | imp = self.statistics_["param"] 128 | 129 | # setting defaults if no value passed for start and end 130 | # quadratic, cubic, and polynomial require first and last 131 | if imp in ("quadratic", "cubic", "polynomial"): 132 | # handle start and end... 133 | if pd.isnull(X.iloc[0]): 134 | ix = X.head(1).index[0] 135 | X.fillna( 136 | {ix: self._handle_start(self.start, X)}, inplace=True 137 | ) 138 | if pd.isnull(X.iloc[-1]): 139 | ix = X.tail(1).index[0] 140 | X.fillna( 141 | {ix: self._handle_end(self.end, X)}, inplace=True 142 | ) 143 | 144 | # handling for methods that need order 145 | num_observed = min(6, X.count()) 146 | if imp in ("polynomial", "spline"): 147 | if self.order is None or self.order >= num_observed: 148 | err = f"Order must be between 1 and {num_observed-1}" 149 | raise ValueError(err) 150 | 151 | # finally, perform interpolation 152 | return X.interpolate(method=imp, 153 | limit=None, 154 | limit_direction="both", 155 | inplace=False, 156 | order=self.order) 157 | 158 | def fit_impute(self, X, y=None): 159 | """Convenience method to perform fit and imputation in one go.""" 160 | return self.fit(X, y).impute(X) 161 |
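A minimal sketch of InterpolateImputer on an illustrative Series; the default linear strategy fills gaps along the line between observed points:

import numpy as np
import pandas as pd
from autoimpute.imputations.series import InterpolateImputer

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
imp = InterpolateImputer(fill_strategy="linear")
print(imp.fit_impute(s).tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]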
-------------------------------------------------------------------------------- /autoimpute/imputations/series/linear_regression.py: -------------------------------------------------------------------------------- 1 | """This module implements least squares and stochastic imputation. 2 | 3 | This module contains the LeastSquaresImputer and the StochasticImputer. Both 4 | use least squares to find a line of best fit and fill imputations with the 5 | predictions from the line. Stochastic adds random error to each prediction. 6 | Dataframe imputers utilize this class when its strategy is requested. Use 7 | SingleImputer or MultipleImputer with strategy = `least squares` to broadcast 8 | the strategy across all the columns in a dataframe, or specify this strategy 9 | for a given column. 10 | """ 11 | 12 | from numpy import sqrt 13 | from scipy.stats import norm 14 | from sklearn.utils.validation import check_is_fitted 15 | from sklearn.linear_model import LinearRegression 16 | from sklearn.metrics import mean_squared_error 17 | from autoimpute.imputations import method_names 18 | from autoimpute.imputations.errors import _not_num_series 19 | from .base import ISeriesImputer 20 | methods = method_names 21 | # pylint:disable=attribute-defined-outside-init 22 | # pylint: 23 | 24 | class LeastSquaresImputer(ISeriesImputer): 25 | """Impute missing values using predictions from least squares regression. 26 | 27 | The LeastSquaresImputer produces predictions using the least squares 28 | methodology. The predictions from the line of best fit given a set of 29 | predictors become the imputations. To implement least squares, the imputer 30 | wraps the sklearn LinearRegression class. The imputer can be used 31 | directly, but such behavior is discouraged. LeastSquaresImputer does not 32 | have the flexibility / robustness of dataframe imputers, nor is its 33 | behavior identical. Preferred use is 34 | MultipleImputer(strategy="least squares"). 35 | """ 36 | # class variables 37 | strategy = methods.LS 38 | 39 | def __init__(self, **kwargs): 40 | """Create an instance of the LeastSquaresImputer class. 41 | 42 | Args: 43 | **kwargs: keyword arguments passed to LinearRegression 44 | 45 | """ 46 | self.lm = LinearRegression(**kwargs) 47 | 48 | def fit(self, X, y): 49 | """Fit the Imputer to the dataset by fitting linear model. 50 | 51 | Args: 52 | X (pd.DataFrame): dataset to fit the imputer. 53 | y (pd.Series): response, which is eventually imputed. 54 | 55 | Returns: 56 | self. Instance of the class. 57 | """ 58 | _not_num_series(self.strategy, y) 59 | self.lm.fit(X, y) 60 | self.statistics_ = {"strategy": self.strategy} 61 | return self 62 | 63 | def impute(self, X): 64 | """Generate imputations using predictions from the fit linear model. 65 | 66 | The impute method returns the values for imputation. Missing values 67 | in a given dataset are replaced with the predictions from the least 68 | squares regression line of best fit. This impute method returns 69 | those predictions. 70 | 71 | Args: 72 | X (pd.DataFrame): predictors to determine imputed values. 73 | 74 | Returns: 75 | np.array: imputed dataset. 76 | """ 77 | # check if fitted then predict with least squares 78 | check_is_fitted(self, "statistics_") 79 | imp = self.lm.predict(X) 80 | return imp 81 | 82 | def fit_impute(self, X, y): 83 | """Fit impute method to generate imputations where y is missing. 84 | 85 | Args: 86 | X (pd.DataFrame): predictors in the dataset. 87 | y (pd.Series): response w/ missing values to impute. 88 | 89 | Returns: 90 | np.array: imputed dataset. 91 | """ 92 | # transform occurs with records from X where y is missing 93 | miss_y_ix = y[y.isnull()].index 94 | return self.fit(X, y).impute(X.loc[miss_y_ix]) 95 | 96 | class StochasticImputer(ISeriesImputer): 97 | """Impute missing values adding error to least squares regression preds. 98 | 99 | The StochasticImputer predicts using the least squares methodology. The 100 | imputer then samples from the regression's error distribution and adds the 101 | random draw to the prediction. This draw adds the stochastic element to 102 | the imputations. The imputer can be used directly, but such behavior is 103 | discouraged. StochasticImputer does not have the flexibility / robustness 104 | of dataframe imputers, nor is its behavior identical. Preferred use is 105 | MultipleImputer(strategy="stochastic"). 106 | """ 107 | # class variables 108 | strategy = methods.STOCHASTIC 109 | 110 | def __init__(self, **kwargs): 111 | """Create an instance of the StochasticImputer class. 112 | 113 | Args: 114 | **kwargs: keyword arguments passed to LinearRegression. 115 | 116 | """ 117 | self.lm = LinearRegression(**kwargs) 118 | 119 | def fit(self, X, y): 120 | """Fit the Imputer to the dataset by fitting linear model. 121 | 122 | The fit step also generates predictions on the observed data. These 123 | predictions are necessary to derive the mean_squared_error, which is 124 | passed as a parameter to the impute phase. The MSE is used to create 125 | the normal error distribution from which the imputer draws.
126 | 127 | Args: 128 | X (pd.DataFrame): dataset to fit the imputer. 129 | y (pd.Series): response, which is eventually imputed. 130 | 131 | Returns: 132 | self. Instance of the class. 133 | """ 134 | _not_num_series(self.strategy, y) 135 | self.lm.fit(X, y) 136 | preds = self.lm.predict(X) 137 | mse = mean_squared_error(y, preds) 138 | self.statistics_ = {"param": mse, "strategy": self.strategy} 139 | return self 140 | 141 | def impute(self, X): 142 | """Generate imputations using predictions from the fit linear model. 143 | 144 | The impute method returns the values for imputation. Missing values 145 | in a given dataset are replaced with the predictions from the least 146 | squares regression line of best fit plus a random draw from the normal 147 | error distribution. 148 | 149 | Args: 150 | X (pd.DataFrame): predictors to determine imputed values. 151 | 152 | Returns: 153 | np.array: imputed dataset. 154 | """ 155 | # check if fitted then predict with least squares 156 | check_is_fitted(self, "statistics_") 157 | mse = self.statistics_["param"] 158 | preds = self.lm.predict(X) 159 | 160 | # add random draw from normal dist w/ mean squared error 161 | # from observed model. This makes lm stochastic 162 | mse_dist = norm.rvs(loc=0, scale=sqrt(mse), size=len(preds)) 163 | imp = preds + mse_dist 164 | return imp 165 | 166 | def fit_impute(self, X, y): 167 | """Fit impute method to generate imputations where y is missing. 168 | 169 | Args: 170 | X (pd.DataFrame): predictors in the dataset. 171 | y (pd.Series): response w/ missing values to impute 172 | 173 | Returns: 174 | np.array: imputed dataset. 175 | """ 176 | # transform occurs with records from X where y is missing 177 | miss_y_ix = y[y.isnull()].index 178 | return self.fit(X, y).impute(X.loc[miss_y_ix]) 179 |
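A minimal sketch of StochasticImputer with illustrative data. The dataframe imputers normally split observed from missing rows; used directly, that split is done by hand:

import numpy as np
import pandas as pd
from autoimpute.imputations.series import StochasticImputer

X = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0, 5.0]})
y = pd.Series([2.0, 4.1, 6.0, 7.9, np.nan], name="y")

obs = y.notnull()
imp = StochasticImputer().fit(X[obs], y[obs])  # fit on observed rows only
print(imp.impute(X[~obs]))  # least squares prediction plus a random error draw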
-------------------------------------------------------------------------------- /autoimpute/imputations/series/logistic_regression.py: -------------------------------------------------------------------------------- 1 | """This module implements logistic regression imputation. 2 | 3 | This module contains the BinaryLogisticImputer and the MultinomialLogisticImputer. 4 | Both use logistic regression to generate class predictions that become values 5 | for imputations of missing data. Binary is optimized to deal with two classes, 6 | while Multi is optimized to deal with multiple classes. Dataframe imputers 7 | utilize these classes when each's strategy is requested. Use SingleImputer or 8 | MultipleImputer with strategy = `binary logistic` or `multinomial logistic` 9 | to broadcast either strategy across all the columns in a dataframe, or specify 10 | either strategy for a given column. 11 | """ 12 | 13 | import warnings 14 | from pandas import Series 15 | from sklearn.utils.validation import check_is_fitted 16 | from sklearn.linear_model import LogisticRegression 17 | from autoimpute.imputations import method_names 18 | from .base import ISeriesImputer 19 | methods = method_names 20 | # pylint:disable=attribute-defined-outside-init 21 | # pylint: 22 | 23 | class BinaryLogisticImputer(ISeriesImputer): 24 | """Impute missing values w/ predictions from binary logistic regression. 25 | 26 | The BinaryLogisticImputer produces predictions using logistic regression 27 | with two classes. The class predictions given a set of predictors become 28 | the imputations. To implement logistic regression, the imputer wraps the 29 | sklearn LogisticRegression class with a default solver (liblinear). The 30 | imputer can be used directly, but such behavior is discouraged. 31 | BinaryLogisticImputer does not have the flexibility / robustness of 32 | dataframe imputers, nor is its behavior identical. Preferred use is 33 | MultipleImputer(strategy="binary logistic"). 34 | """ 35 | # class variables 36 | strategy = methods.BINARY_LOGISTIC 37 | 38 | def __init__(self, **kwargs): 39 | """Create an instance of the BinaryLogisticImputer class. 40 | 41 | Args: 42 | **kwargs: keyword arguments passed to LogisticRegression. 43 | 44 | """ 45 | self.solver = kwargs.pop("solver", "liblinear") 46 | self.glm = LogisticRegression(solver=self.solver, **kwargs) 47 | 48 | def fit(self, X, y): 49 | """Fit the Imputer to the dataset by fitting logistic model. 50 | 51 | Args: 52 | X (pd.DataFrame): dataset to fit the imputer. 53 | y (pd.Series): response, which is eventually imputed. 54 | 55 | Returns: 56 | self. Instance of the class. 57 | """ 58 | y = y.astype("category").cat 59 | y_cat_l = len(y.codes.unique()) 60 | if y_cat_l > 2: 61 | err = "Binary requires 2 categories. Use multinomial instead." 62 | raise ValueError(err) 63 | self.glm.fit(X, y.codes) 64 | self.statistics_ = {"param": y.categories, "strategy": self.strategy} 65 | return self 66 | 67 | def impute(self, X): 68 | """Generate imputations using predictions from the fit logistic model. 69 | 70 | The impute method returns the values for imputation. Missing values 71 | in a given dataset are replaced with the predictions from the logistic 72 | regression class specification. 73 | 74 | Args: 75 | X (pd.DataFrame): predictors to determine imputed values. 76 | 77 | Returns: 78 | np.array: imputed dataset. 79 | """ 80 | # check if fitted then predict with logistic 81 | check_is_fitted(self, "statistics_") 82 | labels = self.statistics_["param"] 83 | preds = self.glm.predict(X) 84 | 85 | # map category codes back to actual labels 86 | # then impute the actual labels to keep categories intact 87 | label_dict = {i:j for i, j in enumerate(labels.values)} 88 | imp = Series(preds).replace(label_dict, inplace=False) 89 | return imp.values 90 | 91 | def fit_impute(self, X, y): 92 | """Fit impute method to generate imputations where y is missing. 93 | 94 | Args: 95 | X (pd.DataFrame): predictors in the dataset. 96 | y (pd.Series): response w/ missing values to impute. 97 | 98 | Returns: 99 | np.array: imputed dataset. 100 | """ 101 | # transform occurs with records from X where y is missing 102 | miss_y_ix = y[y.isnull()].index 103 | return self.fit(X, y).impute(X.loc[miss_y_ix]) 104 | 105 | class MultinomialLogisticImputer(ISeriesImputer): 106 | """Impute missing values w/ preds from multinomial logistic regression. 107 | 108 | The MultinomialLogisticImputer produces predictions w/ logistic regression 109 | with more than two classes. Class predictions given a set of predictors 110 | become the imputations. To implement logistic regression, the imputer 111 | wraps the sklearn LogisticRegression class with a default solver (saga) 112 | and default `multi_class` set to multinomial. The imputer can be used 113 | directly, but such behavior is discouraged. MultinomialLogisticImputer 114 | does not have the flexibility / robustness of dataframe imputers, nor is 115 | its behavior identical. Preferred use is 116 | MultipleImputer(strategy="multinomial logistic"). 117 | """ 118 | # class variables 119 | strategy = methods.MULTI_LOGISTIC 120 | 121 | def __init__(self, **kwargs): 122 | """Create an instance of the MultinomialLogisticImputer class.
123 | 124 | Args: 125 | **kwargs: keyword arguments passed to LogisticRegression. 126 | 127 | """ 128 | self.solver = kwargs.pop("solver", "saga") 129 | self.multiclass = kwargs.pop("multi_class", "multinomial") 130 | self.glm = LogisticRegression( 131 | solver=self.solver, 132 | multi_class=self.multiclass, 133 | **kwargs 134 | ) 135 | 136 | def fit(self, X, y): 137 | """Fit the Imputer to the dataset by fitting logistic model. 138 | 139 | Args: 140 | X (pd.DataFrame): dataset to fit the imputer. 141 | y (pd.Series): response, which is eventually imputed. 142 | 143 | Returns: 144 | self. Instance of the class. 145 | """ 146 | y = y.astype("category").cat 147 | y_cat_l = len(y.codes.unique()) 148 | if y_cat_l == 2: 149 | w = "Multiple categories (c) expected. Use binary instead if c=2." 150 | warnings.warn(w) 151 | self.glm.fit(X, y.codes) 152 | self.statistics_ = {"param": y.categories, "strategy": self.strategy} 153 | return self 154 | 155 | def impute(self, X): 156 | """Generate imputations using predictions from the fit logistic model. 157 | 158 | The impute method returns the values for imputation. Missing values 159 | in a given dataset are replaced with the predictions from the logistic 160 | regression class specification. 161 | 162 | Args: 163 | X (pd.DataFrame): predictors to determine imputed values. 164 | 165 | Returns: 166 | np.array: imputed dataset. 167 | """ 168 | # check if fitted then predict with logistic 169 | check_is_fitted(self, "statistics_") 170 | labels = self.statistics_["param"] 171 | preds = self.glm.predict(X) 172 | 173 | # map category codes back to actual labels 174 | # then impute the actual labels to keep categories intact 175 | label_dict = {i:j for i, j in enumerate(labels.values)} 176 | imp = Series(preds).replace(label_dict, inplace=False) 177 | return imp.values 178 | 179 | def fit_impute(self, X, y): 180 | """Fit impute method to generate imputations where y is missing. 181 | 182 | Args: 183 | X (pd.DataFrame): predictors in the dataset. 184 | y (pd.Series): response w/ missing values to impute. 185 | 186 | Returns: 187 | np.array: imputed dataset. 188 | """ 189 | # transform occurs with records from X where y is missing 190 | miss_y_ix = y[y.isnull()].index 191 | return self.fit(X, y).impute(X.loc[miss_y_ix]) 192 |
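A minimal sketch of BinaryLogisticImputer with illustrative data, again splitting observed from missing rows by hand:

import numpy as np
import pandas as pd
from autoimpute.imputations.series import BinaryLogisticImputer

X = pd.DataFrame({"x1": [0.2, 0.9, 1.8, 2.7, 3.1, 0.4]})
y = pd.Series(["no", "yes", "yes", "yes", "yes", np.nan], name="y")

obs = y.notnull()
imp = BinaryLogisticImputer().fit(X[obs], y[obs])  # fit on observed rows only
print(imp.impute(X[~obs]))  # predicted class labels mapped back from category codes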
9 | """ 10 | 11 | import numpy as np 12 | import pymc as pm 13 | from pandas import DataFrame 14 | from scipy.stats import multivariate_normal 15 | from sklearn.linear_model import LinearRegression 16 | from sklearn.utils.validation import check_is_fitted 17 | from autoimpute.imputations import method_names 18 | from autoimpute.imputations.errors import _not_num_series 19 | from autoimpute.imputations.helpers import _local_residuals 20 | from .base import ISeriesImputer 21 | methods = method_names 22 | # pylint:disable=attribute-defined-outside-init 23 | # pylint:disable=no-member 24 | # pylint:disable=unused-variable 25 | 26 | class LRDImputer(ISeriesImputer): 27 | """Impute missing values using local residual draws. 28 | 29 | The LRDImputer produces predictions using a combination of bayesian 30 | approach to least squares and least squares itself. For each missing value 31 | LRD finds the `n` closest neighbors from a least squares regression 32 | prediction set, and samples from the corresponding true values for y of 33 | each of those `n` predictions. The imputation is the resulting sample plus 34 | the residual, or the distance between the prediction and the neighbor. 35 | To implement bayesian least squares, the imputer utlilizes the pymc 36 | library. The imputer can be used directly, but such behavior is 37 | discouraged. LRDImputer does not have the flexibility / robustness of 38 | dataframe imputers, nor is its behavior identical. Preferred use is 39 | MultipleImputer(strategy="lrd"). 40 | """ 41 | # class variables 42 | strategy = methods.LRD 43 | 44 | def __init__(self, **kwargs): 45 | """Create an instance of the LRDImputer class. 46 | 47 | The class requires multiple arguments necessary to create priors for 48 | a bayesian linear regression equation and least squares itself. 49 | Therefore, LRD arguments include all of those seen in bayesian least 50 | squares and least squares itself. New parameters include `neighbors`, 51 | or the number of neighbors that LRD uses to sample observed. 52 | 53 | Args: 54 | **kwargs: default keyword arguments for lm & bayesian analysis. 55 | Note - kwargs popped for default arguments defined below. 56 | Next set of kwargs popped and sent to linear regression. 57 | Rest of kwargs passed as params to sampling (see pymc). 58 | am (float, Optional): mean of alpha prior. Default 0. 59 | asd (float, Optional): std. deviation of alpha prior. Default 10. 60 | bm (float, Optional): mean of beta priors. Default 0. 61 | bsd (float, Optional): std. deviation of beta priors. Default 10. 62 | sig (float, Optional): parameter of sigma prior. Default 1. 63 | sample (int, Optional): number of posterior samples per chain. 64 | Default = 1000. More samples, longer to run, but better 65 | approximation of the posterior & chance of convergence. 66 | tune (int, Optional): parameter for tuning. Draws done in addition 67 | to sample. Default = 1000. 68 | init (str, Optional): MCMC algo to use for posterior sampling. 69 | Default = 'auto'. See pymc docs for more info on choices. 70 | fill_value (str, Optional): How to draw from the posterior to 71 | create imputations. Default is "random". 'random' and 'mean' 72 | supported for explicit options. 73 | neighbors (int, Optional): number of neighbors. Default is 5. 74 | Value should be greater than 0 and less than # observed, 75 | although anything greater than 10-20 generally too high 76 | unless dataset is massive. 77 | fit_intercept (bool, Optional): sklearn LinearRegression param. 
78 | copy_x (bool, Optional): sklearn LinearRegression param. 79 | n_jobs (int, Optional): sklearn LinearRegression param. 80 | """ 81 | self.am = kwargs.pop("am", None) 82 | self.asd = kwargs.pop("asd", 10) 83 | self.bm = kwargs.pop("bm", None) 84 | self.bsd = kwargs.pop("bsd", 10) 85 | self.sig = kwargs.pop("sig", 1) 86 | self.sample = kwargs.pop("sample", 1000) 87 | self.tune = kwargs.pop("tune", 1000) 88 | self.init = kwargs.pop("init", "auto") 89 | self.fill_value = kwargs.pop("fill_value", "random") 90 | self.neighbors = kwargs.pop("neighbors", 5) 91 | self.fit_intercept = kwargs.pop("fit_intercept", True) 92 | self.copy_x = kwargs.pop("copy_x", True) 93 | self.n_jobs = kwargs.pop("n_jobs", None) 94 | self.lm = LinearRegression( 95 | fit_intercept=self.fit_intercept, 96 | copy_X=self.copy_x, 97 | n_jobs=self.n_jobs 98 | ) 99 | self.sample_kwargs = kwargs 100 | 101 | def fit(self, X, y): 102 | """Fit the Imputer to the dataset by fitting bayesian and LS model. 103 | 104 | Args: 105 | X (pd.DataFrame): dataset to fit the imputer. 106 | y (pd.Series): response, which is eventually imputed. 107 | 108 | Returns: 109 | self. Instance of the class. 110 | """ 111 | _not_num_series(self.strategy, y) 112 | nc = len(X.columns) 113 | 114 | # get predictions for the data, which will be used for "closest" vals 115 | y_pred = self.lm.fit(X, y).predict(X) 116 | y_df = DataFrame({"y": y, "y_pred": y_pred}) 117 | 118 | # calculate bayes and use appropriate means for alpha and beta priors 119 | # here we specify the point estimates from the linear regression as the 120 | # means for the priors. This will greatly speed up posterior sampling 121 | # and help ensure that convergence occurs 122 | if self.am is None: 123 | self.am = self.lm.intercept_ 124 | if self.bm is None: 125 | self.bm = self.lm.coef_ 126 | 127 | # initialize model for bayesian linear reg. Default vals for priors 128 | # assume data is scaled and centered. Convergence can struggle or fail 129 | # if not the case and proper values for the priors are not specified 130 | # separately, also assumes each beta is normal and "independent" 131 | # while betas likely not independent, this is technically a rule of OLS 132 | with pm.Model() as fit_model: 133 | alpha = pm.Normal("alpha", self.am, self.asd) 134 | beta = pm.Normal("beta", self.bm, self.bsd, shape=nc) 135 | sigma = pm.HalfCauchy("σ", self.sig) 136 | mu = alpha+beta.dot(X.T) 137 | score = pm.Normal("score", mu, sigma, observed=y) 138 | params = {"model": fit_model, "y_obs": y_df} 139 | self.statistics_ = {"param": params, "strategy": self.strategy} 140 | return self 141 | 142 | def impute(self, X): 143 | """Generate imputations using predictions from the fit bayesian model. 144 | 145 | The impute method returns the values for imputation. Missing values 146 | in a given dataset are replaced with the random selection from the LRD 147 | process. Again, LRD imputes actually observed values, and the observed 148 | values are selected by finding the closest least squares predictions 149 | to a given prediction from the bayesian model. 150 | 151 | Args: 152 | X (pd.DataFrame): predictors to determine imputed values. 153 | 154 | Returns: 155 | np.array: imputed dataset.
156 | """ 157 | # check if fitted then predict with least squares 158 | check_is_fitted(self, "statistics_") 159 | model = self.statistics_["param"]["model"] 160 | df = self.statistics_["param"]["y_obs"] 161 | df = df.reset_index(drop=True) 162 | 163 | # generate posterior distribution for alpha, beta coefficients 164 | with model: 165 | tr = pm.sample( 166 | self.sample, 167 | tune=self.tune, 168 | init=self.init, 169 | **self.sample_kwargs 170 | ) 171 | self.trace_ = tr 172 | 173 | # support for pymc - handling InferenceData obj instead of MultiTrace 174 | # we have to compress chains ourselves w/ InferenceData obj (xarray) 175 | post = tr.posterior 176 | alpha_, beta_ = post.alpha.values, post.beta.values 177 | chain, draws, beta_dim = beta_.shape 178 | beta_ = beta_.reshape(chain*draws, beta_dim) 179 | 180 | # sample random alpha from alpha posterior distribution 181 | alpha_bayes = np.random.choice(alpha_.ravel()) 182 | 183 | # get the mean and covariance of the multivariate betas 184 | # betas assumed multivariate normal by linear reg rules 185 | # sample beta w/ cov structure to create realistic variability 186 | beta_means, beta_cov = beta_.mean(0), np.cov(beta_.T) 187 | beta_bayes = np.array(multivariate_normal(beta_means, beta_cov).rvs()) 188 | 189 | # predictions for missing y, using bayes alpha + coeff samples 190 | # use these preds for nearest neighbor search from reg results 191 | # neighbors are nearest from prediction model fit on observed 192 | # imputed values are actual y vals corresponding to nearest neighbors 193 | # therefore, this is a form of "hot-deck" imputation 194 | y_pred_bayes = alpha_bayes + beta_bayes.dot(X.T) 195 | n_ = self.neighbors 196 | if X.columns.size == 1: 197 | y_pred_bayes = y_pred_bayes[0] 198 | if self.fill_value == "mean": 199 | imp = [_local_residuals(x, n_, df, np.mean) for x in y_pred_bayes] 200 | elif self.fill_value == "random": 201 | choice = np.random.choice 202 | imp = [_local_residuals(x, n_, df, choice) for x in y_pred_bayes] 203 | else: 204 | err = f"{self.fill_value} must be `mean` or `random`." 205 | raise ValueError(err) 206 | 207 | # finally, set last class values and return imputations 208 | self.y_pred = y_pred_bayes 209 | self.alphas = alpha_bayes 210 | self.betas = beta_bayes 211 | return imp 212 | 213 | def fit_impute(self, X, y): 214 | """Fit impute method to generate imputations where y is missing. 215 | 216 | Args: 217 | X (pd.Dataframe): predictors in the dataset. 218 | y (pd.Series): response w/ missing values to impute. 219 | 220 | Returns: 221 | np.array: imputed dataset. 222 | """ 223 | # transform occurs with records from X where y is missing 224 | miss_y_ix = y[y.isnull()].index 225 | return self.fit(X, y).impute(X.loc[miss_y_ix]) 226 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/mean.py: -------------------------------------------------------------------------------- 1 | """This module implements mean imputation via the MeanImputer. 2 | 3 | The MeanImputer imputes missing data with the mean of observed data. 4 | Dataframe imputers utilize this class when its strategy is requested. Use 5 | SingleImputer or MultipleImputer with strategy = `mean` to broadcast the 6 | strategy across all the columns in a dataframe, or specify this strategy 7 | for a given column. 
8 | """ 9 | 10 | from sklearn.utils.validation import check_is_fitted 11 | from autoimpute.imputations import method_names 12 | from autoimpute.imputations.errors import _not_num_series 13 | from .base import ISeriesImputer 14 | methods = method_names 15 | # pylint:disable=attribute-defined-outside-init 16 | # pylint:disable=unnecessary-pass 17 | 18 | class MeanImputer(ISeriesImputer): 19 | """Impute missing values with the mean of the observed data. 20 | 21 | This imputer imputes missing values with the mean of observed data. 22 | The imputer can be used directly, but such behavior is discouraged. 23 | MeanImputer does not have the flexibility / robustness of dataframe 24 | imputers, nor is its behavior identical. Preferred use is 25 | MultipleImputer(strategy="mean"). 26 | """ 27 | # class variables 28 | strategy = methods.MEAN 29 | 30 | def __init__(self): 31 | """Create an instance of the MeanImputer class.""" 32 | pass 33 | 34 | def fit(self, X, y): 35 | """Fit the Imputer to the dataset and calculate the mean. 36 | 37 | Args: 38 | X (pd.Series): Dataset to fit the imputer. 39 | y (None): ignored, None to meet requirements of base class 40 | 41 | Returns: 42 | self. Instance of the class. 43 | """ 44 | _not_num_series(self.strategy, X) 45 | mu = X.mean() 46 | self.statistics_ = {"param": mu, "strategy": self.strategy} 47 | return self 48 | 49 | def impute(self, X): 50 | """Perform imputations using the statistics generated from fit. 51 | 52 | The impute method handles the actual imputation. Missing values 53 | in a given dataset are replaced with the respective mean from fit. 54 | 55 | Args: 56 | X (pd.Series): Dataset to impute missing data from fit. 57 | 58 | Returns: 59 | float -- imputed dataset. 60 | """ 61 | # check if fitted then impute with mean 62 | check_is_fitted(self, "statistics_") 63 | _not_num_series(self.strategy, X) 64 | imp = self.statistics_["param"] 65 | return imp 66 | 67 | def fit_impute(self, X, y=None): 68 | """Convenience method to perform fit and imputation in one go.""" 69 | return self.fit(X, y).impute(X) 70 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/median.py: -------------------------------------------------------------------------------- 1 | """This module implements median imputation via the MedianImputer. 2 | 3 | The MedianImputer imputes missing data with the median of observed data. 4 | Dataframe imputers utilize this class when its strategy is requested. Use 5 | SingleImputer or MultipleImputer with strategy = `median` to broadcast the 6 | strategy across all the columns in a dataframe, or specify this strategy 7 | for a given column. 8 | """ 9 | 10 | from sklearn.utils.validation import check_is_fitted 11 | from autoimpute.imputations import method_names 12 | from autoimpute.imputations.errors import _not_num_series 13 | from .base import ISeriesImputer 14 | methods = method_names 15 | # pylint:disable=attribute-defined-outside-init 16 | # pylint:disable=unnecessary-pass 17 | 18 | class MedianImputer(ISeriesImputer): 19 | """Impute missing values with the median of the observed data. 20 | 21 | This imputer imputes missing values with the median of observed data. 22 | The imputer can be used directly, but such behavior is discouraged. 23 | MedianImputer does not have the flexibility / robustness of dataframe 24 | imputers, nor is its behavior identical. Preferred use is 25 | MultipleImputer(strategy="median"). 
26 | """ 27 | # class variables 28 | strategy = methods.MEDIAN 29 | 30 | def __init__(self): 31 | """Create an instance of the MedianImputer class.""" 32 | pass 33 | 34 | def fit(self, X, y=None): 35 | """Fit the Imputer to the dataset and calculate the median. 36 | 37 | Args: 38 | X (pd.Series): Dataset to fit the imputer. 39 | y (None): ignored, None to meet requirements of base class 40 | 41 | Returns: 42 | self. Instance of the class. 43 | """ 44 | _not_num_series(self.strategy, X) 45 | median = X.median() 46 | self.statistics_ = {"param": median, "strategy": self.strategy} 47 | return self 48 | 49 | def impute(self, X): 50 | """Perform imputations using the statistics generated from fit. 51 | 52 | The impute method handles the actual imputation. Missing values 53 | in a given dataset are replaced with the respective median from fit. 54 | 55 | Args: 56 | X (pd.Series): Dataset to impute missing data from fit. 57 | 58 | Returns: 59 | float -- imputed dataset. 60 | """ 61 | # check is fitted then impute with median 62 | check_is_fitted(self, "statistics_") 63 | _not_num_series(self.strategy, X) 64 | imp = self.statistics_["param"] 65 | return imp 66 | 67 | def fit_impute(self, X, y=None): 68 | """Convenience method to perform fit and imputation in one go.""" 69 | return self.fit(X, y).impute(X) 70 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/mode.py: -------------------------------------------------------------------------------- 1 | """This module implements mode imputation via the ModeImputer. 2 | 3 | The ModeImputer uses the mode of observed data to impute missing values. 4 | Dataframe imputers utilize this class when its strategy is requested. Use 5 | SingleImputer or MultipleImputer with strategy = `mode` to broadcast the 6 | strategy across all the columns in a dataframe, or specify this strategy 7 | for a given column. 8 | """ 9 | 10 | import numpy as np 11 | import pandas as pd 12 | from sklearn.utils.validation import check_is_fitted 13 | from autoimpute.imputations import method_names 14 | from .base import ISeriesImputer 15 | methods = method_names 16 | # pylint:disable=attribute-defined-outside-init 17 | 18 | class ModeImputer(ISeriesImputer): 19 | """Impute missing values with the mode of the observed data. 20 | 21 | The mode imputer calculates the mode of the observed dataset and uses 22 | it to impute missing observations. In the case where there are more than 23 | one mode, the user can supply a `fill_strategy` to choose the mode. 24 | The imputer can be used directly, but such behavior is discouraged. 25 | ModeImputer does not have the flexibility / robustness of dataframe 26 | imputers, nor is its behavior identical. Preferred use is 27 | MultipleImputer(strategy="mode"). 28 | """ 29 | # class variables 30 | strategy = methods.MODE 31 | fill_strategies = (None, "first", "last", "random") 32 | 33 | def __init__(self, fill_strategy=None): 34 | """Create an instance of the ModeImputer class. 35 | 36 | Args: 37 | fill_strategy (str, Optional): strategy to pick mode, if multiple. 38 | Default is None, which means first mode taken. 39 | Options include None, first, last, random. 40 | First, None -> select first of modes. 41 | Last -> select the last of modes. 42 | Random -> randomly sample from modes with replacement. 
43 | """ 44 | self.fill_strategy = fill_strategy 45 | 46 | @property 47 | def fill_strategy(self): 48 | """Property getter to return the value of fill_strategy property.""" 49 | return self._fill_strategy 50 | 51 | @fill_strategy.setter 52 | def fill_strategy(self, fs): 53 | """Validate the fill_strategy property and set default parameters. 54 | 55 | Args: 56 | fs (str, None): if None, use first mode. 57 | 58 | Raises: 59 | ValueError: not a valid fill strategy for ModeImputer. 60 | """ 61 | if fs not in self.fill_strategies: 62 | err = f"{fs} not a valid fill strategy for ModeImputer" 63 | raise ValueError(err) 64 | self._fill_strategy = fs 65 | 66 | def fit(self, X, y=None): 67 | """Fit the Imputer to the dataset and calculate the mode. 68 | 69 | Args: 70 | X (pd.Series): Dataset to fit the imputer. 71 | y (None): ignored, None to meet requirements of base class 72 | 73 | Returns: 74 | self. Instance of the class. 75 | """ 76 | mode = X.mode().values 77 | self.statistics_ = {"param": mode, "strategy": self.strategy} 78 | return self 79 | 80 | def impute(self, X): 81 | """Perform imputations using the statistics generated from fit. 82 | 83 | This method handles the actual imputation. Missing values in a given 84 | dataset are replaced with the mode observed from fit. Note that there 85 | can be more than one mode. If more than one mode, use the 86 | fill_strategy to determine how to use the modes. 87 | 88 | Args: 89 | X (pd.Series): Dataset to impute missing data from fit. 90 | 91 | Returns: 92 | float or np.array -- imputed dataset. 93 | """ 94 | # check is fitted and identify locations of missingness 95 | check_is_fitted(self, "statistics_") 96 | ind = X[X.isnull()].index 97 | 98 | # get the number of modes 99 | imp = self.statistics_["param"] 100 | 101 | # default imputation is to pick first, such as scipy does 102 | if self.fill_strategy is None: 103 | imp = imp[0] 104 | 105 | # picking the first of the modes when fill_strategy = first 106 | if self.fill_strategy == "first": 107 | imp = imp[0] 108 | 109 | # picking the last of the modes when fill_strategy = last 110 | if self.fill_strategy == "last": 111 | imp = imp[-1] 112 | 113 | # sampling when strategy is random 114 | if self.fill_strategy == "random": 115 | num_modes = len(imp) 116 | # check if more modes 117 | if num_modes == 1: 118 | imp = imp[0] 119 | else: 120 | samples = np.random.choice(imp, len(ind)) 121 | imp = pd.Series(samples, index=ind).values 122 | 123 | # finally, fill in the right fill values for missing X 124 | return imp 125 | 126 | def fit_impute(self, X, y=None): 127 | """Convenience method to perform fit and imputation in one go.""" 128 | return self.fit(X, y).impute(X) 129 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/norm.py: -------------------------------------------------------------------------------- 1 | """This module implements norm imputation via the NormImputer. 2 | 3 | The NormImputer imputes missing data with random draws from a construted 4 | normal distribution. Dataframe imputers utilize this class when its strategy 5 | is requested. Use SingleImputer or MultipleImputer with strategy = `norm` to 6 | broadcast the strategy across all the columns in a dataframe, or specify this 7 | strategy for a given column. 
8 | """ 9 | 10 | from scipy.stats import norm 11 | from sklearn.utils.validation import check_is_fitted 12 | from autoimpute.imputations import method_names 13 | from autoimpute.imputations.errors import _not_num_series 14 | from .base import ISeriesImputer 15 | methods = method_names 16 | # pylint:disable=attribute-defined-outside-init 17 | # pylint:disable=unnecessary-pass 18 | 19 | class NormImputer(ISeriesImputer): 20 | """Impute missing data with draws from normal distribution. 21 | 22 | The NormImputer constructs a normal distribution using the sample mean and 23 | variance of the observed data. The imputer then randomly samples from this 24 | distribution to impute missing data. The imputer can be used directly, but 25 | such behavior is discouraged. NormImputer does not have the flexibility / 26 | robustness of dataframe imputers, nor is its behavior identical. 27 | Preferred use is MultipleImputer(strategy="norm"). 28 | """ 29 | # class variables 30 | strategy = methods.NORM 31 | 32 | def __init__(self): 33 | """Create an instance of the NormImputer class.""" 34 | pass 35 | 36 | def fit(self, X, y=None): 37 | """Fit Imputer to dataset and calculate mean and sample variance. 38 | 39 | Args: 40 | X (pd.Series): Dataset to fit the imputer. 41 | y (None): ignored, None to meet requirements of base class 42 | 43 | Returns: 44 | self. Instance of the class. 45 | """ 46 | 47 | # get the moments for the normal distribution of feature X 48 | _not_num_series(self.strategy, X) 49 | moments = (X.mean(), X.std()) 50 | self.statistics_ = {"param": moments, "strategy": self.strategy} 51 | return self 52 | 53 | def impute(self, X): 54 | """Perform imputations using the statistics generated from fit. 55 | 56 | The transform method handles the actual imputation. It constructs a 57 | normal distribution using the sample mean and variance from fit. 58 | It then imputes missing values with a random draw from the respective 59 | distribution. 60 | 61 | Args: 62 | X (pd.Series): Dataset to impute missing data from fit. 63 | 64 | Returns: 65 | np.array -- imputed dataset. 66 | """ 67 | 68 | # check if fitted and identify location of missingness 69 | check_is_fitted(self, "statistics_") 70 | _not_num_series(self.strategy, X) 71 | ind = X[X.isnull()].index 72 | 73 | # create normal distribution and sample from it 74 | imp_mean, imp_std = self.statistics_["param"] 75 | imp = norm(imp_mean, imp_std).rvs(size=len(ind)) 76 | return imp 77 | 78 | def fit_impute(self, X, y): 79 | """Convenience method to perform fit and imputation in one go.""" 80 | return self.fit(X, y=None).impute(X) 81 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/norm_unit_variance.py: -------------------------------------------------------------------------------- 1 | """This module implements normal imputation with constant unit variance single imputation 2 | via the NormUnitVarianceImputer. 3 | 4 | The NormUnitVarianceImputer imputes missing data assuming that the 5 | single column is normally distributed with a-priori known constant unit 6 | variance. Use SingleImputer or MultipleImputer with strategy=`norm_const_variance` 7 | to broadcast the strategy across all the columns in a dataframe, 8 | or specify this strategy for a given column. 
9 | """ 10 | 11 | from scipy import stats 12 | import pandas as pd 13 | import numpy as np 14 | from sklearn.utils.validation import check_is_fitted 15 | from autoimpute.imputations import method_names 16 | from autoimpute.imputations.errors import _not_num_series 17 | from .base import ISeriesImputer 18 | methods = method_names 19 | # pylint:disable=attribute-defined-outside-init 20 | # pylint:disable=unnecessary-pass 21 | 22 | class NormUnitVarianceImputer(ISeriesImputer): 23 | """Impute missing values assuming normally distributed 24 | data with unknown mean and *known* variance. 25 | """ 26 | # class variables 27 | strategy = methods.NORM_UNIT_VARIANCE 28 | 29 | def __init__(self): 30 | """Create an instance of the NormUnitVarianceImputer class.""" 31 | pass 32 | 33 | def fit(self, X, y): 34 | """Fit the Imputer to the dataset and calculate the mean. 35 | 36 | Args: 37 | X (pd.Series): Dataset to fit the imputer. 38 | y (None): ignored, None to meet requirements of base class 39 | 40 | Returns: 41 | self. Instance of the class. 42 | """ 43 | _not_num_series(self.strategy, X) 44 | mu = X.mean() # mean of observed data 45 | self.statistics_ = {"param": mu, "strategy": self.strategy} 46 | return self 47 | 48 | def impute(self, X): 49 | """Perform imputations using the statistics generated from fit. 50 | 51 | The impute method handles the actual imputation. Missing values 52 | in a given dataset are replaced with the respective mean from fit. 53 | 54 | Args: 55 | X (pd.Series): Dataset to impute missing data from fit. 56 | 57 | Returns: 58 | np.array -- imputed dataset. 59 | """ 60 | # check if fitted then impute with mean 61 | check_is_fitted(self, "statistics_") 62 | _not_num_series(self.strategy, X) 63 | omu = self.statistics_["param"] # mean of observed data 64 | idx = X.isnull() # missing data 65 | nO = sum(~idx) # number of observed 66 | m = sum(idx) # number to impute 67 | muhatk = stats.norm(omu,np.sqrt(1/nO)) 68 | # imputation cross-terms *NOT* uncorrelated 69 | Ymi=stats.multivariate_normal(np.ones(m)*muhatk.rvs(), 70 | np.ones((m,m))/nO+np.eye(m)).rvs() 71 | out = X.copy() 72 | out[idx] = Ymi 73 | return out 74 | 75 | def fit_impute(self, X, y=None): 76 | """Convenience method to perform fit and imputation in one go.""" 77 | return self.fit(X, y).impute(X) 78 | 79 | if __name__ == '__main__': 80 | from autoimpute.imputations import SingleImputer 81 | si=SingleImputer('normal unit variance') 82 | Yo=stats.norm(0,1).rvs(100) 83 | df = pd.DataFrame(columns=['Yo'],index=range(200),dtype=float) 84 | df.loc[range(100),'Yo'] = Yo 85 | si.fit_transform(df) 86 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/pmm.py: -------------------------------------------------------------------------------- 1 | """This module implements predictive mean matching via the PMMImputer. 2 | 3 | This module contains the PMMImputer, which implements predictive mean matching 4 | to impute missing values. Predictive mean matching is a semi-supervised, 5 | hot-deck technique to impute missing values. Dataframe imputers utilize this 6 | class when its strategy is requested. Use SingleImputer or MultipleImputer 7 | with strategy = `pmm` to broadcast the strategy across all the columns in a 8 | dataframe, or specify this strategy for a given column. 
9 | """ 10 | 11 | import numpy as np 12 | import pymc as pm 13 | from pandas import DataFrame 14 | from scipy.stats import multivariate_normal 15 | from sklearn.linear_model import LinearRegression 16 | from sklearn.utils.validation import check_is_fitted 17 | from autoimpute.imputations import method_names 18 | from autoimpute.imputations.helpers import _neighbors 19 | from autoimpute.imputations.errors import _not_num_series 20 | from .base import ISeriesImputer 21 | methods = method_names 22 | # pylint:disable=attribute-defined-outside-init 23 | # pylint:disable=no-member 24 | # pylint:disable=unused-variable 25 | 26 | class PMMImputer(ISeriesImputer): 27 | """Impute missing values using predictive mean matching. 28 | 29 | The PMMIMputer produces predictions using a combination of bayesian 30 | approach to least squares and least squares itself. For each missing value 31 | PMM finds the `n` closest neighbors from a least squares regression 32 | prediction set, and samples from the corresponding true values for y of 33 | each of those `n` predictions. The imputation is the resulting sample. 34 | To implement bayesian least squares, the imputer utlilizes the pymc 35 | library. The imputer can be used directly, but such behavior is 36 | discouraged. PmmImputer does not have the flexibility / robustness of 37 | dataframe imputers, nor is its behavior identical. Preferred use is 38 | MultipleImputer(strategy="pmm"). 39 | """ 40 | # class variables 41 | strategy = methods.PMM 42 | 43 | def __init__(self, **kwargs): 44 | """Create an instance of the PMMImputer class. 45 | 46 | The class requires multiple arguments necessary to create priors for 47 | a bayesian linear regression equation and least squares itself. 48 | Therefore, PMM arguments include all of those seen in bayesian least 49 | squares and least squares itself. New parameters include `neighbors`, 50 | or the number of neighbors that PMM uses to sample observed. 51 | 52 | Args: 53 | **kwargs: default keyword arguments for lm & bayesian analysis. 54 | Note - kwargs popped for default arguments defined below. 55 | Next set of kwargs popped and sent to linear regression. 56 | Rest of kwargs passed as params to sampling (see pymc). 57 | am (float, Optional): mean of alpha prior. Default 0. 58 | asd (float, Optional): std. deviation of alpha prior. Default 10. 59 | bm (float, Optional): mean of beta priors. Default 0. 60 | bsd (float, Optional): std. deviation of beta priors. Default 10. 61 | sig (float, Optional): parameter of sigma prior. Default 1. 62 | sample (int, Optional): number of posterior samples per chain. 63 | Default = 1000. More samples, longer to run, but better 64 | approximation of the posterior & chance of convergence. 65 | tune (int, Optional): parameter for tuning. Draws done in addition 66 | to sample. Default = 1000. 67 | init (str, Optional): MCMC algo to use for posterior sampling. 68 | Default = 'auto'. See pymc docs for more info on choices. 69 | fill_value (str, Optional): How to draw from the posterior to 70 | create imputations. Default is "random". 'random' and 'mean' 71 | supported for explicit options. 72 | neighbors (int, Optional): number of neighbors. Default is 5. 73 | Value should be greater than 0 and less than # observed, 74 | although anything greater than 10-20 generally too high 75 | unless dataset is massive. 76 | fit_intercept (bool, Optional): sklearn LinearRegression param. 77 | normalize (bool, Optional): sklearn LinearRegression param. 
78 | copy_x (bool, Optional): sklearn LinearRegression param. 79 | n_jobs (int, Optional): sklearn LinearRegression param. 80 | """ 81 | self.am = kwargs.pop("am", None) 82 | self.asd = kwargs.pop("asd", 10) 83 | self.bm = kwargs.pop("bm", None) 84 | self.bsd = kwargs.pop("bsd", 10) 85 | self.sig = kwargs.pop("sig", 1) 86 | self.sample = kwargs.pop("sample", 1000) 87 | self.tune = kwargs.pop("tune", 1000) 88 | self.init = kwargs.pop("init", "auto") 89 | self.fill_value = kwargs.pop("fill_value", "random") 90 | self.neighbors = kwargs.pop("neighbors", 5) 91 | self.fit_intercept = kwargs.pop("fit_intercept", True) 92 | self.copy_x = kwargs.pop("copy_x", True) 93 | self.n_jobs = kwargs.pop("n_jobs", None) 94 | self.lm = LinearRegression( 95 | fit_intercept=self.fit_intercept, 96 | copy_X=self.copy_x, 97 | n_jobs=self.n_jobs 98 | ) 99 | self.sample_kwargs = kwargs 100 | 101 | def fit(self, X, y): 102 | """Fit the Imputer to the dataset by fitting bayesian and LS model. 103 | 104 | Args: 105 | X (pd.DataFrame): dataset to fit the imputer. 106 | y (pd.Series): response, which is eventually imputed. 107 | 108 | Returns: 109 | self. Instance of the class. 110 | """ 111 | _not_num_series(self.strategy, y) 112 | nc = len(X.columns) 113 | 114 | # get predictions for the data, which will be used for "closest" vals 115 | y_pred = self.lm.fit(X, y).predict(X) 116 | y_df = DataFrame({"y": y, "y_pred": y_pred}) 117 | 118 | # calculate bayes and use appropriate means for alpha and beta priors 119 | # here we specify the point estimates from the linear regression as the 120 | # means for the priors. This will greatly speed up posterior sampling 121 | # and help ensure that convergence occurs 122 | if self.am is None: 123 | self.am = self.lm.intercept_ 124 | if self.bm is None: 125 | self.bm = self.lm.coef_ 126 | 127 | # initialize model for bayesian linear reg. Default vals for priors 128 | # assume data is scaled and centered. Convergence can struggle or fail 129 | # if that is not the case and proper values for the priors are not 130 | # specified separately. The model also assumes each beta is normal and 131 | # "independent"; while betas are likely not independent, this mirrors standard OLS assumptions 132 | with pm.Model() as fit_model: 133 | alpha = pm.Normal("alpha", self.am, self.asd) 134 | beta = pm.Normal("beta", self.bm, self.bsd, shape=nc) 135 | sigma = pm.HalfCauchy("σ", self.sig) 136 | mu = alpha+beta.dot(X.T) 137 | score = pm.Normal("score", mu, sigma, observed=y) 138 | params = {"model": fit_model, "y_obs": y_df} 139 | self.statistics_ = {"param": params, "strategy": self.strategy} 140 | return self 141 | 142 | def impute(self, X): 143 | """Generate imputations using predictions from the fit bayesian model. 144 | 145 | The impute method returns the values for imputation. Missing values 146 | in a given dataset are replaced with a random selection from the PMM 147 | process. Again, PMM imputes values actually observed in the data, and the observed 148 | values are selected by finding the closest least squares predictions 149 | to a given prediction from the bayesian model. 150 | 151 | Args: 152 | X (pd.DataFrame): predictors to determine imputed values. 153 | 154 | Returns: 155 | np.array: imputed dataset.
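Example:
    A minimal sketch of direct use (discouraged in practice) with a
    custom number of neighbors; ``X_obs`` and ``y_obs`` are hypothetical
    observed predictors and response, and ``X_mis`` holds the rows
    where the response is missing:

        imputer = PMMImputer(neighbors=10)
        imputed = imputer.fit(X_obs, y_obs).impute(X_mis)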
156 | """ 157 | # check if fitted then predict with least squares 158 | check_is_fitted(self, "statistics_") 159 | model = self.statistics_["param"]["model"] 160 | df = self.statistics_["param"]["y_obs"] 161 | df = df.reset_index(drop=True) 162 | 163 | # generate posterior distribution for alpha, beta coefficients 164 | with model: 165 | tr = pm.sample( 166 | self.sample, 167 | tune=self.tune, 168 | init=self.init, 169 | **self.sample_kwargs 170 | ) 171 | self.trace_ = tr 172 | 173 | # support for pymc - handling InferenceData obj instead of MultiTrace 174 | # we have to compress chains ourselves w/ InferenceData obj (xarray) 175 | post = tr.posterior 176 | alpha_, beta_ = post.alpha.values, post.beta.values 177 | chain, draws, beta_dim = beta_.shape 178 | beta_ = beta_.reshape(chain*draws, beta_dim) 179 | 180 | # sample random alpha from alpha posterior distribution 181 | alpha_bayes = np.random.choice(alpha_.ravel()) 182 | 183 | # get the mean and covariance of the multivariate betas 184 | # betas assumed multivariate normal by linear reg rules 185 | # sample beta w/ cov structure to create realistic variability 186 | beta_means, beta_cov = beta_.mean(0), np.cov(beta_.T) 187 | beta_bayes = np.array(multivariate_normal(beta_means, beta_cov).rvs()) 188 | 189 | # predictions for missing y, using bayes alpha + coeff samples 190 | # use these preds for nearest neighbor search from reg results 191 | # neighbors are nearest from prediction model fit on observed 192 | # imputed values are actual y vals corresponding to nearest neighbors 193 | # therefore, this is a form of "hot-deck" imputation 194 | y_pred_bayes = alpha_bayes + beta_bayes.dot(X.T) 195 | n_ = self.neighbors 196 | if X.columns.size == 1: 197 | y_pred_bayes = y_pred_bayes[0] 198 | if self.fill_value == "mean": 199 | imp = [_neighbors(x, n_, df, np.mean) for x in y_pred_bayes] 200 | elif self.fill_value == "random": 201 | choice = np.random.choice 202 | imp = [_neighbors(x, n_, df, choice) for x in y_pred_bayes] 203 | else: 204 | err = f"{self.fill_value} must be `mean` or `random`." 205 | raise ValueError(err) 206 | 207 | # finally, set last class values and return imputations 208 | self.y_pred = y_pred_bayes 209 | self.alphas = alpha_bayes 210 | self.betas = beta_bayes 211 | return imp 212 | 213 | def fit_impute(self, X, y): 214 | """Fit impute method to generate imputations where y is missing. 215 | 216 | Args: 217 | X (pd.Dataframe): predictors in the dataset. 218 | y (pd.Series): response w/ missing values to impute. 219 | 220 | Returns: 221 | np.array: imputed dataset. 222 | """ 223 | # transform occurs with records from X where y is missing 224 | miss_y_ix = y[y.isnull()].index 225 | return self.fit(X, y).impute(X.loc[miss_y_ix]) 226 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/random.py: -------------------------------------------------------------------------------- 1 | """This module implements random imputation via the RandomImputer. 2 | 3 | The RandomImputer imputes missing data using a random draw with replacement 4 | from the observed data. Dataframe imputers utilize this class when its 5 | strategy is requested. Use SingleImputer or MultipleImputer with 6 | strategy = `random` to broadcast the strategy across all the columns in a 7 | dataframe, or specify this strategy for a given column. 
8 | """ 9 | 10 | import numpy as np 11 | from sklearn.utils.validation import check_is_fitted 12 | from autoimpute.imputations import method_names 13 | from .base import ISeriesImputer 14 | methods = method_names 15 | # pylint:disable=attribute-defined-outside-init 16 | # pylint:disable=unnecessary-pass 17 | 18 | class RandomImputer(ISeriesImputer): 19 | """Impute missing data using random draws from observed data. 20 | 21 | The RandomImputer samples with replacement from observed data. The imputer 22 | can be used directly, but such behavior is discouraged. RandomImputer does 23 | not have the flexibility / robustness of dataframe imputers, nor is its 24 | behavior identical. Preferred use is MultipleImputer(strategy="random"). 25 | """ 26 | # class variables 27 | strategy = methods.RANDOM 28 | 29 | def __init__(self): 30 | """Create an instance of the RandomImputer class.""" 31 | pass 32 | 33 | def fit(self, X, y=None): 34 | """Fit the Imputer to the dataset and get unique observed to sample. 35 | 36 | Args: 37 | X (pd.Series): Dataset to fit the imputer. 38 | y (None): ignored, None to meet requirements of base class 39 | 40 | Returns: 41 | self. Instance of the class. 42 | """ 43 | 44 | # determine set of observed values to sample from 45 | random = list(set(X[~X.isnull()])) 46 | self.statistics_ = {"param": random, "strategy": self.strategy} 47 | return self 48 | 49 | def impute(self, X): 50 | """Perform imputations using the statistics generated from fit. 51 | 52 | The transform method handles the actual imputation. Each missing value 53 | in a given dataset is replaced with a random draw from unique set of 54 | observed values determined during the fit stage. 55 | 56 | Args: 57 | X (pd.Series): Dataset to impute missing data from fit. 58 | 59 | Returns: 60 | np.array -- imputed dataset 61 | """ 62 | # check if fitted and identify location of missingness 63 | check_is_fitted(self, "statistics_") 64 | ind = X[X.isnull()].index 65 | 66 | # get the observed values and sample from them 67 | param = self.statistics_["param"] 68 | imp = np.random.choice(param, len(ind)) 69 | return imp 70 | 71 | def fit_impute(self, X, y=None): 72 | """Convenience method to perform fit and imputation in one go.""" 73 | return self.fit(X, y).impute(X) 74 | -------------------------------------------------------------------------------- /autoimpute/utils/__init__.py: -------------------------------------------------------------------------------- 1 | """Manage the utils lib from the autoimpute package. 2 | 3 | This module handles imports from the utils directory that should be accessible 4 | whenever someone imports autoimpute.utils. The imports include methods for 5 | checks & validations as well as functions to explore patterns in missing data. 6 | 7 | This module handles `from autoimpute.utils import *` with the __all__ variable 8 | below. This command imports the main public methods from autoimpute.utils. 
9 | """ 10 | 11 | from .checks import check_data_structure, check_missingness 12 | from .checks import check_nan_columns, check_strategy_allowed 13 | from .checks import check_strategy_fit, check_predictors_fit 14 | from .patterns import md_pairs, md_pattern, md_locations 15 | from .patterns import inbound, outbound, influx, outflux, flux 16 | from .patterns import proportions, nullility_cov, nullility_corr 17 | 18 | __all__ = [ 19 | "check_data_structure", 20 | "check_missingness", 21 | "check_nan_columns", 22 | "check_strategy_allowed", 23 | "check_strategy_fit", 24 | "check_predictors_fit", 25 | "md_pairs", 26 | "md_pattern", 27 | "md_locations", 28 | "inbound", 29 | "outbound", 30 | "influx", 31 | "outflux", 32 | "flux", 33 | "proportions", 34 | "nullility_cov", 35 | "nullility_corr" 36 | ] 37 | -------------------------------------------------------------------------------- /autoimpute/utils/dataframes.py: -------------------------------------------------------------------------------- 1 | """Sample DataFrames utilized for examples and tests.""" 2 | 3 | import numpy as np 4 | import pandas as pd 5 | from sklearn.datasets import make_regression 6 | # pylint:disable=missing-docstring 7 | 8 | # missingness lambdas 9 | eq_miss = lambda x: np.random.choice([x, np.nan], 1)[0] 10 | val_miss = lambda x: np.random.choice([x, np.nan], 1, p=[x/100, 1-x/100])[0] 11 | 12 | # strategies used for imputation 13 | num_strategies = ["mean", "median", "mode", "random", "norm", "interpolate",'normal unit variance'] 14 | cat_strategies = ["mode", "categorical"] 15 | time_strategies = ["interpolate", "locf", "nocb"] 16 | 17 | # Numerical DataFrame with different % missingness per column 18 | df_num = pd.DataFrame() 19 | df_num["A"] = np.random.choice(np.arange(90, 100), 1000) 20 | df_num["B"] = np.random.choice(np.arange(50, 100), 1000) 21 | df_num["C"] = np.random.choice(np.arange(1, 100), 1000) 22 | df_num["A"] = df_num["A"].apply(eq_miss) 23 | df_num["B"] = df_num["B"].apply(val_miss) 24 | 25 | # Numerical DataFrame with column names for missingness classifier 26 | df_mis_classifier = pd.DataFrame() 27 | df_mis_classifier["a"] = np.random.choice(np.arange(90, 100), 1000) 28 | df_mis_classifier["k"] = np.random.choice(np.arange(50, 100), 1000) 29 | df_mis_classifier["c"] = np.random.choice(np.arange(1, 100), 1000) 30 | df_mis_classifier["a"] = df_mis_classifier["a"].apply(eq_miss) 31 | df_mis_classifier["k"] = df_mis_classifier["k"].apply(val_miss) 32 | 33 | # Mixed DataFrame with different % missingness per column & some dependence 34 | df_mix = pd.DataFrame() 35 | df_mix["gender"] = np.random.choice(["Male", "Female", None], 500) 36 | df_mix["salary"] = np.random.choice(np.arange(20, 100), 500) 37 | df_mix["age"] = np.random.choice( 38 | [10, 20, 30, 40, 50, 60, 70], 500, 39 | p=[0.1, 0.1, 0.2, 0.2, 0.2, 0.1, 0.1] 40 | ) 41 | df_mix["amm"] = np.random.choice(np.arange(0, 100), 500) 42 | 43 | for each in df_mix.index: 44 | s = df_mix.loc[each, "salary"] 45 | if df_mix.loc[each, "age"] > 50: 46 | s = np.random.choice([s, np.nan], p=[0.3, 0.7]) 47 | elif df_mix.loc[each, "age"] < 30: 48 | s = np.random.choice([s, np.nan], p=[0.4, 0.6]) 49 | 50 | # DataFrame with numerical feature, `np.nan` column, `None` column 51 | df_col_miss = pd.DataFrame() 52 | df_col_miss["A"] = np.random.choice(np.arange(90, 100), 1000) 53 | df_col_miss["A"] = df_col_miss["A"].apply(eq_miss) 54 | df_col_miss["B"] = np.random.choice([np.nan], 1000) 55 | df_col_miss["C"] = np.random.choice([None], 1000) 56 | 57 | # DataFrame to test 
all missing 58 | df_all_miss = pd.DataFrame({ 59 | "A":[None, None], 60 | "B": [np.nan, np.nan] 61 | }) 62 | 63 | # DataFrame to test time series 64 | df_ts_num = pd.DataFrame({ 65 | "date": pd.to_datetime(['2018-01-04', '2018-01-05', '2018-01-06', 66 | '2018-01-07', '2018-01-08', '2018-01-09']), 67 | "date_tm1": pd.to_datetime(['2017-01-04', '2017-01-05', '2017-01-06', 68 | '2017-01-07', '2017-01-08', '2017-01-09']), 69 | "values": [np.nan, 20, np.nan, 30, 40, np.nan], 70 | "values_tm1": [np.nan, 15, np.nan, 25, 50, 70] 71 | }) 72 | 73 | # DataFrame to test default with added time column and cat column 74 | df_ts_mixed = pd.DataFrame({ 75 | "date": pd.to_datetime(['2018-01-04', '2018-01-05', '2018-01-06', 76 | '2018-01-07', '2018-01-08', '2018-01-09']), 77 | "values": [271238, 329285, np.nan, 260260, 263711, np.nan], 78 | "cats": ["red", None, "green", "green", "red", "green"] 79 | }) 80 | 81 | # bayesian regression testing 82 | sc = lambda x, d: (x-d.mean())/d.std() 83 | mis = lambda x: np.random.choice([x, np.nan], 1, p=[0.8, 0.2])[0] 84 | X, y = make_regression(n_samples=1000, n_features=3, noise=0.50) 85 | df_bayes_reg = pd.DataFrame({"x1": X[:, 0], "x2": X[:, 1], 86 | "x3": X[:, 2], "y": y}) 87 | df_bayes_reg.x1 = df_bayes_reg.x1.apply(lambda x: sc(x, df_bayes_reg.x1)) 88 | df_bayes_reg.x2 = df_bayes_reg.x2.apply(lambda x: sc(x, df_bayes_reg.x2)) 89 | df_bayes_reg.x3 = df_bayes_reg.x3.apply(lambda x: sc(x, df_bayes_reg.x3)) 90 | df_bayes_reg.y = df_bayes_reg.y.apply(mis) 91 | 92 | # bayesian logistic testing 93 | def trans_binary(c): 94 | m = df_bayes_log.y.mean() 95 | if not pd.isnull(c): 96 | if c > m: 97 | return "male" 98 | return "female" 99 | 100 | df_bayes_log = df_bayes_reg.copy() 101 | df_bayes_log.y = df_bayes_log.y.apply(trans_binary) 102 | 103 | # partial dependence test 104 | df_partial_dependence = pd.DataFrame( 105 | {'A':np.random.uniform(0,1,100), 106 | 'B':np.random.uniform(0,1,100)} 107 | ) 108 | df_partial_dependence.loc[df_partial_dependence['B'] < 0.25, 'B'] = np.nan 109 | df_partial_dependence['C'] = df_partial_dependence['B'] * 2 110 | df_partial_dependence.loc[df_partial_dependence['C'] < 0.7, 'C'] = np.nan 111 | -------------------------------------------------------------------------------- /autoimpute/utils/helpers.py: -------------------------------------------------------------------------------- 1 | """Helper functions used throughout other methods in autoimpute.utils.""" 2 | 3 | import warnings 4 | import numpy as np 5 | import pandas as pd 6 | 7 | def _sq_output(data, cols, square=False): 8 | """Private method to turn unlabeled data into a DataFrame.""" 9 | if not isinstance(data, pd.DataFrame): 10 | data = pd.DataFrame(data, columns=cols) 11 | if square: 12 | data.index = data.columns 13 | return data 14 | 15 | def _index_output(data, index): 16 | """Private method to transform data to DataFrame and set the index.""" 17 | if not isinstance(data, pd.DataFrame): 18 | data = pd.DataFrame(data, index=index) 19 | return data 20 | 21 | def _nan_col_dropper(data): 22 | """Private method to drop columns w/ all missing values from DataFrame.""" 23 | cb = set(data.columns.tolist()) 24 | data.dropna(axis=1, how='all', inplace=True) 25 | ca = set(data.columns.tolist()) 26 | cdiff = cb.difference(ca) 27 | if cdiff: 28 | wrn = f"{cdiff} dropped from DataFrame because all values missing."
29 | warnings.warn(wrn) 30 | return data, cdiff 31 | 32 | def _one_hot_encode(X, used_columns=None): 33 | """Private method to handle one hot encoding for categoricals.""" 34 | cats = X.select_dtypes(include=("object",)).columns.size 35 | if cats > 0: 36 | X_temp = pd.get_dummies(X, drop_first=True, dtype=float) 37 | if used_columns is None: 38 | used_columns = X_temp.columns 39 | if len(X_temp.columns) != len(used_columns): 40 | one_hot = pd.get_dummies(X, dtype=float) 41 | # if wasn't in `used_columns`, then it's the first category 42 | to_drop = set(one_hot.columns).difference(used_columns) 43 | one_hot.drop(to_drop, axis=1, inplace=True) 44 | # if wasn't in `one_hot`, there were no instances of this category 45 | to_add = set(used_columns).difference(one_hot.columns) 46 | X_temp = one_hot.assign(**{col:0 for col in to_add}) 47 | X_temp = X_temp.reindex(columns=used_columns, copy=False) 48 | X = X_temp 49 | return X 50 | -------------------------------------------------------------------------------- /autoimpute/visuals/__init__.py: -------------------------------------------------------------------------------- 1 | """Manage the visuals lib from the autoimpute package. 2 | 3 | This module handles imports from the visuals directory that should be 4 | accessible whenever someone imports autoimpute.visuals. The imports include 5 | methods for visual analysis of missing data, from exploration to analysis. 6 | 7 | This module handles `from autoimpute.visuals import *` with the __all__ 8 | variable below. This command imports the main public methods from 9 | autoimpute.visuals. 10 | """ 11 | 12 | from .utils import plot_md_locations, plot_md_percent 13 | from .utils import plot_nullility_corr, plot_nullility_dendogram 14 | from .imputations import plot_imp_scatter, plot_imp_dists, plot_imp_boxplots 15 | from .imputations import plot_imp_swarm, plot_imp_strip 16 | 17 | __all__ = [ 18 | "plot_md_locations", 19 | "plot_md_percent", 20 | "plot_nullility_corr", 21 | "plot_nullility_dendogram", 22 | "plot_imp_scatter", 23 | "plot_imp_dists", 24 | "plot_imp_boxplots", 25 | "plot_imp_swarm", 26 | "plot_imp_strip" 27 | ] 28 | -------------------------------------------------------------------------------- /autoimpute/visuals/helpers.py: -------------------------------------------------------------------------------- 1 | """Helper functions used throughout other methods in autoimpute.visuals.""" 2 | 3 | import pandas as pd 4 | import seaborn as sns 5 | from autoimpute.imputations import MultipleImputer 6 | 7 | #pylint:disable=unnecessary-lambda 8 | 9 | def _fully_complete(data): 10 | """Private method to exit plotting and raise error if no data missing.""" 11 | if not pd.isnull(data).sum().any(): 12 | err = "No data is missing in any column. Cannot generate plot." 13 | raise ValueError(err) 14 | 15 | def _default_plot_args(**kwargs): 16 | """Private method to set up the default plot style arguments.""" 17 | rc = {} 18 | rc["figure.figsize"] = kwargs.pop("figsize", (12, 8)) 19 | context = kwargs.pop("context", "talk") 20 | sns.set(context=context, rc=rc) 21 | 22 | def _validate_data(d, mi, imp_col=None): 23 | """Private helper method to validate data vs multiple imputations. 24 | 25 | Args: 26 | d (list): dataset returned from multiple imputation. 27 | mi (MultipleImputer): multiple imputer used to generate d. 28 | imp_col (str): column to plot. Should be a column with imputations. 29 | 30 | Raises: 31 | ValueError: d should be list of tuples returned from mi transform.
32 | ValueError: mi should be instance of MultipleImputer used to produce d. 33 | ValueError: mi should have imputed_ attribute after transformation. 34 | ValueError: Number of imputations should equal length of the dataset. 35 | ValueError: Columns in each imputed dataset should be the same. 36 | ValueError: Columns in each imputed dataset should be the same as mi.imputed_. 37 | ValueError: imp_col must be in both datasets and mi.imputed_ keys. 38 | """ 39 | 40 | if not isinstance(d, list): 41 | err = "d should be list of tuples returned from mi transform." 42 | raise ValueError(err) 43 | 44 | if not isinstance(mi, MultipleImputer): 45 | err = "mi should be instance of MultipleImputer used to produce d." 46 | raise ValueError(err) 47 | 48 | if not hasattr(mi, "imputed_"): 49 | err = "mi should have imputed_ attribute after transformation." 50 | raise ValueError(err) 51 | 52 | # names of values from expressions we need to test 53 | num_imps = len(d) 54 | num_mi = mi.n 55 | imp_cols = set(mi.imputed_.keys()) 56 | sets_ = [set(d[i][1].columns.tolist()) for i in range(len(d))] 57 | diff_d = any(list(map(lambda i: sets_[0].difference(i), sets_[1:]))) 58 | diff_m = len(sets_[0].difference(imp_cols)) 59 | 60 | if num_imps != num_mi: 61 | err = "Number of imputations should equal length of the dataset." 62 | raise ValueError(err) 63 | 64 | if diff_d: 65 | err = "Columns in each imputed dataset should be the same." 66 | raise ValueError(err) 67 | 68 | if diff_m > 0: 69 | err = "Columns w/in each imputed dataset should be the same as mi.imputed_" 70 | raise ValueError(err) 71 | 72 | if imp_col is not None: 73 | imp_in_d = imp_col in sets_[0] 74 | imp_in_mi = imp_col in imp_cols 75 | if not all([imp_in_d, imp_in_mi]): 76 | err = "imp_col must be in both datasets and mi.imputed_ keys." 77 | raise ValueError(err) 78 | 79 | def _get_observed(d, mi, imp_col): 80 | """Private helper method to get observed data after imputation.""" 81 | _validate_data(d, mi, imp_col) 82 | obs = set(d[0][1].index).difference(mi.imputed_[imp_col]) 83 | return list(obs) 84 | 85 | def _plot_imp_dists_helper(d, hist_imputed, imp_col, ax=None, l="Imputed"): 86 | """Private helper method to plot distribution of imputed data.""" 87 | for each in d: 88 | sns.distplot( 89 | each[1][imp_col], hist=hist_imputed, ax=ax, 90 | label=f"{imp_col} Imp {each[0]}" 91 | ).set(xlabel=l) 92 | 93 | def _validate_kwgs(kwgs): 94 | """Private helper method to validate kwargs arguments.""" 95 | if not isinstance(kwgs, (type(None), dict)): 96 | err = "kwgs must be None or a dictionary of keyword arguments" 97 | raise ValueError(err) 98 | 99 | def _melt_df(d, mi, imp_col): 100 | """Private helper method to melt dataframe vertically.""" 101 | datasets_added = [] 102 | for each in d: 103 | e = each[1].copy() 104 | e["imp_num"] = f"Imp {each[0]}" 105 | e["imputed"] = "no" 106 | imputed = mi.imputed_[imp_col] 107 | e.loc[imputed, "imputed"] = "yes" 108 | datasets_added.append(e) 109 | datasets_merged = pd.concat(datasets_added) 110 | return datasets_merged 111 | -------------------------------------------------------------------------------- /autoimpute/visuals/utils.py: -------------------------------------------------------------------------------- 1 | """Visualizations to explore missingness within a dataset. 2 | 3 | This module is a wrapper around the excellent missingno library, which 4 | provides a number of plots to explore missingness within a dataset.
This 5 | wrapper handles some basic plot style setting and error handling for the user 6 | that missingno handles differently. The reason we wrap missingno is to fine 7 | tune the package and apply it directly to autoimpute. 8 | """ 9 | 10 | import missingno as msno 11 | from autoimpute.utils import check_data_structure 12 | from .helpers import _fully_complete, _default_plot_args 13 | 14 | @check_data_structure 15 | def plot_md_locations(data, **kwargs): 16 | """Plot the locations where data is missing within a DataFrame. 17 | 18 | Args: 19 | data (pd.DataFrame): DataFrame to plot. 20 | **kwargs: Keyword arguments for plot. Passed to missingno.matrix. 21 | 22 | Returns: 23 | matplotlib.axes._subplots.AxesSubplot: missingness location plot. 24 | 25 | Raises: 26 | TypeError: if data is not a DataFrame. Error raised through decorator. 27 | """ 28 | _default_plot_args(**kwargs) 29 | msno.matrix(data, **kwargs) 30 | 31 | @check_data_structure 32 | def plot_md_percent(data, **kwargs): 33 | """Plot the percentage of missing data by column within a DataFrame. 34 | 35 | Args: 36 | data (pd.DataFrame): DataFrame to plot. 37 | **kwargs: Keyword arguments for plot. Passed to missingno.bar. 38 | 39 | Returns: 40 | matplotlib.axes._subplots.AxesSubplot: missingness percent plot. 41 | 42 | Raises: 43 | TypeError: if data is not a DataFrame. Error raised through decorator. 44 | """ 45 | _default_plot_args(**kwargs) 46 | msno.bar(data, **kwargs) 47 | 48 | @check_data_structure 49 | def plot_nullility_corr(data, **kwargs): 50 | """Plot the nullility correlation of missing data within a DataFrame. 51 | 52 | Args: 53 | data (pd.DataFrame): DataFrame to plot. 54 | **kwargs: Keyword arguments for plot. Passed to missingno.heatmap. 55 | 56 | Returns: 57 | matplotlib.axes._subplots.AxesSubplot: nullility correlation plot. 58 | 59 | Raises: 60 | TypeError: if data is not a DataFrame. Error raised through decorator. 61 | ValueError: dataset fully observed. Raised through helper method. 62 | """ 63 | _fully_complete(data) 64 | _default_plot_args(**kwargs) 65 | msno.heatmap(data, **kwargs) 66 | 67 | @check_data_structure 68 | def plot_nullility_dendogram(data, **kwargs): 69 | """Plot the nullility dendogram of missing data within a DataFrame. 70 | 71 | Args: 72 | data (pd.DataFrame): DataFrame to plot. 73 | **kwargs: Keyword arguments for plot. Passed to missingno.dendogram. 74 | 75 | Returns: 76 | matplotlib.axes._subplots.AxesSubplot: nullility dendogram plot. 77 | 78 | Raises: 79 | TypeError: if data is not a DataFrame. Error raised through decorator. 80 | ValueError: dataset fully observed. Raised through helper method. 81 | """ 82 | _fully_complete(data) 83 | _default_plot_args(**kwargs) 84 | msno.dendrogram(data, **kwargs) 85 | -------------------------------------------------------------------------------- /docs/Makefile: -------------------------------------------------------------------------------- 1 | # Minimal makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | SOURCEDIR = source 8 | BUILDDIR = build 9 | 10 | # Put it first so that "make" without argument is like "make help". 11 | help: 12 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 13 | 14 | .PHONY: help Makefile 15 | 16 | # Catch-all target: route all unknown targets to Sphinx using the new 17 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 
18 | %: Makefile 19 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) -------------------------------------------------------------------------------- /docs/source/conf.py: -------------------------------------------------------------------------------- 1 | # pylint:disable=missing-docstring 2 | # pylint:disable=redefined-builtin 3 | # pylint:disable=too-few-public-methods 4 | # -*- coding: utf-8 -*- 5 | # 6 | # Configuration file for the Sphinx documentation builder. 7 | # 8 | # This file does only contain a selection of the most common options. For a 9 | # full list see the documentation: 10 | # http://www.sphinx-doc.org/en/master/config 11 | 12 | # -- Path setup -------------------------------------------------------------- 13 | 14 | # If extensions (or modules to document with autodoc) are in another directory, 15 | # add these directories to sys.path here. If the directory is relative to the 16 | # documentation root, use os.path.abspath to make it absolute, like shown here. 17 | # 18 | import os 19 | import sys 20 | import mock 21 | sys.path.insert(0, os.path.abspath("../..")) 22 | 23 | # must mock modules that contain C code or autodoc won't install libs 24 | MOCK_MODULES = [ 25 | "numpy", 26 | "numpy.ma", 27 | "pandas", 28 | "pandas.api.types", 29 | "matplotlib", 30 | "matplotlib.pyplot", 31 | "matplotlib.pylab", 32 | "seaborn", 33 | "missingno", 34 | "scipy", 35 | "scipy.stats", 36 | "scipy.cluster", 37 | "sklearn", 38 | "sklearn.base", 39 | "sklearn.metrics", 40 | "sklearn.utils.validation", 41 | "sklearn.linear_model", 42 | "statsmodels", 43 | "statsmodels.api", 44 | "statsmodels.discrete", 45 | "statsmodels.discrete.discrete_model", 46 | "pymc", 47 | "xgboost" 48 | ] 49 | for mod_name in MOCK_MODULES: 50 | sys.modules[mod_name] = mock.Mock() 51 | 52 | # dealing with mock metaclass issue 53 | class BaseEstimatorClass: 54 | pass 55 | 56 | class TransformerMixinClass: 57 | pass 58 | 59 | class ClassifierMixinClass: 60 | pass 61 | 62 | setattr(sys.modules["sklearn.base"], "BaseEstimator", BaseEstimatorClass) 63 | setattr(sys.modules["sklearn.base"], "TransformerMixin", TransformerMixinClass) 64 | setattr(sys.modules["sklearn.base"], "ClassifierMixin", ClassifierMixinClass) 65 | setattr(sys.modules["numpy"], "__version__", "1.15.4") 66 | 67 | 68 | # -- Project information ----------------------------------------------------- 69 | 70 | project = "Autoimpute" 71 | copyright = "2022, Joseph Kearney, Shahid Barkat" 72 | author = "Joseph Kearney, Shahid Barkat" 73 | 74 | # The short X.Y version 75 | version = "0.13.0" 76 | # The full version, including alpha/beta/rc tags 77 | release = "" 78 | 79 | 80 | # -- General configuration --------------------------------------------------- 81 | 82 | # If your documentation needs a minimal Sphinx version, state it here. 83 | # 84 | needs_sphinx = '1.8' 85 | 86 | # Add any Sphinx extension module names here, as strings. They can be 87 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 88 | # ones. 89 | extensions = [ 90 | "sphinx.ext.autodoc", 91 | "sphinx.ext.intersphinx", 92 | "sphinx.ext.todo", 93 | "sphinx.ext.mathjax", 94 | "sphinx.ext.ifconfig", 95 | "sphinx.ext.viewcode", 96 | "sphinx.ext.githubpages", 97 | "sphinx.ext.napoleon", 98 | "m2r" 99 | ] 100 | 101 | # Add any paths that contain templates here, relative to this directory. 102 | templates_path = ["_templates"] 103 | 104 | # The suffix(es) of source filenames. 
105 | # You can specify multiple suffixes as a list of strings: 106 | # 107 | source_suffix = [".rst", ".md"] 108 | #source_suffix = '.rst' 109 | 110 | # The master toctree document. 111 | master_doc = "index" 112 | 113 | # The language for content autogenerated by Sphinx. Refer to documentation 114 | # for a list of supported languages. 115 | # 116 | # This is also used if you do content translation via gettext catalogs. 117 | # Usually you set "language" from the command line for these cases. 118 | language = None 119 | 120 | # List of patterns, relative to source directory, that match files and 121 | # directories to ignore when looking for source files. 122 | # This pattern also affects html_static_path and html_extra_path. 123 | exclude_patterns = [] 124 | 125 | # The name of the Pygments (syntax highlighting) style to use. 126 | pygments_style = None 127 | 128 | 129 | # -- Options for HTML output ------------------------------------------------- 130 | 131 | # The theme to use for HTML and HTML Help pages. See the documentation for 132 | # a list of builtin themes. 133 | # 134 | html_theme = "sphinx_rtd_theme" 135 | 136 | # Theme options are theme-specific and customize the look and feel of a theme 137 | # further. For a list of options available for each theme, see the 138 | # documentation. 139 | # 140 | # html_theme_options = {} 141 | 142 | # Add any paths that contain custom static files (such as style sheets) here, 143 | # relative to this directory. They are copied after the builtin static files, 144 | # so a file named "default.css" will overwrite the builtin "default.css". 145 | html_static_path = ["_static"] 146 | 147 | # Custom sidebar templates, must be a dictionary that maps document names 148 | # to template names. 149 | # 150 | # The default sidebars (for documents that don't match any pattern) are 151 | # defined by theme itself. Builtin themes are using these templates by 152 | # default: ``['localtoc.html', 'relations.html', 'sourcelink.html', 153 | # 'searchbox.html']``. 154 | # 155 | # html_sidebars = {} 156 | 157 | 158 | # -- Options for HTMLHelp output --------------------------------------------- 159 | 160 | # Output file base name for HTML help builder. 161 | htmlhelp_basename = "Autoimputedoc" 162 | 163 | 164 | # -- Options for LaTeX output ------------------------------------------------ 165 | 166 | latex_elements = { 167 | # The paper size ('letterpaper' or 'a4paper'). 168 | # 169 | # 'papersize': 'letterpaper', 170 | 171 | # The font size ('10pt', '11pt' or '12pt'). 172 | # 173 | # 'pointsize': '10pt', 174 | 175 | # Additional stuff for the LaTeX preamble. 176 | # 177 | # 'preamble': '', 178 | 179 | # Latex figure (float) alignment 180 | # 181 | # 'figure_align': 'htbp', 182 | } 183 | 184 | # Grouping the document tree into LaTeX files. List of tuples 185 | # (source start file, target name, title, 186 | # author, documentclass [howto, manual, or own class]). 187 | latex_documents = [ 188 | (master_doc, "Autoimpute.tex", "Autoimpute Documentation", 189 | "Joseph Kearney, Shahid Barkat", "manual"), 190 | ] 191 | 192 | 193 | # -- Options for manual page output ------------------------------------------ 194 | 195 | # One entry per manual page. List of tuples 196 | # (source start file, name, description, authors, manual section).
197 | man_pages = [ 198 | (master_doc, "autoimpute", "Autoimpute Documentation", 199 | [author], 1) 200 | ] 201 | 202 | 203 | # -- Options for Texinfo output ---------------------------------------------- 204 | 205 | # Grouping the document tree into Texinfo files. List of tuples 206 | # (source start file, target name, title, author, 207 | # dir menu entry, description, category) 208 | texinfo_documents = [ 209 | (master_doc, "Autoimpute", "Autoimpute Documentation", 210 | author, "Autoimpute", "One line description of project.", 211 | "Miscellaneous"), 212 | ] 213 | 214 | 215 | # -- Options for Epub output ------------------------------------------------- 216 | 217 | # Bibliographic Dublin Core info. 218 | epub_title = project 219 | 220 | # The unique identifier of the text. This can be a ISBN number 221 | # or the project homepage. 222 | # 223 | # epub_identifier = '' 224 | 225 | # A unique identification for the text. 226 | # 227 | # epub_uid = '' 228 | 229 | # A list of files that should not be packed into the epub file. 230 | epub_exclude_files = ["search.html"] 231 | 232 | 233 | # -- Extension configuration ------------------------------------------------- 234 | 235 | # -- Options for intersphinx extension --------------------------------------- 236 | 237 | # Example configuration for intersphinx: refer to the Python standard library. 238 | intersphinx_mapping = {"https://docs.python.org/": None} 239 | 240 | # -- Options for todo extension ---------------------------------------------- 241 | 242 | # If true, `todo` and `todoList` produce output, else they produce nothing. 243 | todo_include_todos = True 244 | -------------------------------------------------------------------------------- /docs/source/index.rst: -------------------------------------------------------------------------------- 1 | Welcome to Autoimpute! 2 | ====================== 3 | 4 | .. image:: https://badge.fury.io/py/autoimpute.svg 5 | :target: https://badge.fury.io/py/autoimpute 6 | :alt: PyPI version 7 | 8 | .. image:: https://github.com/kearnz/autoimpute/actions/workflows/build.yml/badge.svg 9 | :target: https://github.com/kearnz/autoimpute/actions 10 | :alt: Build Status 11 | 12 | .. image:: https://readthedocs.org/projects/autoimpute/badge/?version=latest 13 | :target: https://autoimpute.readthedocs.io/en/latest/?badge=latest 14 | :alt: Documentation Status 15 | 16 | .. image:: https://img.shields.io/badge/License-MIT-blue.svg 17 | :target: https://lbesson.mit-license.org/ 18 | :alt: MIT license 19 | 20 | .. image:: https://img.shields.io/badge/python-3.8+-blue.svg 21 | :target: https://www.python.org/downloads/release/python-380/ 22 | :alt: Python 3.8+ 23 | 24 | Autoimpute_ is a Python package for analysis and implementation of Imputation Methods. 25 | 26 | .. _Autoimpute: https://pypi.org/project/autoimpute/ 27 | 28 | This page introduces users to the package and documents its features. 29 | 30 | Check out the package on github_, or head to our website_ to walk through some tutorials. 31 | 32 | .. _github: https://github.com/kearnz/autoimpute 33 | .. _website: https://kearnz.github.io/autoimpute-tutorials/ 34 | 35 | .. toctree:: 36 | :caption: User Guide 37 | :titlesonly: 38 | 39 | Getting Started 40 | Utility Methods 41 | Deletion and Imputation Strategies 42 | Single, Multiple, and Mice Imputers 43 | Missingness Classifier 44 | Analysis Models 45 | Visualization Methods 46 | 47 | We are actively developing `Autoimpute`, so sometimes the docs fall behind. 
The `README` is always up to date, as is the master branch. Therefore, consult those first. If you find discrepancies between the docs and the package, please still let us know! 48 | -------------------------------------------------------------------------------- /docs/source/user_guide/analysis.rst: -------------------------------------------------------------------------------- 1 | Analysis Models 2 | =============== 3 | 4 | This section documents analysis models within ``Autoimpute`` and their respective diagnostics. 5 | 6 | The ``MiLinearRegression`` and ``MiLogisticRegression`` extend linear and logistic regression to multiply imputed datasets. Under the hood, each regression class uses a ``MiceImputer`` to handle missing data prior to supervised analysis. Users of each regression class can tweak the underlying ``MiceImputer`` through the ``mi_kwgs`` argument or pass a pre-configured instance to the ``mi`` argument (recommended). 7 | 8 | Users can also specify whether the classes should use ``sklearn`` or ``statsmodels`` to implement linear or logistic regression. The default is ``statsmodels``, which gives end users more detailed parameter diagnostics for regression on multiply imputed data. 9 | 10 | Finally, this section provides diagnostic helper methods to assess bias of parameters from a regression model. 11 | 12 | Linear Regression for Multiply Imputed Data 13 | ------------------------------------------- 14 | 15 | .. autoclass:: autoimpute.analysis.MiLinearRegression 16 | :special-members: 17 | :members: 18 | 19 | 20 | Logistic Regression for Multiply Imputed Data 21 | --------------------------------------------- 22 | 23 | .. autoclass:: autoimpute.analysis.MiLogisticRegression 24 | :special-members: 25 | :members: 26 | 27 | 28 | Diagnostics 29 | ----------- 30 | 31 | .. autofunction:: autoimpute.analysis.raw_bias 32 | 33 | .. autofunction:: autoimpute.analysis.percent_bias 34 | -------------------------------------------------------------------------------- /docs/source/user_guide/getting_started.rst: -------------------------------------------------------------------------------- 1 | Getting Started 2 | =============== 3 | 4 | The sections below provide a high level overview of the ``Autoimpute`` package. This page takes you through installation, dependencies, main features, imputation methods supported, and basic usage of the package. It also provides links to get in touch with the authors, review our license, and learn how to contribute. 5 | 6 | Installation 7 | ------------ 8 | 9 | 10 | * Download ``Autoimpute`` with ``pip install autoimpute``. 11 | * If ``pip`` cached an older version, try ``pip install --no-cache-dir --upgrade autoimpute``. 12 | * If you want to work with the development branch, use the script below: 13 | 14 | *Development* 15 | 16 | .. code-block:: sh 17 | 18 | git clone -b dev --single-branch https://github.com/kearnz/autoimpute.git 19 | cd autoimpute 20 | python setup.py install 21 | 22 | Versions and Dependencies 23 | ------------------------- 24 | 25 | 26 | * Python 3.8+ 27 | * Dependencies: 28 | 29 | * ``numpy`` 30 | * ``scipy`` 31 | * ``pandas`` 32 | * ``statsmodels`` 33 | * ``scikit-learn`` 34 | * ``xgboost`` 35 | * ``pymc`` 36 | * ``seaborn`` 37 | * ``missingno`` 38 | 39 | *A note for Windows Users*\ : 40 | 41 | * Autoimpute 0.13.0+ has not been tested on Windows, so we cannot verify support for pymc there. Historically we've had some issues with pymc on Windows. 42 | * Autoimpute works on Windows but users may have trouble with pymc for bayesian methods. `(See discourse) `_ 43 | * Users may receive a runtime error ``‘can’t pickle fortran objects’`` when sampling using multiple chains. 44 | * There are a couple of things to do to try to overcome this error: 45 | 46 | * Reinstall theano and pymc. Make sure to delete the .theano cache in your home folder. 47 | * Upgrade joblib in the process, which is responsible for generating the error (pymc uses joblib under the hood). 48 | * Set ``cores=1`` in ``pm.sample``. This should be a last resort, as it means posterior sampling will use 1 core only (see the sketch after this list). Not using multiprocessing will slow down bayesian imputation methods significantly. 49 | 50 | * Reach out and let us know if you've worked through this issue successfully on Windows and have a better solution! 51 | 
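For example, here is a minimal sketch of the single-core workaround (the model itself is a stand-in for whatever pymc model gets built during imputation; it is not code from ``Autoimpute``):

.. code-block:: python

    import pymc as pm

    with pm.Model():
        mu = pm.Normal("mu", mu=0, sigma=1)
        # single-core sampling sidesteps the pickling error, at the cost of speed
        trace = pm.sample(1000, chains=2, cores=1)
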
52 | Main Features 53 | ------------- 54 | 55 | 56 | * Utility functions and basic visualizations to explore missingness patterns 57 | * Missingness classifier and automatic missing data test set generator 58 | * Native handling for categorical variables (as predictors and targets of imputation) 59 | * Single and multiple imputation classes for ``pandas`` ``DataFrames`` 60 | * Analysis methods and pooled parameter inference using multiply imputed datasets 61 | * Numerous imputation methods, as specified in the table below: 62 | 63 | Imputation Methods Supported 64 | ---------------------------- 65 | 66 | .. list-table:: 67 | :header-rows: 1 68 | 69 | * - Univariate 70 | - Multivariate 71 | - Time Series / Interpolation 72 | * - Mean 73 | - Linear Regression 74 | - Linear 75 | * - Median 76 | - Binomial Logistic Regression 77 | - Quadratic 78 | * - Mode 79 | - Multinomial Logistic Regression 80 | - Cubic 81 | * - Random 82 | - Stochastic Regression 83 | - Polynomial 84 | * - Norm 85 | - Bayesian Linear Regression 86 | - Spline 87 | * - Categorical 88 | - Bayesian Binary Logistic Regression 89 | - Time-weighted 90 | * - 91 | - Predictive Mean Matching 92 | - Next Obs Carried Backward 93 | * - 94 | - Local Residual Draws 95 | - Last Obs Carried Forward 96 | 97 | Example Usage 98 | ------------- 99 | 100 | Autoimpute is designed to be user friendly and flexible. When performing imputation, Autoimpute fits directly into ``scikit-learn`` machine learning projects. Imputers inherit from sklearn's ``BaseEstimator`` and ``TransformerMixin`` and implement ``fit`` and ``transform`` methods, making them valid Transformers in an sklearn pipeline. A short pipeline sketch appears below. 101 | 102 | Right now, there are three ``Imputer`` classes we'll work with: 103 | 104 | .. code-block:: python 105 | 106 | from autoimpute.imputations import SingleImputer, MultipleImputer, MiceImputer 107 | si = SingleImputer() # pass through data once 108 | mi = MultipleImputer() # pass through data multiple times 109 | mice = MiceImputer() # pass through data multiple times and iteratively optimize imputations in each column 110 | 111 | Imputations can be as simple as: 112 | 113 | .. code-block:: python 114 | 115 | # simple example using default instance of MiceImputer 116 | imp = MiceImputer() 117 | 118 | # fit transform returns a generator by default, calculating each imputation method lazily 119 | imp.fit_transform(data) 120 | 
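Because imputers are valid transformers, they also drop straight into an sklearn ``Pipeline``. A minimal sketch (the training data and the downstream estimator here are illustrative placeholders, not part of the package):

.. code-block:: python

    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import Pipeline
    from autoimpute.imputations import SingleImputer

    # impute missing values first, then fit a downstream model on the completed data
    pipe = Pipeline([
        ("imputer", SingleImputer(strategy="pmm")),
        ("model", LinearRegression()),
    ])
    pipe.fit(X_train, y_train)  # X_train: a pandas DataFrame with missing values
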
121 | Imputer configuration can also be quite complex, such as: 122 | 123 | .. code-block:: python 124 | 125 | # create a complex instance of the MiceImputer 126 | # Here, we specify strategies by column and predictors for each column 127 | # We also specify what additional arguments any `pmm` strategies should take 128 | imp = MiceImputer( 129 | n=10, 130 | strategy={"salary": "pmm", "gender": "bayesian binary logistic", "age": "norm"}, 131 | predictors={"salary": "all", "gender": ["salary", "education", "weight"]}, 132 | imp_kwgs={"pmm": {"fill_value": "random"}}, 133 | visit="left-to-right", 134 | return_list=True 135 | ) 136 | 137 | # Because we set return_list=True, imputations are done all at once, not evaluated lazily. 138 | # This returns M imputed datasets, where M is the number of imputations; each has the same shape as the original dataframe. 139 | imp.fit_transform(data) 140 | 141 | Autoimpute also extends supervised machine learning methods from ``scikit-learn`` and ``statsmodels`` to apply them to multiply imputed datasets (using the ``MiceImputer`` under the hood). Right now, Autoimpute supports linear regression and binary logistic regression. Additional supervised methods are currently under development. 142 | 143 | As with Imputers, Autoimpute's analysis methods can be simple or complex: 144 | 145 | .. code-block:: python 146 | 147 | from autoimpute.analysis import MiLinearRegression from sklearn.preprocessing import StandardScaler # needed for the scaler argument below 148 | 149 | # By default, use statsmodels OLS and MiceImputer() 150 | simple_lm = MiLinearRegression() 151 | 152 | # fit the model on each multiply imputed dataset and pool parameters 153 | simple_lm.fit(X_train, y_train) 154 | 155 | # get summary of fit, which includes pooled parameters under Rubin's rules 156 | # also provides diagnostics related to analysis after multiple imputation 157 | simple_lm.summary() 158 | 159 | # make predictions on a new dataset using pooled parameters 160 | predictions = simple_lm.predict(X_test) 161 | 162 | # Control both the regression used and the MiceImputer itself 163 | multiple_imputer_arguments = dict( 164 | n=3, 165 | strategy={"salary": "pmm", "gender": "bayesian binary logistic", "age": "norm"}, 166 | predictors={"salary": "all", "gender": ["salary", "education", "weight"]}, 167 | imp_kwgs={"pmm": {"fill_value": "random"}}, 168 | scaler=StandardScaler(), 169 | visit="left-to-right", 170 | verbose=True 171 | ) 172 | complex_lm = MiLinearRegression( 173 | model_lib="sklearn", # use sklearn linear regression 174 | mi_kwgs=multiple_imputer_arguments # control the multiple imputer 175 | ) 176 | 177 | # fit the model on each multiply imputed dataset 178 | complex_lm.fit(X_train, y_train) 179 | 180 | # get summary of fit, which includes pooled parameters under Rubin's rules 181 | # also provides diagnostics related to analysis after multiple imputation 182 | complex_lm.summary() 183 | 184 | # make predictions on new dataset using pooled parameters 185 | predictions = complex_lm.predict(X_test) 186 | 
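For reference, the pooled parameters that ``summary()`` reports follow Rubin's rules. A standard statement of the rules (included for orientation; not taken verbatim from the package source): given :math:`m` imputed datasets with point estimates :math:`\hat{Q}_i` and variances :math:`U_i`,

.. math::

    \bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad
    B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{Q}_i - \bar{Q}\right)^2, \qquad
    T = \bar{U} + \left(1 + \frac{1}{m}\right)B,

where :math:`\bar{U}` is the average within-imputation variance and :math:`T` is the total variance attached to the pooled estimate :math:`\bar{Q}`.
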
187 | Note that we can also pass a pre-specified ``MiceImputer`` to either analysis model instead of using ``mi_kwgs``. The option is ours, and it's a matter of preference. If we pass a pre-specified ``MiceImputer``\ , anything in ``mi_kwgs`` is ignored, although the ``mi_kwgs`` argument is still validated. 188 | 189 | .. code-block:: python 190 | 191 | from autoimpute.imputations import MiceImputer 192 | from autoimpute.analysis import MiLinearRegression 193 | 194 | # create a multiple imputer first 195 | custom_imputer = MiceImputer(n=3, strategy="pmm", return_list=True) 196 | 197 | # pass the imputer to a linear regression model 198 | complex_lm = MiLinearRegression(mi=custom_imputer, model_lib="statsmodels") 199 | 200 | # proceed the same as the previous examples 201 | complex_lm.fit(X_train, y_train).predict(X_test) 202 | complex_lm.summary() 203 | 204 | For a deeper understanding of how the package works and its features, see our `tutorials website `_. 205 | 206 | Creators and Maintainers 207 | ------------------------ 208 | 209 | 210 | * Joseph Kearney – `@kearnz `_ 211 | * Shahid Barkat - `@shabarka `_ 212 | 213 | See the `Authors `_ page to get in touch! 214 | 215 | License 216 | ------- 217 | 218 | Distributed under the MIT license. See `LICENSE `_ for more information. 219 | 220 | Contributing 221 | ------------ 222 | 223 | Guidelines for contributing to our project. See `CONTRIBUTING `_ for more information. 224 | 225 | Contributor Code of Conduct 226 | --------------------------- 227 | 228 | Adapted from the Contributor Covenant, version 1.0.0. See `Code of Conduct `_ for more information. 229 | -------------------------------------------------------------------------------- /docs/source/user_guide/imputers.rst: -------------------------------------------------------------------------------- 1 | DataFrame Imputers 2 | ================== 3 | 4 | This section documents the DataFrame Imputers within ``Autoimpute``. 5 | 6 | DataFrame Imputers are the primary feature of the package. The ``SingleImputer`` imputes each column within a DataFrame one time, while the ``MultipleImputer`` imputes each column within a DataFrame multiple times using independent runs. Under the hood, the ``MultipleImputer`` actually creates separate instances of the ``SingleImputer`` to handle each run. The ``MiceImputer`` takes the ``MultipleImputer`` one step further, iteratively improving imputations in each column ``k`` times for each of the ``n`` runs the 7 | ``MultipleImputer`` performs. A short usage sketch appears at the end of this page. 8 | 9 | The base class for all imputers is the ``BaseImputer``. While you should not use the ``BaseImputer`` directly unless you're creating your own imputer class, you should understand what it provides the other imputers. The ``BaseImputer`` also contains the strategy "key-value store", or the methods that ``autoimpute`` currently supports. 10 | 11 | Base Imputer 12 | ------------ 13 | 14 | .. autoclass:: autoimpute.imputations.BaseImputer 15 | :special-members: 16 | :members: 17 | 18 | Single Imputer 19 | -------------- 20 | 21 | .. autoclass:: autoimpute.imputations.SingleImputer 22 | :special-members: 23 | :members: 24 | 25 | Multiple Imputer 26 | ---------------- 27 | 28 | .. autoclass:: autoimpute.imputations.MultipleImputer 29 | :special-members: 30 | :members: 31 | 32 | Mice Imputer 33 | ------------- 34 | 35 | .. autoclass:: autoimpute.imputations.MiceImputer 36 | :special-members: 37 | :members: 38 | 
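To tie the classes together, here is a minimal ``MiceImputer`` sketch (``df`` stands in for any ``pandas`` DataFrame with missing values, and the argument values are illustrative; ``n`` and ``k`` are the run and iteration counts described above):

.. code-block:: python

    from autoimpute.imputations import MiceImputer

    # n independent runs; each column iteratively refined k times per run
    mice = MiceImputer(n=5, k=3)

    # fit_transform returns a generator by default; materialize it to get all runs
    imputations = list(mice.fit_transform(df))
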
-------------------------------------------------------------------------------- /docs/source/user_guide/missingness.rst: -------------------------------------------------------------------------------- 1 | Missingness Classifier 2 | ====================== 3 | 4 | .. automodule:: autoimpute.imputations.mis_classifier 5 | :special-members: 6 | :members: 7 | -------------------------------------------------------------------------------- /docs/source/user_guide/strategies.rst: -------------------------------------------------------------------------------- 1 | Deletion and Imputation Strategies 2 | ================================== 3 | 4 | This section documents deletion and imputation strategies within ``Autoimpute``. 5 | 6 | Deletion is implemented through a single function, ``listwise_delete``, documented below. 7 | 8 | Imputation strategies are implemented as classes. The authors of this package refer to these classes as "series-imputers". Each series-imputer maps to an imputation method - either univariate or multivariate - that imputes missing values within a pandas Series. The imputation methods are the workhorses of the DataFrame Imputers, the ``SingleImputer``, ``MultipleImputer``, and ``MiceImputer``. Refer to the :doc:`imputers documentation` for more information on the DataFrame Imputers. 9 | 10 | For more information regarding the relationship between DataFrame Imputers and series-imputers, refer to the following tutorial_. The tutorial covers series-imputers in detail as well as the design patterns behind ``AutoImpute`` Imputers. 11 | 12 | .. _tutorial: https://kearnz.github.io/autoimpute-tutorials/imputer-mechanics-II.html 13 | 14 | Deletion Methods 15 | ---------------- 16 | 17 | .. autofunction:: autoimpute.imputations.listwise_delete 18 | 19 | 20 | Imputation Strategies 21 | --------------------- 22 | 23 | .. automodule:: autoimpute.imputations.series 24 | :special-members: 25 | :members: 26 | -------------------------------------------------------------------------------- /docs/source/user_guide/utils.rst: -------------------------------------------------------------------------------- 1 | Utility Methods 2 | =============== 3 | 4 | .. automodule:: autoimpute.utils.patterns 5 | :members: 6 | -------------------------------------------------------------------------------- /docs/source/user_guide/visuals.rst: -------------------------------------------------------------------------------- 1 | Visualization Methods 2 | ===================== 3 | 4 | This section documents visualization methods within ``Autoimpute``. 5 | 6 | Visualization methods support all functionality within ``Autoimpute``, from missing data exploration to imputation analysis. The documentation below breaks down each visualization method and groups them into their respective categories. The categories represent other modules within ``Autoimpute``. 7 | 8 | NOTE: The visualization module is currently under development. While the functions outlined below are stable in ``0.12.x``, they might change thereafter. 9 | 10 | Utility 11 | ------- 12 | 13 | ``Autoimpute`` comes with a number of :doc:`utility methods` to examine missing data before imputation takes place. This package supports these methods with a number of visualization techniques to explore patterns within missing data. The primary techniques wrap the excellent `missingno package `_. ``Autoimpute`` simply leverages ``missingno`` to make its offerings familiar within this package's API design. The methods appear below, followed by a short usage sketch: 14 | 15 | .. autofunction:: autoimpute.visuals.plot_md_locations 16 | 17 | .. autofunction:: autoimpute.visuals.plot_md_percent 18 | 19 | .. autofunction:: autoimpute.visuals.plot_nullility_corr 20 | 21 | .. autofunction:: autoimpute.visuals.plot_nullility_dendogram 22 | 
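A minimal usage sketch (assuming, per the signatures above, that each function takes a ``pandas`` DataFrame; the CSV file is a hypothetical dataset with missing values):

.. code-block:: python

    import pandas as pd
    from autoimpute.visuals import plot_md_locations, plot_md_percent

    df = pd.read_csv("employees.csv")  # hypothetical dataset with missing values
    plot_md_locations(df)  # matrix view of observed vs. missing cells
    plot_md_percent(df)    # percent of values missing in each column
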
23 | Imputation 24 | ---------- 25 | 26 | Two main classes within ``Autoimpute`` are the :doc:`SingleImputer and MultipleImputer`. The visualization module within this package contains a number of techniques to visually assess the quality and performance of these imputers. The important methods appear below: 27 | 28 | .. autofunction:: autoimpute.visuals.helpers._validate_data 29 | 30 | .. autofunction:: autoimpute.visuals.plot_imp_scatter 31 | 32 | .. autofunction:: autoimpute.visuals.plot_imp_dists 33 | 34 | .. autofunction:: autoimpute.visuals.plot_imp_boxplots 35 | 36 | .. autofunction:: autoimpute.visuals.plot_imp_swarm 37 | 38 | .. autofunction:: autoimpute.visuals.plot_imp_strip 39 | -------------------------------------------------------------------------------- /pytest.ini: -------------------------------------------------------------------------------- 1 | [pytest] 2 | ;command-line options by default 3 | addopts = -v 4 | ;designate tests as prefix only 5 | python_functions = test_* 6 | ;filter deprecation warnings for collections in 3.8, import stuff, and logistic reg UserWarning 7 | filterwarnings = 8 | ignore:.*importing the ABCs*:DeprecationWarning 9 | ignore:Importing from numpy.testing.nosetester*:DeprecationWarning 10 | ignore:the imp module*:DeprecationWarning 11 | ignore:Multiple categories (c)*:UserWarning 12 | ignore::sklearn.exceptions.ConvergenceWarning -------------------------------------------------------------------------------- /requirements.readthedocs.txt: -------------------------------------------------------------------------------- 1 | mock 2 | m2r -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy>=1.22.0 2 | scipy>=1.6.0 3 | pandas>=1.3.0 4 | statsmodels>=0.12.0 5 | xgboost>=0.81 6 | scikit-learn>=1.0.0 7 | arviz>=0.11.0 8 | pymc>=4.1.0 9 | seaborn>=0.9.0 10 | missingno>=0.4.1 11 | pytest 12 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [metadata] 2 | description-file = README.md 3 | license-file = LICENSE 4 | 5 | [pylint] 6 | exclude = 7 | docs/* 8 | tests/* 9 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | """Setup autoimpute package.""" 2 | 3 | import io 4 | import os 5 | from setuptools import find_packages, setup 6 | # pylint:disable=exec-used 7 | 8 | # Package meta-data 9 | NAME = "autoimpute" 10 | DESCRIPTION = "Imputation Methods in Python" 11 | URL = "https://github.com/kearnz/autoimpute" 12 | EMAIL = "josephkearney14@gmail.com, shahidbarkat@gmail.com" 13 | AUTHOR = "Joseph Kearney, Shahid Barkat" 14 | REQUIRES_PYTHON = ">=3.8.0" 15 | INCLUDE_PACKAGE_DATA = True 16 | LICENSE = "MIT" 17 | VERSION = None 18 | REQUIRED = [ 19 | "numpy", 20 | "scipy", 21 | "pandas", 22 | "statsmodels", 23 | "xgboost", 24 | "scikit-learn", 25 | "pymc", 26 | "seaborn", 27 | "missingno" 28 | ] 29 | CLASSIFIERS = [ 30 | "Development Status :: 3 - Alpha", 31 | "Intended Audience :: Science/Research", 32 | "Intended Audience :: Developers", 33 | "License :: OSI Approved :: MIT License", 34 | "Programming Language :: Python", 35 | "Programming Language :: Python :: 3", 36 | "Programming Language :: Python :: 
3.8", 37 | "Programming Language :: Python :: 3.9", 38 | "Operating System :: OS Independent", 39 | "Topic :: Software Development", 40 | "Topic :: Scientific/Engineering" 41 | ] 42 | EXTRAS = {} 43 | 44 | here = os.path.abspath(os.path.dirname(__file__)) 45 | 46 | # Import the README and use it as the long description. 47 | try: 48 | with io.open(os.path.join(here, "README.md"), encoding="utf-8") as f: 49 | long_description = f.read() 50 | except FileNotFoundError: 51 | long_description = DESCRIPTION 52 | 53 | # Load the package's __version__.py module as a dictionary. 54 | about = {} 55 | if not VERSION: 56 | with open(os.path.join(here, NAME, "__version__.py")) as f: 57 | exec(f.read(), about) 58 | else: 59 | about["__version__"] = VERSION 60 | 61 | # Setup specification 62 | setup( 63 | name=NAME, 64 | version=about["__version__"], 65 | description=DESCRIPTION, 66 | long_description=long_description, 67 | long_description_content_type="text/markdown", 68 | author=AUTHOR, 69 | author_email=EMAIL, 70 | python_requires=REQUIRES_PYTHON, 71 | url=URL, 72 | packages=find_packages(exclude=("tests",)), 73 | install_requires=REQUIRED, 74 | extras_require=EXTRAS, 75 | include_package_data=INLCUDE_PACKAGE_DATA, 76 | license=LICENSE, 77 | classifiers=CLASSIFIERS, 78 | ) 79 | -------------------------------------------------------------------------------- /tests/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kearnz/autoimpute/b5ab3f42aabb6c4535c94ca2b0ab9d147cc93393/tests/__init__.py -------------------------------------------------------------------------------- /tests/test_imputations/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kearnz/autoimpute/b5ab3f42aabb6c4535c94ca2b0ab9d147cc93393/tests/test_imputations/__init__.py -------------------------------------------------------------------------------- /tests/test_imputations/test_mice_imputer.py: -------------------------------------------------------------------------------- 1 | """Tests written to ensure the MiceImputer in the imputations package works. 2 | Note that this also tests the MultipleImputer, which really just passes to 3 | the SingleImputer. SingleImputer has tests, some of which are the same as here. 4 | 5 | Tests use the pytest library. The tests in this module ensure the following: 6 | - `test_stochastic_predictive_imputer` test stochastic strategy. 7 | - `test_bayesian_reg_imputer` test bayesian regression strategy. 8 | - `test_bayesian_logistic_imputer` test bayesian logistic strategy. 9 | - `test_pmm_lrd_imputer` test pmm and lrd strategy. 
-------------------------------------------------------------------------------- /tests/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kearnz/autoimpute/b5ab3f42aabb6c4535c94ca2b0ab9d147cc93393/tests/__init__.py -------------------------------------------------------------------------------- /tests/test_imputations/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kearnz/autoimpute/b5ab3f42aabb6c4535c94ca2b0ab9d147cc93393/tests/test_imputations/__init__.py -------------------------------------------------------------------------------- /tests/test_imputations/test_mice_imputer.py: -------------------------------------------------------------------------------- 1 | """Tests written to ensure the MiceImputer in the imputations package works. 2 | Note that this also tests the MultipleImputer, which really just passes to 3 | the SingleImputer. SingleImputer has tests, some of which are the same as here. 4 | 5 | Tests use the pytest library. The tests in this module ensure the following: 6 | - `test_stochastic_predictive_imputer` test stochastic strategy. 7 | - `test_bayesian_reg_imputer` test bayesian regression strategy. 8 | - `test_bayesian_logistic_imputer` test bayesian logistic strategy. 9 | - `test_pmm_lrd_imputer` test pmm and lrd strategy. 10 | - `test_normal_unit_variance_imputer` test unit variance imputer. - `test_partial_dependence_imputer` test partial dependence edge case. 11 | """ 12 | 13 | import pytest 14 | from autoimpute.imputations import MiceImputer 15 | from autoimpute.utils import dataframes 16 | dfs = dataframes 17 | # pylint:disable=len-as-condition 18 | # pylint:disable=pointless-string-statement 19 | 20 | def test_stochastic_predictive_imputer(): 21 | """Test stochastic works for numerical columns of PredictiveImputer.""" 22 | # generate linear, then stochastic 23 | imp_p = MiceImputer(strategy={"A":"least squares"}) 24 | imp_s = MiceImputer(strategy={"A":"stochastic"}) 25 | # make sure both work 26 | _ = imp_p.fit_transform(dfs.df_num) 27 | _ = imp_s.fit_transform(dfs.df_num) 28 | assert imp_p.imputed_["A"] == imp_s.imputed_["A"] 29 | 30 | def test_bayesian_reg_imputer(): 31 | """Test bayesian works for numerical column of PredictiveImputer.""" 32 | # test designed first - test kwargs and params 33 | imp_b = MiceImputer(strategy={"y":"bayesian least squares"}, 34 | imp_kwgs={"y":{"fill_value": "random", 35 | "am": 11, "cores": 2}}) 36 | imp_b.fit_transform(dfs.df_bayes_reg) 37 | # test on numerical in general 38 | imp_n = MiceImputer(strategy="bayesian least squares") 39 | imp_n.fit_transform(dfs.df_num) 40 | 41 | def test_bayesian_logistic_imputer(): 42 | """Test bayesian works for binary column of PredictiveImputer.""" 43 | imp_b = MiceImputer(strategy={"y":"bayesian binary logistic"}, 44 | imp_kwgs={"y":{"fill_value": "random"}}) 45 | imp_b.fit_transform(dfs.df_bayes_log) 46 | 47 | def test_pmm_lrd_imputer(): 48 | """Test pmm and lrd work for numerical column of PredictiveImputer.""" 49 | # test pmm first - test kwargs and params 50 | imp_pmm = MiceImputer(strategy={"y":"pmm"}, 51 | imp_kwgs={"y": {"fill_value": "random", 52 | "copy_x": False}}) 53 | imp_pmm.fit_transform(dfs.df_bayes_reg) 54 | 55 | # test lrd second - test kwargs and params 56 | imp_lrd = MiceImputer(strategy={"y":"lrd"}, 57 | imp_kwgs={"y": {"fill_value": "random", 58 | "copy_x": False}}) 59 | imp_lrd.fit_transform(dfs.df_bayes_reg) 60 | 61 | def test_normal_unit_variance_imputer(): 62 | """Test normal unit variance imputer for numerical column""" 63 | imp_pmm = MiceImputer(strategy={"y":"normal unit variance"},) 64 | imp_pmm.fit_transform(dfs.df_bayes_reg) 65 | 66 | def test_partial_dependence_imputer(): 67 | """Test to ensure that the edge case for partial dependence is handled""" 68 | imp = MiceImputer(strategy='stochastic') 69 | imp.fit_transform(dfs.df_partial_dependence) 70 | -------------------------------------------------------------------------------- /tests/test_imputations/test_missingness_classifier.py: -------------------------------------------------------------------------------- 1 | """Tests written to ensure the MissingnessClassifier in the imputations package works. 2 | 3 | Tests use the pytest library. 
The tests in this module ensure the following: 4 | - `test_single_missing_column` tests that the bug in issue 56 is fixed 5 | """ 6 | 7 | from autoimpute.imputations import MissingnessClassifier 8 | from autoimpute.utils import dataframes 9 | dfs = dataframes 10 | # pylint:disable=len-as-condition 11 | # pylint:disable=pointless-string-statement 12 | 13 | 14 | def test_single_missing_column(): 15 | """Test that the missingness classifier works correctly""" 16 | imp = MissingnessClassifier() 17 | imp.fit_predict(dfs.df_mis_classifier) 18 | imp.fit_predict_proba(dfs.df_mis_classifier) 19 | -------------------------------------------------------------------------------- /tests/test_imputations/test_single_imputer.py: -------------------------------------------------------------------------------- 1 | """Tests written to ensure the SingleImputer in the imputations package works. 2 | 3 | Tests use the pytest library. The tests in this module ensure the following: 4 | - `test_single_missing_column` throw error when any column fully missing. 5 | - `test_bad_strategy` throw error if strategy is not allowed. 6 | - `test_imputer_strategies_not_allowed` test strategies misspecified. 7 | - `test_wrong_numerical_type` test valid strategy but not for numerical. 8 | - `test_wrong_categorical_type` test valid strategy but not for categorical. 9 | - `test_default_single_imputer` tests the simplest implementation: defaults. 10 | - `test_numerical_univar_imputers` test all numerical strategies. 11 | - `test_categorical_univar_imputers` test all categorical strategies. 12 | - `test_stochastic_predictive_imputer` test stochastic strategy. 13 | - `test_bayesian_reg_imputer` test bayesian regression strategy. 14 | - `test_bayesian_logistic_imputer` test bayesian logistic strategy. 15 | - `test_pmm_lrd_imputer` test pmm and lrd strategy. 
16 | - `test_normal_unit_variance_imputer` test unit variance imputer. - `test_partial_dependence_imputer` test partial dependence edge case. 17 | """ 18 | 19 | import pytest 20 | from autoimpute.imputations import SingleImputer 21 | from autoimpute.utils import dataframes 22 | dfs = dataframes 23 | # pylint:disable=len-as-condition 24 | # pylint:disable=pointless-string-statement 25 | 26 | def test_single_missing_column(): 27 | """Test that the imputer raises an error when a column is fully missing.""" 28 | with pytest.raises(ValueError): 29 | imp = SingleImputer() 30 | imp.fit_transform(dfs.df_col_miss) 31 | 32 | def test_bad_strategy(): 33 | """Test that strategies not supported throw a ValueError.""" 34 | with pytest.raises(ValueError): 35 | imp = SingleImputer(strategy="not_a_strategy") 36 | imp.fit_transform(dfs.df_num) 37 | 38 | def bad_imputers(): 39 | """Test supported strategies but improper column specification.""" 40 | # example with too few strategies specified for given DataFrame 41 | bad_list = SingleImputer(strategy=["mean", "median"]) 42 | # example with incorrect keys for given DataFrame 43 | bad_keys = SingleImputer(strategy={"X":"mean", "B":"median", "C":"mode"}) 44 | return [bad_list, bad_keys] 45 | 46 | @pytest.mark.parametrize("imp", bad_imputers()) 47 | def test_imputer_strategies_not_allowed(imp): 48 | """Test bad imputers""" 49 | with pytest.raises(ValueError): 50 | imp.fit_transform(dfs.df_num) 51 | 52 | def test_wrong_numerical_type(): 53 | """Test supported strategies but improper column type for strategy.""" 54 | num_for_cat = SingleImputer(strategy={"cats": "mean"}) 55 | with pytest.raises(TypeError): 56 | num_for_cat.fit_transform(dfs.df_ts_mixed) 57 | 58 | def test_wrong_categorical_type(): 59 | """Test supported strategies but improper column type for strategy.""" 60 | cat_for_num = SingleImputer(strategy="categorical") 61 | with pytest.raises(TypeError): 62 | cat_for_num.fit_transform(dfs.df_num) 63 | 64 | def test_default_single_imputer(): 65 | """Test the _default method and results for SingleImputer().""" 66 | imp = SingleImputer() 67 | # test df_num first 68 | # ----------------- 69 | # all strategies should default to pmm 70 | imp.fit_transform(dfs.df_num) 71 | for imputer in imp.statistics_.values(): 72 | strat = imputer.statistics_["strategy"] 73 | assert strat == "pmm" 74 | 75 | # test df_mix next 76 | # --------------- 77 | # numerical col should default to pmm 78 | # categorical col should default to multinomial logistic 79 | imp.fit_transform(dfs.df_mix) 80 | gender_imputer = imp.statistics_["gender"] 81 | salary_imputer = imp.statistics_["salary"] 82 | assert salary_imputer.statistics_["strategy"] == "pmm" 83 | assert gender_imputer.statistics_["strategy"] == "multinomial logistic" 84 | 85 | def test_numerical_univar_imputers(): 86 | """Test numerical methods when not using the _default.""" 87 | for num_strat in dfs.num_strategies: 88 | imp = SingleImputer(strategy=num_strat) 89 | imp.fit_transform(dfs.df_num) 90 | for imputer in imp.statistics_.values(): 91 | strat = imputer.statistics_["strategy"] 92 | assert strat == num_strat 93 | 94 | def test_categorical_univar_imputers(): 95 | """Test categorical methods when not using the _default.""" 96 | for cat_strat in dfs.cat_strategies: 97 | imp = SingleImputer(strategy={"cats": cat_strat}) 98 | imp.fit_transform(dfs.df_ts_mixed) 99 | for imputer in imp.statistics_.values(): 100 | strat = imputer.statistics_["strategy"] 101 | assert strat == cat_strat 102 | 103 | 104 | def test_stochastic_predictive_imputer(): 105 | """Test stochastic works for 
numerical columns of PredictiveImputer.""" 106 | # generate linear, then stochastic 107 | imp_p = SingleImputer(strategy={"A":"least squares"}) 108 | imp_s = SingleImputer(strategy={"A":"stochastic"}) 109 | # make sure both work 110 | _ = imp_p.fit_transform(dfs.df_num) 111 | _ = imp_s.fit_transform(dfs.df_num) 112 | assert imp_p.imputed_["A"] == imp_s.imputed_["A"] 113 | 114 | def test_bayesian_reg_imputer(): 115 | """Test bayesian works for numerical column of PredictiveImputer.""" 116 | # test designed first - test kwargs and params 117 | imp_b = SingleImputer(strategy={"y":"bayesian least squares"}, 118 | imp_kwgs={"y":{"fill_value": "random", 119 | "am": 11, "cores": 2}}) 120 | imp_b.fit_transform(dfs.df_bayes_reg) 121 | # test on numerical in general 122 | imp_n = SingleImputer(strategy="bayesian least squares") 123 | imp_n.fit_transform(dfs.df_num) 124 | 125 | def test_bayesian_logistic_imputer(): 126 | """Test bayesian works for binary column of PredictiveImputer.""" 127 | imp_b = SingleImputer(strategy={"y":"bayesian binary logistic"}, 128 | imp_kwgs={"y":{"fill_value": "random"}}) 129 | imp_b.fit_transform(dfs.df_bayes_log) 130 | 131 | def test_pmm_lrd_imputer(): 132 | """Test pmm and lrd work for numerical column of PredictiveImputer.""" 133 | # test pmm first - test kwargs and params 134 | imp_pmm = SingleImputer(strategy={"y":"pmm"}, 135 | imp_kwgs={"y": {"fill_value": "random", 136 | "copy_x": False}}) 137 | imp_pmm.fit_transform(dfs.df_bayes_reg) 138 | 139 | # test lrd second - test kwargs and params 140 | imp_lrd = SingleImputer(strategy={"y":"lrd"}, 141 | imp_kwgs={"y": {"fill_value": "random", 142 | "copy_x": False}}) 143 | imp_lrd.fit_transform(dfs.df_bayes_reg) 144 | 145 | def test_normal_unit_variance_imputer(): 146 | """Test normal unit variance imputer for numerical column""" 147 | imp_pmm = SingleImputer(strategy={"y":"normal unit variance"},) 148 | imp_pmm.fit_transform(dfs.df_bayes_reg) 149 | 150 | def test_partial_dependence_imputer(): 151 | """Test to ensure that the edge case for partial dependence is handled""" 152 | imp = SingleImputer(strategy='stochastic') 153 | imp.fit_transform(dfs.df_partial_dependence) 154 | -------------------------------------------------------------------------------- /tests/test_utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kearnz/autoimpute/b5ab3f42aabb6c4535c94ca2b0ab9d147cc93393/tests/test_utils/__init__.py -------------------------------------------------------------------------------- /tests/test_utils/test_checks.py: -------------------------------------------------------------------------------- 1 | """Tests written to ensure the decorators in the utils package work correctly. 2 | 3 | Tests use the pytest library. The tests in this module ensure the following: 4 | - `check_data_structure` requires pandas DataFrame. 5 | - `check_data_structure` raises errors for any other type of data structure. 6 | - `check_missingness` enforces datasets have observed and missing values. 7 | - `check_missingness` raises errors for fully missing datasets. 8 | - `check_missingness` raises errors for time series missing in datasets. 9 | - `check_nan_columns` raises errors when entire columns are missing. 
10 | """ 11 | 12 | import pytest 13 | import numpy as np 14 | import pandas as pd 15 | from autoimpute.utils.checks import check_data_structure, check_missingness 16 | from autoimpute.utils.checks import check_nan_columns 17 | 18 | @check_data_structure 19 | def check_data(data): 20 | """Helper function to test data structure decorator.""" 21 | return data 22 | 23 | @check_missingness 24 | def check_miss(data): 25 | """Helper function to test missingness decorator.""" 26 | return data 27 | 28 | @check_nan_columns 29 | def check_nan_cols(data): 30 | """Helper function to test removal of NaN columns.""" 31 | return data 32 | 33 | def data_structures_not_allowed(): 34 | """Types that should throw an error for `check_data_structure`.""" 35 | str_ = "string" 36 | int_ = 1 37 | float_ = 1.0 38 | set_ = set([1, 2, 3]) 39 | dict_ = dict(a=str_, b=int_, c=float_) 40 | list_ = [str_, int_, float_] 41 | tuple_ = tuple(list_) 42 | arr_ = np.array([1, 2, 3, 4]) 43 | ser_ = pd.Series({"a": arr_}) 44 | return [str_, int_, float_, set_, dict_, list_, tuple_, arr_, ser_] 45 | 46 | def data_stuctures_allowed(): 47 | """Types that should not throw an error and should return a valid array.""" 48 | df_ = pd.DataFrame({ 49 | "A": [1, 2, 3, 4], 50 | "B": ["a", "b", "c", "d"] 51 | }) 52 | return [df_] 53 | 54 | def missingness_not_allowed(): 55 | """Can't impute datasets that are fully complete or incomplete.""" 56 | df_none = pd.DataFrame({ 57 | "A": [np.nan, np.nan, np.nan], 58 | "B": [None, None, None] 59 | }) 60 | #df_ts = pd.DataFrame({ 61 | # "date": ["2018-05-01", "2018-05-02", "2018-05-03", 62 | # "2018-05-04", "2018-05-05", "2018-05-06", 63 | # "2018-05-07", "2018-05-08", "2018-05-09"], 64 | # "stats": [3, 4, np.nan, 15, 7, np.nan, 26, 25, 62] 65 | #}) 66 | #df_ts["date"] = pd.to_datetime(df_ts["date"], utc=True) 67 | #df_ts.loc[[1, 3], "date"] = np.nan 68 | return [df_none] 69 | 70 | @pytest.mark.parametrize("ds", data_structures_not_allowed()) 71 | def test_data_structures_not_allowed(ds): 72 | """Ensure data structure helper raises TypeError for disallowed types. 73 | 74 | Utilizes the pytest.mark.parametize method to run test on numerous data 75 | structures. Those data structures are returned from the helper method 76 | `data_structures_not_allowed()`, which returns a list of data structures. 77 | Each item within this list should cause this method to throw an error. 78 | Each item in the list takes the on the "ds" name in pytest. 79 | 80 | Args: 81 | ds (any -> iterator): any data structure within an iterator. `ds` is 82 | alias each item in the iterator takes when being tested. 83 | 84 | Returns: 85 | None: raises errors when improper types passed. 86 | 87 | Raises: 88 | TypeError: data structure `ds` is not allowed. 89 | """ 90 | with pytest.raises(TypeError): 91 | check_data(ds) 92 | 93 | @pytest.mark.parametrize("ds", data_stuctures_allowed()) 94 | def test_data_structures_allowed(ds): 95 | """Ensure data structure helper returns expected types. 96 | 97 | Utilizes the pytest.mark.parametize method to run test on numerous data 98 | structures. Those data structures are returned from the helper method 99 | `data_structures_allowed()`, which right now returns a DataFrame/Series. 100 | 101 | Args: 102 | ds (any -> iterator): any data structure within an iterator. `ds` is 103 | alias each item in the iterator takes when being tested. 104 | 105 | Returns: 106 | None: asserts that the appropriate type has been passed. 
107 | """ 108 | assert isinstance(ds, pd.DataFrame) 109 | 110 | @pytest.mark.parametrize("ds", missingness_not_allowed()) 111 | def test_missingness_not_allowed(ds): 112 | """Ensure missingness helper raises ValueError for fully missing DataFrame. 113 | 114 | Also utilizes the pytest.mark.parametize method to run test. Tests run on 115 | items in iterator returned from `missingness_not_allowed()`, which right 116 | now returns a fully missing DataFrame and a time series DataFrame with 117 | missingness in the time series itself. 118 | 119 | Args: 120 | ds (any -> iterator): any data structure within an iterator. `ds` is 121 | alias each item in the iterator takes when being tested. 122 | 123 | Returns: 124 | None: raises error because DataFrame is fully missing. 125 | 126 | Raises: 127 | ValueError: if the DataFrame is fully missing. 128 | """ 129 | with pytest.raises(ValueError): 130 | check_miss(ds) 131 | 132 | def test_nan_columns(): 133 | """Check missing columns throw error with check_nan_columns decorator. 134 | 135 | The `check_nan_columns` decorator should throw an error if any columns 136 | in the dataframe have all values missing. This test simulates data in a 137 | DataFrame where two of the columns have all missing values. The DataFrame 138 | is below. In this case, `B` and `C` should generate an error because they 139 | contain all missing values. Error message should capture both columns. 140 | 141 | Args: 142 | None: DataFrame hard-coded internally. 143 | 144 | Returns: 145 | None: asserts that fully missing columns generate error. 146 | """ 147 | df = pd.DataFrame({ 148 | "A": [1, np.nan, 3, 4], 149 | "B": [None, None, None, None], 150 | "C": [np.nan, np.nan, np.nan, np.nan], 151 | "D": ["a", "b", None, "d"] 152 | }) 153 | assert pd.isnull(df["B"]).all() 154 | assert pd.isnull(df["C"]).all() 155 | with pytest.raises(ValueError): 156 | check_nan_cols(df) 157 | -------------------------------------------------------------------------------- /tests/test_utils/test_patterns.py: -------------------------------------------------------------------------------- 1 | """Tests written to ensure the patterns in the utils package work correctly. 2 | 3 | The methods tested below are ports from MICE, an excellent R package for 4 | handling missing data. The author of MICE (Van Buuren) is also the author of 5 | Flexible Imputation of Missing Data (FIMD). The `df_general` variable below 6 | is a simulation of the "general" pattern in section 4.1. The subsequent 7 | dataframes are hard coded results of patterns from FIMD. They are used to 8 | verify that this implementation in Python is working as expected. The methods 9 | being tested (i.e. inbound, outbound, flux, etc.) all use `df_general` to 10 | calculate their respective statistic. This allows for comparison to results 11 | from FIMD, section 4.1. 12 | 13 | Tests use the pytest library. The tests in this module ensure the following: 14 | - `test_md_locations` checks missingness identified properly as 1/0. 15 | - `test_md_pattern` checks against result from MICE md.pattern. 16 | - `test_md_pairs` checks against result from MICE md.pairs 17 | - `test_inbound` checks against inbound calc in 4.1 (no explicit method) 18 | - `test_outbound` checks against outbound calc in 4.1 (no explicit method) 19 | - `test_flux` checks against MICE flux. 
20 | """ 21 | 22 | import numpy as np 23 | import pandas as pd 24 | from autoimpute.utils.patterns import md_locations, md_pairs, md_pattern 25 | from autoimpute.utils.patterns import inbound, outbound, flux 26 | 27 | df_general = pd.DataFrame({ 28 | "A": [1, 5, 9, 6, 12, 11, np.nan, np.nan], 29 | "B": [2, 4, 3, 6, 11, np.nan, np.nan, np.nan], 30 | "C": [-1, 1, np.nan, np.nan, np.nan, -1, 1, 0] 31 | }) 32 | 33 | df_pattern = pd.DataFrame({ 34 | "count": [2, 3, 1, 2], 35 | "A": [1, 1, 1, 0], 36 | "B": [1, 1, 0, 0], 37 | "C": [1, 0, 1, 1], 38 | "nmis": [0, 1, 1, 2] 39 | }) 40 | 41 | dict_pairs = {} 42 | ci = ["A", "B", "C"] 43 | create_df = lambda v: pd.DataFrame(v, columns=ci, index=ci) 44 | dict_pairs["rr"] = create_df([[6, 5, 3], [5, 5, 2], [3, 2, 5]]) 45 | dict_pairs["rm"] = create_df([[0, 1, 3], [0, 0, 3], [2, 3, 0]]) 46 | dict_pairs["mr"] = create_df([[0, 0, 2], [1, 0, 3], [3, 3, 0]]) 47 | dict_pairs["mm"] = create_df([[2, 2, 0], [2, 3, 0], [0, 0, 3]]) 48 | 49 | df_inbound = create_df([[0, 1/3, 1], [0, 0, 1], [1, 1, 0]]).T 50 | df_outbound = create_df([[0, 0, 0.4], [1/6, 0, 0.6], [0.5, 0.6, 0]]).T 51 | 52 | df_flux = pd.DataFrame({ 53 | "pobs": [0.75, 0.625, 0.625], 54 | "influx": [0.125, 0.250, 0.375], 55 | "outflux": [0.5, 0.375, 0.625] 56 | }, index=ci) 57 | 58 | def test_md_locations(): 59 | """Test to ensure that missingness locations are identified. 60 | 61 | Missingness locations should equal np.isnan for each col. 62 | Assert that md_locations returns a DataFrame, and then check 63 | that each column equals what is expected from np.isnan. 64 | 65 | Args: 66 | None: DataFrame for testing created internally. 67 | 68 | Returns: 69 | None: asserts locations for missingness are as expected. 70 | """ 71 | md_loc = md_locations(df_general) 72 | assert isinstance(md_loc, pd.DataFrame) 73 | assert all(md_loc["A"] == np.isnan(df_general["A"])) 74 | assert all(md_loc["B"] == np.isnan(df_general["B"])) 75 | assert all(md_loc["C"] == np.isnan(df_general["C"])) 76 | 77 | def test_md_pattern(): 78 | """Test that missing data pattern equal to expected results. 79 | 80 | `df_pattern` name assigned to DataFrame in module's scope that contains 81 | the expected pattern from VB 4.1 `md.pattern()` example in R. Result 82 | is hard coded, and python version tested with assertions below. 83 | 84 | Args: 85 | None: DataFrame for testing created internally. 86 | 87 | Returns: 88 | None: asserts missingness patterns are as expected. 89 | """ 90 | md_pat = md_pattern(df_general) 91 | assert isinstance(md_pat, pd.DataFrame) 92 | assert all(md_pat["count"] == df_pattern["count"]) 93 | assert all(md_pat[["A", "B", "C"]] == df_pattern[["A", "B", "C"]]) 94 | assert all(md_pat["nmis"] == df_pattern["nmis"]) 95 | 96 | def test_md_pairs(): 97 | """Test that missing data pairs equal to expected results. 98 | 99 | `dict_pairs` contains 4 keys - one for each pair expected. The pairs 100 | are `rr`, `mr`, `rm`, and `mm`. Pair types described in the docstrings 101 | of the md_pairs method in the utils.patterns module. Missing data pairs 102 | should equal expected pairs from VB 4.1 `md.pairs() in R. 103 | 104 | Args: 105 | None: Pairs for testing created internally. 106 | 107 | Returns: 108 | None: asserts pairs are as expected. 
109 | """ 110 | md_pair = md_pairs(df_general) 111 | assert isinstance(md_pair, dict) 112 | assert all(md_pair["rr"] == dict_pairs["rr"]) 113 | assert all(md_pair["mr"] == dict_pairs["mr"]) 114 | assert all(md_pair["rm"] == dict_pairs["rm"]) 115 | assert all(md_pair["mm"] == dict_pairs["mm"]) 116 | 117 | def test_inbound(): 118 | """Test that the inbound statistic equal to expected results. 119 | 120 | `df_inbound` contains hard-coded expected result. Tested against the 121 | inbound function, which takes `df_general` as an input. 122 | 123 | Args: 124 | None: DataFrame for testing created internally. 125 | 126 | Returns: 127 | None: asserts inbound statistic results are as expected. 128 | """ 129 | inbound_ = inbound(df_general) 130 | assert isinstance(inbound_, pd.DataFrame) 131 | assert all(inbound_["A"] == df_inbound["A"]) 132 | assert all(inbound_["B"] == df_inbound["B"]) 133 | assert all(inbound_["C"] == df_inbound["C"]) 134 | 135 | def test_outbound(): 136 | """Test that the outbound statistic equal to expected results. 137 | 138 | `df_outbound` contains hard-coded expected result. Tested against the 139 | outbound function, which takes `df_general` as an input. 140 | 141 | Args: 142 | None: DataFrame for testing created internally. 143 | 144 | Returns: 145 | None: asserts outbound statistic results are as expected. 146 | """ 147 | outbound_ = outbound(df_general) 148 | print(df_inbound) 149 | assert isinstance(outbound_, pd.DataFrame) 150 | assert all(outbound_["A"] == df_outbound["A"]) 151 | assert all(outbound_["B"] == df_outbound["B"]) 152 | assert all(outbound_["C"] == df_outbound["C"]) 153 | 154 | def test_flux(): 155 | """Test that the flux coeffs and proportions equal to expected results. 156 | 157 | `df_flux` contains hard-coded expected result. Tested against the 158 | influx, outflux, and proportions functions, which all take `df_general` 159 | as an input. 160 | 161 | Args: 162 | None: DataFrame for testing created internally. 163 | 164 | Returns: 165 | None: asserts pobs, influx, and outflux results are as expected. 166 | """ 167 | flux_ = flux(df_general) 168 | assert isinstance(flux_, pd.DataFrame) 169 | assert all(flux_["pobs"] == df_flux["pobs"]) 170 | assert all(flux_["influx"] == df_flux["influx"]) 171 | assert all(flux_["outflux"] == df_flux["outflux"]) 172 | -------------------------------------------------------------------------------- /tests/test_visuals/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kearnz/autoimpute/b5ab3f42aabb6c4535c94ca2b0ab9d147cc93393/tests/test_visuals/__init__.py --------------------------------------------------------------------------------