├── .github ├── ISSUE_TEMPLATE │ ├── bug_report.md │ ├── feature_request.md │ └── pull_request.md └── workflows │ └── build.yml ├── .gitignore ├── .pylintrc ├── AUTHORS.rst ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── MANIFEST.in ├── README.md ├── autoimpute ├── __init__.py ├── __version__.py ├── analysis │ ├── __init__.py │ ├── base_regressor.py │ ├── linear_regressor.py │ ├── logistic_regressor.py │ └── metrics.py ├── imputations │ ├── __init__.py │ ├── dataframe │ │ ├── __init__.py │ │ ├── base_imputer.py │ │ ├── mice_imputer.py │ │ ├── multiple_imputer.py │ │ └── single_imputer.py │ ├── deletion.py │ ├── errors.py │ ├── helpers.py │ ├── method_names.py │ ├── mis_classifier.py │ └── series │ │ ├── __init__.py │ │ ├── base.py │ │ ├── bayesian_regression.py │ │ ├── categorical.py │ │ ├── default.py │ │ ├── ffill.py │ │ ├── interpolation.py │ │ ├── linear_regression.py │ │ ├── logistic_regression.py │ │ ├── lrd.py │ │ ├── mean.py │ │ ├── median.py │ │ ├── mode.py │ │ ├── norm.py │ │ ├── norm_unit_variance.py │ │ ├── pmm.py │ │ └── random.py ├── utils │ ├── __init__.py │ ├── checks.py │ ├── dataframes.py │ ├── helpers.py │ └── patterns.py └── visuals │ ├── __init__.py │ ├── helpers.py │ ├── imputations.py │ └── utils.py ├── docs ├── Makefile └── source │ ├── conf.py │ ├── index.rst │ └── user_guide │ ├── analysis.rst │ ├── getting_started.rst │ ├── imputers.rst │ ├── missingness.rst │ ├── strategies.rst │ ├── utils.rst │ └── visuals.rst ├── pytest.ini ├── requirements.readthedocs.txt ├── requirements.txt ├── setup.cfg ├── setup.py └── tests ├── __init__.py ├── test_imputations ├── __init__.py ├── test_mice_imputer.py ├── test_missingness_classifier.py └── test_single_imputer.py ├── test_utils ├── __init__.py ├── test_checks.py └── test_patterns.py └── test_visuals └── __init__.py /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | Name: **Bug Report** 2 | About: **Create a bug report to help us fix what's broken in Autoimpute!** 3 | Labels: **bug** 4 | 5 | **Before Reporting** 6 | - Make sure you're trying to impute a `pandas DataFrame`. Other data structures not yet supported. 7 | - Make sure the `DataFrame` doesn't have any columns fully missing. Remove them before imputing. 8 | - Check active and resolved issues to ensure that someone has not already come across the same issue. 9 | 10 | **Describe the bug** 11 | - A clear and concise description of what the bug is. 12 | 13 | **To Reproduce** 14 | Discuss steps to reproduce the behavior. Minimum relevant information includes: 15 | 1. Operating system & Python version 16 | 2. Method or class used 17 | 3. Description of the data (# features, feature data types, etc.) 18 | 4. Arguments passed to method or class in question 19 | 5. Name of the error 20 | 21 | **Expected behavior** 22 | - A clear and concise description of what you expected to happen. 23 | 24 | **Code Samples & Screenshots** 25 | - If possible, provide the actual code that generated the bug 26 | - If applicable, add screenshots to help explain your problem 27 | - Screenshots of the traceback are especially helpful 28 | - Information about the DataFrame used is very helpful as well. 
Consider sharing `df.info()` 29 | 30 | **Additional Information to Provide:** 31 | - Operating System 32 | - Python Version 33 | - Anaconda Distribution & Package Versions -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | Name: **Feature Request** 2 | About: **Suggest an idea for what would make Autoimpute even better!** 3 | Labels: **enhancement** 4 | 5 | **Is your feature request related to a problem? Please describe.** 6 | - A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 7 | 8 | **Is your feature request something completely new? Please describe.** 9 | - A clear and concise description of what the new feature is. 10 | - This description should include where it fits within `autoimpute` 11 | - (i.e. an imputation method, a visualization, a utility, etc.). 12 | - In addition, describe why the new feature is necessary. 13 | - What does it bring to the package? 14 | - What will it improve or add? 15 | 16 | **Describe the solution you'd like** 17 | - A clear and concise description of what you want to happen. 18 | 19 | **Describe alternatives you've considered** 20 | - A clear and concise description of any alternative solutions or features you've considered. 21 | 22 | **Additional context** 23 | - Add any other context or screenshots about the feature request here. 24 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/pull_request.md: -------------------------------------------------------------------------------- 1 | Name: **Pull Request** 2 | About: **Request to merge your code into Autoimpute!** 3 | 4 | To make sure the pull request gets approved, you must follow the steps below: 5 | 6 | 1. Raise a bug report, create an issue, or request a new feature. Follow the templates for those to get started. 7 | 2. Once the authors review, develop the agreed-upon solution. Do so by creating a feature branch. 8 | 3. When finished coding, ensure that you've written a unit test. 9 | - Place the test in the tests/ directory. Choose the appropriate file, or ask us if you're unsure where to test. 10 | - Note that we use pytest, so you must prefix your function name with test_ to ensure it will run. 11 | 4. Run the tests, ensuring that your test is successful and no other tests have broken. 12 | 5. Issue a pull request to merge your branch into the **dev branch**. 13 | - We use GitHub Actions for CI, so pull requests to the dev branch run all appropriate integration testing. 14 | - If your tests succeed, the authors will rebase and merge your pull request. 15 | 6. 
Once merged to dev, we'll take care of merging to master (via a version bump or hotfix, depending on what you merged). 16 | -------------------------------------------------------------------------------- /.github/workflows/build.yml: -------------------------------------------------------------------------------- 1 | name: Tests for autoimpute 2 | on: 3 | push: 4 | branches: 5 | - dev 6 | - master 7 | 8 | env: 9 | BRANCH: ${{ github.ref_name }} 10 | 11 | jobs: 12 | 13 | pybuild: 14 | name: Python setup and Unit Tests 15 | runs-on: ubuntu-latest 16 | steps: 17 | 18 | # step 1 - echo branch 19 | - name: Print vars 20 | run: | 21 | echo "Branch: $BRANCH" 22 | 23 | # step 2 - checkout repo 24 | - name: Checkout 25 | uses: actions/checkout@v2 26 | 27 | # step 3 - set up python 28 | - name: Set up Python 29 | uses: actions/setup-python@v1 30 | with: 31 | python-version: 3.8 32 | 33 | # step 4 - check cache & install dependencies 34 | - name: Check cache 35 | uses: actions/cache@v1 36 | id: cache 37 | with: 38 | path: ~/.cache/pip 39 | key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }} 40 | restore-keys: | 41 | ${{ runner.os }}-pip- 42 | 43 | - name: Install dependencies 44 | run: | 45 | pip install -r requirements.txt 46 | 47 | # step 5 - run unit tests 48 | - name: Test with pytest 49 | run: | 50 | pytest -v 51 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | *.swp 6 | .gitignore 7 | 8 | # C extensions 9 | *.so 10 | 11 | # vscode 12 | .vscode/ 13 | 14 | # Distribution / packaging 15 | .Python 16 | build/ 17 | develop-eggs/ 18 | dist/ 19 | downloads/ 20 | eggs/ 21 | .eggs/ 22 | lib/ 23 | lib64/ 24 | parts/ 25 | sdist/ 26 | var/ 27 | wheels/ 28 | *.egg-info/ 29 | .installed.cfg 30 | *.egg 31 | MANIFEST 32 | 33 | # Todo lists 34 | .todo.txt 35 | 36 | # PyInstaller 37 | # Usually these files are written by a python script from a template 38 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
39 | *.manifest 40 | *.spec 41 | 42 | # Installer logs 43 | pip-log.txt 44 | pip-delete-this-directory.txt 45 | 46 | # Unit test / coverage reports 47 | htmlcov/ 48 | .tox/ 49 | .coverage 50 | .coverage.* 51 | .cache 52 | nosetests.xml 53 | coverage.xml 54 | *.cover 55 | .hypothesis/ 56 | .pytest_cache/ 57 | 58 | # Translations 59 | *.mo 60 | *.pot 61 | 62 | # Django stuff: 63 | *.log 64 | local_settings.py 65 | db.sqlite3 66 | 67 | # Flask stuff: 68 | instance/ 69 | .webassets-cache 70 | 71 | # Scrapy stuff: 72 | .scrapy 73 | 74 | # Sphinx documentation 75 | docs/_build/ 76 | 77 | # PyBuilder 78 | target/ 79 | 80 | # Jupyter Notebook 81 | .ipynb_checkpoints 82 | 83 | # pyenv 84 | .python-version 85 | 86 | # celery beat schedule file 87 | celerybeat-schedule 88 | 89 | # SageMath parsed files 90 | *.sage.py 91 | 92 | # Environments 93 | .env 94 | .venv 95 | env/ 96 | venv/ 97 | ENV/ 98 | env.bak/ 99 | venv.bak/ 100 | 101 | # Spyder project settings 102 | .spyderproject 103 | .spyproject 104 | 105 | # Rope project settings 106 | .ropeproject 107 | 108 | # mkdocs documentation 109 | /site 110 | 111 | # mypy 112 | .mypy_cache/ 113 | -------------------------------------------------------------------------------- /AUTHORS.rst: -------------------------------------------------------------------------------- 1 | AutoImpute is developed and maintained by Joseph Kearney and Shahid Barkat. 2 | 3 | Contacting the Authors 4 | ``````````````````````` 5 | - Joseph Kearney `@kearnz <https://github.com/kearnz>`_ 6 | - Shahid Barkat `@shabarka <https://github.com/shabarka>`_ 7 | - Arnab Bose `@bosearnab <https://github.com/bosearnab>`_ 8 | 9 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Code of Conduct 2 | 3 | As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities. 4 | 5 | We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion. 6 | 7 | Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct. 8 | 9 | Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team. 10 | 11 | Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers. 12 | 13 | This Code of Conduct is adapted from the Contributor Covenant, version 1.0.0, available from http://contributor-covenant.org/version/1/0/0/. -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing to Autoimpute 2 | Welcome to `autoimpute`! We appreciate your desire to contribute, and we're excited to see how you can add to or improve our project! 
We've outlined some guidelines to make contributing easy and ensure the contribution process is smooth. 3 | 4 | Jump to specific sections: 5 | * [Getting Started](#getting-started) 6 | * [Code of Conduct](#code-of-conduct) 7 | * [Reporting Bugs](#reporting-bugs) 8 | * [New Features](#new-features) 9 | * [Pull Requests](#pull-requests) 10 | 11 | ## Getting Started 12 | If you're completely new to `autoimpute`, read through our [README](https://github.com/kearnz/autoimpute/blob/master/README.md). To learn more about using the package and review sample use cases, check out our [tutorials](https://kearnz.github.io/autoimpute-tutorials/). 13 | 14 | ## Code of Conduct 15 | Before continuing, please read our [Code of Conduct](https://github.com/kearnz/autoimpute/blob/master/CODE_OF_CONDUCT.md). Our code of conduct extends to questions, issues raised, pull requests, and code itself. 16 | 17 | ## Reporting Bugs 18 | Because `autoimpute` is in rapid development, bugs are inevitable. We do our best to avoid them through testing and squash them quickly if they arise. That being said, we rely on users to help us find bugs that we missed. As noted in the [Pull Requests](#pull-requests) section, always start with a bug report before you actually submit a pull request. A bug report gives us a chance to discuss the bug in more detail and ensure a pull request develops smoothly. Here are some additional pointers for reporting bugs. 19 | 20 | #### Before Reporting Bugs 21 | * Make sure the bug occurs when using the master branch. 22 | * Check other active branches to ensure the bug is not being worked on or already fixed and slated for a future release. 23 | * Comb through the [Issues Board](https://github.com/kearnz/autoimpute/issues) to see if the bug has already been identified. 24 | * For errors, make sure that `autoimpute` is not raising the error as expected behavior. 25 | * For unexpected behavior, determine if any implicit assumptions from `autoimpute` logic produce the output. 26 | 27 | #### When Reporting Bugs 28 | * Follow the guidelines specified in our [Bug Report Issue Template](https://github.com/kearnz/autoimpute/blob/master/.github/ISSUE_TEMPLATE/bug_report.md) 29 | * If you're unsure who should act on this bug, assign it to `kearnz` by default. 30 | 31 | #### After Reporting Bugs 32 | * Stay involved! Monitor the issue thread. Be prepared to respond to comments & questions we may have. 33 | * Keep us updated! If you have more info to provide after the fact, follow up with us in a comment. 34 | * After acknowledgement and discussion, submit a pull request! Follow the pull request guidelines to do so. 35 | 36 | ## New Features 37 | Missing data imputation is a vast field. While we hope to cover as much as we possibly can, it'd be impossible for us to do it all. As noted in the [Pull Requests](#pull-requests) section, always start with a new feature request before you actually submit a pull request. A new feature request gives us a chance to discuss the feature in more detail and ensure a pull request develops smoothly. Here are some additional pointers for submitting new features. 38 | 39 | #### Before Suggesting New Features 40 | * Make sure the feature does not exist within any active branch. 41 | * If the feature exists in a stale branch, get in contact with us first. 42 | * Comb through the [Issues Board](https://github.com/kearnz/autoimpute/issues) to see if the feature has already been requested. 
43 | 44 | #### When Suggesting New Features 45 | * Follow the guidelines specified in our [New Feature Issue Template](https://github.com/kearnz/autoimpute/blob/master/.github/ISSUE_TEMPLATE/feature_request.md) 46 | * If you're unsure who should act on this feature request, assign it to `kearnz` by default. 47 | 48 | #### After Suggesting New Features 49 | * Stay involved! Monitor the issue thread. Be prepared to respond to comments & questions we may have. 50 | * Keep us updated! If you have more info to provide after the fact, follow up with us in a comment. 51 | * After acknowledgement and discussion, submit a pull request! Follow the pull request guidelines to do so. 52 | 53 | ## Pull Requests 54 | We've open sourced `autoimpute` early on so colleagues and students can use the package. We have completed what we feel is the first phase of `autoimpute`, and we are ready to accept pull requests from those who want to contribute. As noted in the previous sections, never issue a pull request to fix a bug or add a new feature until you've submitted a bug report or new feature request. This intermediary step helps us understand your issue, discuss it, and acknowledge the work to be done. We believe it makes the pull request process more successful! Below are some additional pointers for pull requests. 55 | 56 | #### Before Submitting a Pull Request 57 | * Raise an issue, create a new feature request, or alert us via a bug report BEFORE you submit a pull request. 58 | * Comb through existing [pull requests](https://github.com/kearnz/autoimpute/pulls) to ensure that a similar or identical pull request is not in the queue. 59 | 60 | #### When Submitting a Pull Request 61 | * Follow the guidelines specified in our [Pull Request Template](https://github.com/kearnz/autoimpute/blob/master/.github/ISSUE_TEMPLATE/pull_request.md) 62 | * If you're unsure who should act on this pull request or review the code, assign it to `kearnz` by default. 63 | 64 | #### After Submitting a Pull Request 65 | * Let us know if you'd like your name and profile added to an acknowledgement board! 66 | * Get in touch if you'd like to collaborate further on the project - we have plenty of ideas and need bandwidth! 67 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 kearnz, shabarka 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include README.md AUTHORS.rst LICENSE 2 | global-exclude *.py[cod] __pycache__ *.so 3 | prune tests -------------------------------------------------------------------------------- /autoimpute/__init__.py: -------------------------------------------------------------------------------- 1 | """Manage the autoimpute package. 2 | 3 | This module handles imports that should be accessible from the top level. 4 | Because this package contains specific directories for imputation, diagnostics, 5 | exploratory data analysis, and visualizations, no classes or methods should be 6 | accessible from autoimpute itself. Users of this package should import specific 7 | classes or functions they need from the appropriate folder. 8 | 9 | Examples of correctly specified imports: 10 | - import autoimpute.imputations as ai 11 | - from autoimpute.imputations import MissingnessClassifier 12 | - from autoimpute.utils import md_pairs, md_pattern, flux 13 | 14 | Examples of incorrectly specified imports: 15 | - import autoimpute as ai (gives folder access only) 16 | - from autoimpute import * (wildcard imports discouraged and overridden) 17 | 18 | This module handles `from autoimpute import *` with the __all__ variable below. 19 | This command imports the major directories from autoimpute. 20 | """ 21 | 22 | from .__version__ import __version__ 23 | 24 | __all__ = [ 25 | "utils", 26 | "visuals", 27 | "imputations", 28 | "analysis" 29 | ] 30 | -------------------------------------------------------------------------------- /autoimpute/__version__.py: -------------------------------------------------------------------------------- 1 | """Version specification.""" 2 | 3 | VERSION = (0, 14, 1) 4 | 5 | __version__ = ".".join(map(str, VERSION)) 6 | -------------------------------------------------------------------------------- /autoimpute/analysis/__init__.py: -------------------------------------------------------------------------------- 1 | """Manage the analysis folder from the autoimpute package. 2 | 3 | This module handles imports from the analysis folder that should be accessible 4 | whenever someone imports autoimpute.analysis. The list below specifies the 5 | methods and classes that are available on import. 6 | 7 | This module handles `from autoimpute.analysis import *` with the __all__ 8 | variable below. This command imports the public classes and methods from 9 | autoimpute.analysis. 
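A minimal usage sketch (hypothetical names: assumes a pandas DataFrame `X`
with missing values among the predictors and a complete response `y`):

    from autoimpute.analysis import MiLinearRegression
    lm = MiLinearRegression()   # multiply imputes internally before fitting
    lm.fit(X, y)                # fits m models, pools params via Rubin's rules
    print(lm.summary())         # pooled coefficients and variance ratios
    preds = lm.predict(X)       # predictions from the pooled parameters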
10 | """ 11 | 12 | from .base_regressor import MiBaseRegressor 13 | from .linear_regressor import MiLinearRegression 14 | from .logistic_regressor import MiLogisticRegression 15 | from .metrics import raw_bias, percent_bias 16 | 17 | __all__ = [ 18 | "MiBaseRegressor", 19 | "MiLinearRegression", 20 | "MiLogisticRegression", 21 | "raw_bias", 22 | "percent_bias" 23 | ] 24 | -------------------------------------------------------------------------------- /autoimpute/analysis/linear_regressor.py: -------------------------------------------------------------------------------- 1 | """Module containing linear regression for multiply imputed datasets.""" 2 | 3 | import pandas as pd 4 | from sklearn.base import BaseEstimator 5 | from sklearn.linear_model import LinearRegression 6 | from sklearn.utils.validation import check_is_fitted 7 | from statsmodels.api import OLS 8 | from autoimpute.utils import check_nan_columns 9 | from .base_regressor import MiBaseRegressor 10 | 11 | # pylint:disable=attribute-defined-outside-init 12 | # pylint:disable=too-many-locals 13 | 14 | class MiLinearRegression(MiBaseRegressor, BaseEstimator): 15 | """Linear Regression wrapper for multiply imputed datasets. 16 | 17 | The MiLinearRegression class wraps the sklearn and statsmodels libraries 18 | to extend linear regression to multiply imputed datasets. The class wraps 19 | statsmodels as well as sklearn because sklearn alone does not provide 20 | sufficient functionality to pool estimates under Rubin's rules. sklearn is 21 | for machine learning; therefore, important inference capabilities are 22 | lacking, such as easily calculating std. error estimates for parameters. 23 | If users want inference from regression analysis of multiply imputed 24 | data, utilze the statsmodels implementation in this class instead. 25 | 26 | Attributes: 27 | linear_models (dict): linear models used by supported python libs. 28 | """ 29 | 30 | linear_models = { 31 | "type": "linear", 32 | "statsmodels": OLS, 33 | "sklearn": LinearRegression 34 | } 35 | 36 | def __init__(self, mi=None, model_lib="statsmodels", mi_kwgs=None, 37 | model_kwgs=None): 38 | """Create an instance of the Autoimpute MiLinearRegression class. 39 | 40 | Args: 41 | mi (MiceImputer, Optional): An instance of a MiceImputer. 42 | Default is none. Can create one through `mi_kwgs` instead. 43 | model_lib (str, Optional): library the regressor will use to 44 | implement regression. Options are sklearn and statsmodels. 45 | Default is statsmodels. 46 | mi_kwgs (dict, Optional): keyword args to instantiate 47 | MiceImputer. Default is None. If valid MiceImputer 48 | passed as mi argument, then mi_kwgs ignored. 49 | model_kwgs (dict, Optional): keyword args to instantiate 50 | regressor. Default is None. 51 | 52 | Returns: 53 | self. Instance of the class. 54 | """ 55 | MiBaseRegressor.__init__( 56 | self, 57 | mi=mi, 58 | model_lib=model_lib, 59 | mi_kwgs=mi_kwgs, 60 | model_kwgs=model_kwgs 61 | ) 62 | 63 | @check_nan_columns 64 | def fit(self, X, y): 65 | """Fit model specified to multiply imputed dataset. 66 | 67 | Fit a linear regression on multiply imputed datasets. The method first 68 | creates multiply imputed data using the MiceImputer instantiated 69 | when creating an instance of the class. It then runs a linear model on 70 | each m datasets. The linear model comes from sklearn or statsmodels. 71 | Finally, the fit method calculates pooled parameters from the m linear 72 | models. 
Note that variance for pooled parameters using Rubin's rules 73 | is available for statsmodels only. sklearn does not implement 74 | parameter inference out of the box. Autoimpute sklearn pooling TBD. 75 | 76 | Args: 77 | X (pd.DataFrame): predictors to use. can contain missingness. 78 | y (pd.Series, pd.DataFrame): response. can contain missingness. 79 | 80 | Returns: 81 | self. Instance of the class 82 | """ 83 | 84 | # retain columns in case encoding occurs 85 | self.fit_X_columns = X.columns.tolist() 86 | 87 | # generate the imputation datasets from multiple imputation 88 | # then fit the analysis models on each of the imputed datasets 89 | self.models_ = self._apply_models_to_mi_data( 90 | self.linear_models, X, y 91 | ) 92 | 93 | # generate the fit statistics from each of the m models 94 | self.statistics_ = self._get_stats_from_models(self.models_) 95 | 96 | # still return an instance of the class 97 | return self 98 | 99 | @check_nan_columns 100 | def predict(self, X): 101 | """Make predictions using statistics generated from fit. 102 | 103 | The regression uses the pooled parameters from each of the imputed 104 | datasets to generate a set of single predictions. The pooled params 105 | come from multiply imputed datasets, but the predictions themselves 106 | follow the same rules as an ordinary linear regression. 107 | 108 | Args: 109 | X (pd.DataFrame): data to make predictions using pooled params. 110 | 111 | Returns: 112 | np.array: predictions. 113 | """ 114 | 115 | # validation before prediction 116 | X = self._predict_strategy_validator(self, X) 117 | 118 | # get the alpha and betas, then create linear equation for predictions 119 | alpha = self.statistics_["coefs"].values[0] 120 | betas = self.statistics_["coefs"].values[1:] 121 | preds = alpha + betas.dot(X.T) 122 | return preds 123 | 124 | def summary(self): 125 | """Provide a summary for model parameters, variance, and metrics. 126 | 127 | The summary method brings together the statistics generated from fit 128 | as well as the variance ratios, if available. The statistics are far 129 | more valuable when using statsmodels than sklearn. 130 | 131 | Returns: 132 | pd.DataFrame: summary statistics 133 | """ 134 | 135 | # only possible once we've fit a model with statsmodels 136 | check_is_fitted(self, "statistics_") 137 | sdf = pd.DataFrame(self.statistics_) 138 | sdf.rename(columns={"lambda_": "lambda"}, inplace=True) 139 | return sdf 140 | -------------------------------------------------------------------------------- /autoimpute/analysis/logistic_regressor.py: -------------------------------------------------------------------------------- 1 | """Module containing logistic regression for multiply imputed datasets.""" 2 | 3 | import numpy as np 4 | import pandas as pd 5 | from sklearn.base import BaseEstimator 6 | from sklearn.linear_model import LogisticRegression 7 | from sklearn.utils.validation import check_is_fitted 8 | from statsmodels.discrete.discrete_model import Logit 9 | from autoimpute.utils import check_nan_columns 10 | from .base_regressor import MiBaseRegressor 11 | 12 | # pylint:disable=attribute-defined-outside-init 13 | # pylint:disable=too-many-locals 14 | 15 | class MiLogisticRegression(MiBaseRegressor, BaseEstimator): 16 | """Logistic Regression wrapper for multiply imputed datasets. 17 | 18 | The MiLogisticRegression class wraps the sklearn and statsmodels libraries 19 | to extend logistic regression to multiply imputed datasets. 
The class wraps 20 | statsmodels as well as sklearn because sklearn alone does not provide 21 | sufficient functionality to pool estimates under Rubin's rules. sklearn is 22 | for machine learning; therefore, important inference capabilities are 23 | lacking, such as easily calculating std. error estimates for parameters. 24 | If users want inference from regression analysis of multiply imputed 25 | data, utilize the statsmodels implementation in this class instead. 26 | 27 | Attributes: 28 | logistic_models (dict): logistic models used by supported python libs. 29 | """ 30 | 31 | logistic_models = { 32 | "type": "logistic", 33 | "statsmodels": Logit, 34 | "sklearn": LogisticRegression 35 | } 36 | 37 | def __init__(self, mi=None, model_lib="statsmodels", mi_kwgs=None, 38 | model_kwgs=None): 39 | """Create an instance of the Autoimpute MiLogisticRegression class. 40 | 41 | Args: 42 | mi (MiceImputer, Optional): An instance of a MiceImputer. 43 | Default is None. Can create one through `mi_kwgs` instead. 44 | model_lib (str, Optional): library the regressor will use to 45 | implement regression. Options are sklearn and statsmodels. 46 | Default is statsmodels. 47 | mi_kwgs (dict, Optional): keyword args to instantiate 48 | MiceImputer. Default is None. If valid MiceImputer 49 | passed as `mi` argument, then `mi_kwgs` ignored. 50 | model_kwgs (dict, Optional): keyword args to instantiate 51 | regressor. Default is None. 52 | 53 | Returns: 54 | self. Instance of the class. 55 | """ 56 | MiBaseRegressor.__init__( 57 | self, 58 | mi=mi, 59 | model_lib=model_lib, 60 | mi_kwgs=mi_kwgs, 61 | model_kwgs=model_kwgs 62 | ) 63 | 64 | @check_nan_columns 65 | def fit(self, X, y): 66 | """Fit model specified to multiply imputed dataset. 67 | 68 | Fit a logistic regression on multiply imputed datasets. The method 69 | creates multiply imputed data using the MiceImputer instantiated 70 | when creating an instance of the class. It then runs a logistic model 71 | on each of the m datasets. The logistic model comes from sklearn or statsmodels. 72 | Finally, the fit method calculates pooled parameters from m logistic 73 | models. Note that variance for pooled parameters using Rubin's rules 74 | is available for statsmodels only. sklearn does not implement 75 | parameter inference out of the box. 76 | 77 | Args: 78 | X (pd.DataFrame): predictors to use. can contain missingness. 79 | y (pd.Series, pd.DataFrame): response. can contain missingness. 80 | 81 | Returns: 82 | self. Instance of the class 83 | """ 84 | 85 | # retain columns in case encoding occurs 86 | self.fit_X_columns = X.columns.tolist() 87 | 88 | # generate the imputation datasets from multiple imputation 89 | # then fit the analysis models on each of the imputed datasets 90 | self.models_ = self._apply_models_to_mi_data( 91 | self.logistic_models, X, y 92 | ) 93 | 94 | # generate the fit statistics from each of the m models 95 | self.statistics_ = self._get_stats_from_models(self.models_) 96 | 97 | # still return an instance of the class 98 | return self 99 | 100 | def _sigmoid(self, z): 101 | """Private method that applies sigmoid function to input.""" 102 | return 1 / (1 + np.exp(-z)) 103 | 104 | @check_nan_columns 105 | def predict_proba(self, X): 106 | """Predict probabilities of class membership for logistic regression. 107 | 108 | The regression uses the pooled parameters from each of the imputed 109 | datasets to generate a set of single predictions. 
The pooled params 110 | come from multiply imputed datasets, but the predictions themselves 111 | follow the same rules as an ordinary logistic regression. Because this is 112 | logistic regression, the sigmoid function is applied to the result 113 | of the normal equation, giving us probabilities between 0 and 1 for 114 | each prediction. This method returns those probabilities. 115 | 116 | Args: 117 | X (pd.Dataframe): predictors to predict response 118 | 119 | Returns: 120 | np.array: prob of class membership for predicted observations. 121 | """ 122 | 123 | # run validation first 124 | X = self._predict_strategy_validator(self, X) 125 | 126 | # get the alpha and betas, then create linear equation for predictions 127 | alpha = self.statistics_["coefs"].values[0] 128 | betas = self.statistics_["coefs"].values[1:] 129 | return self._sigmoid(alpha + np.dot(X, betas)) 130 | 131 | @check_nan_columns 132 | def predict(self, X, threshold=0.5): 133 | """Make predictions using statistics generated from fit. 134 | 135 | The predict method calls on the predict_proba method, which returns 136 | the probability of class membership for each prediction. These 137 | probabilities range from 0 to 1. Therefore, anything below the set 138 | threshold is assigned to class 0, while anything above the threshold 139 | is assigned to class 1. The default threshold is 0.5, which indicates 140 | a balanced dataset. 141 | 142 | Args: 143 | X (pd.DataFrame): data to make predictions using pooled params. 144 | threshold (float, Optional): boundary for class membership. 145 | Default is 0.5. Values can range from 0 to 1. 146 | 147 | Returns: 148 | np.array: predictions. 149 | """ 150 | pred_probs = self.predict_proba(X) 151 | pred_array = (pred_probs >= threshold).astype(int) 152 | responses = self._response_categories.values 153 | return responses[pred_array] 154 | 155 | def summary(self): 156 | """Provide a summary for model parameters, variance, and metrics. 157 | 158 | The summary method brings together the statistics generated from fit 159 | as well as the variance ratios, if available. The statistics are far 160 | more valuable when using statsmodels than sklearn. 161 | 162 | Returns: 163 | pd.DataFrame: summary statistics 164 | """ 165 | 166 | # only possible once we've fit a model with statsmodels 167 | check_is_fitted(self, "statistics_") 168 | sdf = pd.DataFrame(self.statistics_) 169 | sdf.rename(columns={"lambda_": "lambda"}, inplace=True) 170 | return sdf 171 | -------------------------------------------------------------------------------- /autoimpute/analysis/metrics.py: -------------------------------------------------------------------------------- 1 | """This module devises metrics to compare estimates from analysis models.""" 2 | 3 | import numpy as np 4 | import pandas as pd 5 | 6 | def raw_bias(Q_bar, Q): 7 | """Calculate raw bias between estimated coefficients Q_bar and actual Q. 8 | 9 | Q_bar can be one estimate (scalar) or a vector of estimates. This equation 10 | subtracts the true Q from the expected Q_bar, element-wise. The result is the bias 11 | of each coefficient from its true value. 12 | 13 | Args: 14 | Q_bar (number, array): single estimate or array of estimates. 15 | Q (number, array): single truth or array of truths. 16 | 17 | Returns: 18 | scalar, array: element-wise difference between estimates and truths. 
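        Example (illustrative, with hypothetical numbers):
            raw_bias([0.55, 1.2], [0.5, 1.0]) returns approximately
            array([0.05, 0.2]), i.e. each estimate minus its true value.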
19 | 20 | Raises: 21 | ValueError: Shape mismatch 22 | ValueError: Q_bar and Q not the same length 23 | """ 24 | 25 | # handle errors first 26 | shape_err = "Q_bar & Q must be scalars or vectors of same length." 27 | if isinstance(Q_bar, pd.DataFrame): 28 | s = len(Q_bar.shape) 29 | if s != 1: 30 | raise ValueError(shape_err) 31 | 32 | if isinstance(Q, pd.DataFrame): 33 | s = len(Q.shape) 34 | if s != 1: 35 | raise ValueError(shape_err) 36 | 37 | if len(Q_bar) != len(Q): 38 | raise ValueError(shape_err) 39 | 40 | # convert any lists to ensure element-wise performed 41 | if isinstance(Q_bar, (tuple, list)): 42 | Q_bar = np.array(Q_bar) 43 | 44 | if isinstance(Q, (tuple, list)): 45 | Q = np.array(Q) 46 | 47 | # perform element-wise subtraction 48 | rb = Q_bar - Q 49 | return rb 50 | 51 | def percent_bias(Q_bar, Q): 52 | """Calculate percent bias between estimated coefficients Q_bar and actual Q. 53 | 54 | Q_bar can be one estimate (scalar) or a vector of estimates. This equation 55 | subtracts the true Q from the expected Q_bar, element-wise. The result is the bias 56 | of each coefficient from its true value. We then divide this number by 57 | Q itself, again in element-wise fashion, to produce % bias. 58 | 59 | Args: 60 | Q_bar (number, array): single estimate or array of estimates. 61 | Q (number, array): single truth or array of truths. 62 | 63 | Returns: 64 | scalar, array: element-wise percent bias of estimates relative to truths. 65 | 66 | Raises: 67 | ValueError: Shape mismatch 68 | ValueError: Q_bar and Q not the same length 69 | """ 70 | # calling this method will validate Q_bar and Q 71 | rb = raw_bias(Q_bar, Q) 72 | 73 | # convert Q if necessary. must re-perform operation 74 | if isinstance(Q, (tuple, list)): 75 | Q = np.array(Q) 76 | 77 | pct_bias = 100 * (abs(rb)/Q) 78 | return pct_bias 79 | -------------------------------------------------------------------------------- /autoimpute/imputations/__init__.py: -------------------------------------------------------------------------------- 1 | """Manage the imputations folder from the autoimpute package. 2 | 3 | This module handles imports from the imputations folder that should be 4 | accessible whenever someone imports autoimpute.imputations. The list below 5 | specifies the methods and classes that are currently available on import. 6 | 7 | This module handles `from autoimpute.imputations import *` with the __all__ 8 | variable below. This command imports the main public classes and methods 9 | from autoimpute.imputations. 10 | """ 11 | 12 | from .dataframe import BaseImputer 13 | from .mis_classifier import MissingnessClassifier 14 | from .dataframe import SingleImputer 15 | from .dataframe import MultipleImputer 16 | from .dataframe import MiceImputer 17 | from .deletion import listwise_delete 18 | 19 | __all__ = [ 20 | "BaseImputer", 21 | "MissingnessClassifier", 22 | "SingleImputer", 23 | "MultipleImputer", 24 | "MiceImputer", 25 | "listwise_delete" 26 | ] 27 | -------------------------------------------------------------------------------- /autoimpute/imputations/dataframe/__init__.py: -------------------------------------------------------------------------------- 1 | """Manage the dataframe imputations folder from the autoimpute package. 2 | 3 | This module handles imports from the dataframe imputations folder that should 4 | be accessible whenever someone imports autoimpute.imputations.dataframe. 5 | The list below specifies the methods and classes that are available on import. 
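A minimal usage sketch (assumes a hypothetical pandas DataFrame `df` with
missing values; the SingleImputer is shown, and the other imputers follow
the same sklearn-style API):

    from autoimpute.imputations.dataframe import SingleImputer
    si = SingleImputer()               # `predictive default` strategy
    df_imputed = si.fit_transform(df)  # one completed copy of df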
6 | 7 | This module handles `from autoimpute.imputations.dataframe import *` with the 8 | __all__ variable below. This command imports the public classes and methods 9 | from autoimpute.imputations.dataframe. 10 | """ 11 | 12 | from .base_imputer import BaseImputer 13 | from .single_imputer import SingleImputer 14 | from .multiple_imputer import MultipleImputer 15 | from .mice_imputer import MiceImputer 16 | 17 | __all__ = [ 18 | "BaseImputer", 19 | "SingleImputer", 20 | "MultipleImputer", 21 | "MiceImputer" 22 | ] 23 | -------------------------------------------------------------------------------- /autoimpute/imputations/dataframe/base_imputer.py: -------------------------------------------------------------------------------- 1 | """Module for BaseImputer - a base class for DataFrame imputers. 2 | 3 | This module contains the `BaseImputer`, which is used to abstract away 4 | functionality in both DataFrame imputers. The `BaseImputer` also holds the 5 | methods available for imputation analysis. 6 | """ 7 | 8 | import warnings 9 | from autoimpute.utils import check_strategy_allowed 10 | from autoimpute.imputations import method_names 11 | from ..series import DefaultUnivarImputer, DefaultPredictiveImputer 12 | from ..series import DefaultTimeSeriesImputer 13 | from ..series import MeanImputer, MedianImputer, ModeImputer 14 | from ..series import NormImputer, CategoricalImputer 15 | from ..series import RandomImputer, InterpolateImputer 16 | from ..series import LOCFImputer, NOCBImputer 17 | from ..series import LeastSquaresImputer, StochasticImputer 18 | from ..series import PMMImputer, LRDImputer 19 | from ..series import BinaryLogisticImputer, MultinomialLogisticImputer 20 | from ..series import BayesianLeastSquaresImputer 21 | from ..series import BayesianBinaryLogisticImputer 22 | from ..series import NormUnitVarianceImputer 23 | methods = method_names 24 | 25 | # pylint:disable=attribute-defined-outside-init 26 | # pylint:disable=too-many-arguments 27 | # pylint:disable=too-many-instance-attributes 28 | # pylint:disable=inconsistent-return-statements 29 | 30 | class BaseImputer: 31 | """Building blocks for more advanced DataFrame imputers. 32 | 33 | The BaseImputer is not a stand-alone class and thus serves no purpose 34 | other than as a parent to Imputers. Therefore, the BaseImputer should not 35 | be used directly unless creating an Imputer. That being said, all 36 | DataFrame Imputers should inherit from BaseImputer. It contains base 37 | functionality for any new DataFrame Imputer, and it holds the set of 38 | strategies that make up this imputation library. 39 | 40 | Attributes: 41 | univariate_strategies (dict): univariate imputation methods. 42 | | Key = imputation name; Value = function to perform imputation. 43 | | `univariate default` mean for numerical, mode for categorical. 44 | | `time default` interpolate for numerical, mode for categorical. 45 | | `mean` imputes missing values with the average of the series. 46 | | `median` imputes missing values with the median of the series. 47 | | `mode` imputes missing values with the mode of the series. 48 | | Method handles more than one mode (see ModeImputer for info). 49 | | `random` imputes random choice from set of series unique vals. 50 | | `norm` imputes series w/ random draws from normal distribution. 51 | | Mean and std calculated from observed values of the series. 52 | | `categorical` imputes series using random draws from pmf. 53 | | Proportions calculated from non-missing category instances. 
54 | | `interpolate` imputes series using chosen interpolation method. 55 | | Default is linear. See InterpolateImputer for more info. 56 | | `locf` imputes series carrying last observation moving forward. 57 | | `nocb` imputes series carrying next observation moving backward. 58 | | `normal unit variance` imputes using unit variance w/ norm. 59 | predictive_strategies (dict): predictive imputation methods. 60 | | Key = imputation name; Value = function to perform imputation. 61 | | `predictive default` pmm for numerical, logistic for categorical. 62 | | `least squares` predict missing values from linear regression. 63 | | `binary logistic` predict missing values with 2 classes. 64 | | `multinomial logistic` predict missing values with multiclass. 65 | | `stochastic` linear regression + random draw from norm w/ mse std. 66 | | `bayesian least squares` draw from the posterior predictive 67 | | distribution for each missing value, using OLS model. 68 | | `bayesian binary logistic` draw from the posterior predictive 69 | | distribution for each missing value, using logistic model. 70 | | `pmm` imputes series using predictive mean matching. PMM is a 71 | | semi-supervised method using bayesian & hot-deck imputation. 72 | | `lrd` imputes series using local residual draws. LRD is a 73 | | semi-supervised method using bayesian & hot-deck imputation. 74 | """ 75 | univariate_strategies = { 76 | methods.DEFAULT_UNIVAR: DefaultUnivarImputer, 77 | methods.DEFAULT_TIME: DefaultTimeSeriesImputer, 78 | methods.MEAN: MeanImputer, 79 | methods.MEDIAN: MedianImputer, 80 | methods.MODE: ModeImputer, 81 | methods.RANDOM: RandomImputer, 82 | methods.NORM: NormImputer, 83 | methods.CATEGORICAL: CategoricalImputer, 84 | methods.INTERPOLATE: InterpolateImputer, 85 | methods.LOCF: LOCFImputer, 86 | methods.NOCB: NOCBImputer, 87 | methods.NORM_UNIT_VARIANCE: NormUnitVarianceImputer, 88 | } 89 | 90 | predictive_strategies = { 91 | methods.DEFAULT_PRED: DefaultPredictiveImputer, 92 | methods.LS: LeastSquaresImputer, 93 | methods.STOCHASTIC: StochasticImputer, 94 | methods.BINARY_LOGISTIC: BinaryLogisticImputer, 95 | methods.MULTI_LOGISTIC: MultinomialLogisticImputer, 96 | methods.BAYESIAN_LS: BayesianLeastSquaresImputer, 97 | methods.BAYESIAN_BINARY_LOGISTIC: BayesianBinaryLogisticImputer, 98 | methods.PMM: PMMImputer, 99 | methods.LRD: LRDImputer 100 | } 101 | 102 | strategies = {**predictive_strategies, **univariate_strategies} 103 | 104 | visit_sequences = ( 105 | "default", 106 | "left-to-right" 107 | ) 108 | 109 | def __init__(self, strategy, imp_kwgs, visit): 110 | """Initialize the BaseImputer. 111 | 112 | Args: 113 | strategy (str, iter, dict; optional): strategies for imputation. 114 | Default value is str -> `predictive default`. 115 | If str, single strategy broadcast to all series in DataFrame. 116 | If iter, must provide 1 strategy per column. Each method w/in 117 | iterator applies to column with same index value in DataFrame. 118 | If dict, must provide key = column name, value = imputer. 119 | Dict the most flexible and PREFERRED way to create custom 120 | imputation strategies if not using the default. Dict does not 121 | require method for every column; just those specified as keys. 122 | imp_kwgs (dict, optional): keyword arguments for each imputer. 123 | Default is None, which means default imputer created to match 124 | specific strategy. imp_kwgs keys can be either columns or 125 | strategies. If strategies, each column given that strategy is 126 | instantiated with same arguments. 
127 | visit (str, None): order to visit columns for imputation. 128 | Default is `default`, which implements `left-to-right`. 129 | More strategies (random, monotone, etc.) TBD. 130 | """ 131 | self.strategy = strategy 132 | self.imp_kwgs = imp_kwgs 133 | self.visit = visit 134 | 135 | @property 136 | def strategy(self): 137 | """Property getter to return the value of the strategy property.""" 138 | return self._strategy 139 | 140 | @strategy.setter 141 | def strategy(self, s): 142 | """Validate the strategy property to ensure its type and value. 143 | 144 | Class instance only possible if strategy is proper type, as outlined 145 | in the init method. Passes supported strategies and user arg to 146 | helper method, which performs strategy checks. 147 | 148 | Args: 149 | s (str, iter, dict): Strategy passed as arg to class instance. 150 | 151 | Raises: 152 | ValueError: Strategies not valid (not in allowed strategies). 153 | TypeError: Strategy must be a string, tuple, list, or dict. 154 | Both errors raised through helper method `check_strategy_allowed`. 155 | """ 156 | strat_names = self.strategies.keys() 157 | self._strategy = check_strategy_allowed(strat_names, s) 158 | 159 | @property 160 | def imp_kwgs(self): 161 | """Property getter to return the value of imp_kwgs.""" 162 | return self._imp_kwgs 163 | 164 | @imp_kwgs.setter 165 | def imp_kwgs(self, kwgs): 166 | """Validate the imp_kwgs and set default properties. 167 | 168 | The BaseImputer validates the `imp_kwgs` argument. `imp_kwgs` contain 169 | optional keyword arguments for an imputer's strategies or columns. The 170 | argument is optional, and its default is None. 171 | 172 | Args: 173 | kwgs (dict, None): None or dictionary of keywords. 174 | 175 | Raises: 176 | ValueError: imp_kwgs not correctly specified as argument. 177 | """ 178 | if not isinstance(kwgs, (type(None), dict)): 179 | err = "imp_kwgs must be dict of args used to instantiate Imputer." 180 | raise ValueError(err) 181 | self._imp_kwgs = kwgs 182 | 183 | @property 184 | def visit(self): 185 | """Property getter to return the value of the visit property.""" 186 | return self._visit 187 | 188 | @visit.setter 189 | def visit(self, v): 190 | """Validate the visit property to ensure its type and value. 191 | 192 | Class instance only possible if visit is proper type, as outlined in 193 | the init method. Visit property must be one of valid sequences in the 194 | `visit_sequences` variable. 195 | 196 | Args: 197 | v (str): Visit sequence passed as arg to class instance. 198 | 199 | Raises: 200 | TypeError: visit sequence must be a string. 201 | ValueError: visit sequence not in `visit_sequences`. 202 | """ 203 | 204 | # deal with type first 205 | if not isinstance(v, str): 206 | err = "visit must be a string specifying visit sequence to use." 207 | raise TypeError(err) 208 | 209 | # deal with value next 210 | if v not in self.visit_sequences: 211 | err = f"visit not valid. 
Must be one of {self.visit_sequences}" 212 | raise ValueError(err) 213 | 214 | # otherwise, set property for visit 215 | self._visit = v 216 | 217 | def _fit_init_params(self, column, method, kwgs): 218 | """Private method to supply imputation model fit params if any.""" 219 | 220 | # first, handle easy case when no kwargs given 221 | if kwgs is None: 222 | final_params = kwgs 223 | 224 | # next, check if any kwargs for a given Imputer method type 225 | # then, override those parameters if specific column kwargs supplied 226 | if isinstance(kwgs, dict): 227 | initial_params = kwgs.get(method, None) 228 | final_params = kwgs.get(column, initial_params) 229 | 230 | # final params must be None or a dictionary of kwargs 231 | # this additional validation step is crucial to dictionary unpacking 232 | if not isinstance(final_params, (type(None), dict)): 233 | err = "Additional params must be dict of args used to init model." 234 | raise ValueError(err) 235 | return final_params 236 | 237 | def _check_if_single_dummy(self, col, X): 238 | """Private method to check if encoding results in single cat.""" 239 | cats = X.columns.tolist() 240 | if len(cats) == 1: 241 | c = cats[0] 242 | msg = f"{c} only category for feature {col}." 243 | cons = f"Consider removing {col} from dataset." 244 | warnings.warn(f"{msg} {cons}") 245 | -------------------------------------------------------------------------------- /autoimpute/imputations/dataframe/mice_imputer.py: -------------------------------------------------------------------------------- 1 | """This module performs a series of multiple imputations of missing features 2 | in a dataset. 3 | 4 | This module contains one class - the MiceImputer. Use this class to 5 | impute each Series within a DataFrame multiple times using an iteration of fits 6 | and transformations to reach a stable state of imputation each time. This 7 | extension of MultipleImputer makes the same imputation methods available as its 8 | parent class - both univariate and multivariate. Each method runs `n` times on 9 | its specified column. Once all `n` passes through the columns are complete, the 10 | MiceImputer returns the `n` imputed datasets. For each of these `n` 11 | imputations, the method (re)fits and applies imputation to the dataset `k` 12 | times. Typically `k` should be at least 3 to reach a stable state. 13 | Its functioning is based upon the R package MICE 14 | (https://cran.r-project.org/web/packages/mice/). 15 | """ 16 | 17 | from autoimpute.utils import check_nan_columns 18 | from .multiple_imputer import MultipleImputer 19 | 20 | 21 | class MiceImputer(MultipleImputer): 22 | """Techniques to impute Series with missing values multiple times using 23 | repeated fits and applications to reach a stable imputation. 24 | 25 | The MiceImputer class implements multiple imputation, i.e., a series 26 | or repetition of applications of imputation to reach a stable imputation, 27 | similar to the functioning of the R package MICE. 28 | It leverages the methods found in the BaseImputer. This imputer passes 29 | all the work for each imputation to the SingleImputer, but it controls 30 | the arguments each imputer receives. The args are flexible depending on 31 | what the user specifies for each imputation. 32 | 33 | Note that the Imputer allows for one imputation method per column only. 34 | Therefore, the behavior of `strategy` is the same as the SingleImputer, 35 | but the predictors are allowed to change for each imputation. 
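    A minimal usage sketch (assumes a hypothetical pandas DataFrame `df`
    with missing values):

        from autoimpute.imputations import MiceImputer
        mice = MiceImputer(k=3, n=5, return_list=True)
        imputations = mice.fit_transform(df)  # list of (m, imputed DataFrame)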
36 | """ 37 | 38 | def __init__(self, k=3, n=5, 39 | strategy="default predictive", predictors="all", 40 | imp_kwgs=None, seed=None, visit="default", 41 | return_list=False): 42 | """Create an instance of the SeriesImputer class. 43 | 44 | As with sklearn classes, all arguments take default values. Therefore, 45 | SeriesImputer() creates a valid class instance. The instance is 46 | used to set up an imputer and perform checks on arguments. 47 | 48 | Args: 49 | k (int, optional): number of repeated fits and transformations to 50 | apply to reach a stable impution. Default is 3. 51 | Value must be greater than or equal to 1. 52 | n (int, optional): number of imputations to perform. Default is 5. 53 | Value must be greater than or equal to 1. 54 | strategy (str, iter, dict; optional): strategy for single imputer. 55 | Default value is str --> `predictive default`. 56 | See BaseImputer for all available strategies. 57 | If str, single strategy broadcast to all series in DataFrame. 58 | If iter, must provide 1 strategy per column. Each method w/in 59 | iterator applies to column with same index value in DataFrame. 60 | If dict, must provide key = column name, value = imputer. 61 | Dict the most flexible and PREFERRED way to create custom 62 | imputation strategies if not using the default. Dict does not 63 | require method for every column; just those specified as keys. 64 | predictors (str, iter, dict, optional): defaults to all, i.e. 65 | use all predictors. If all, every column will be used for 66 | every class prediction. If a list, subset of columns used for 67 | all predictions. If a dict, specify which columns to use as 68 | predictors for each imputation. Columns not specified in dict 69 | but present in `strategy` receive `all` other cols as preds. 70 | imp_kwgs (dict, optional): keyword arguments for each imputer. 71 | Default is None, which means default imputer created to match 72 | specific strategy. imp_kwgs keys can be either columns or 73 | strategies. If strategies, each column given that strategy is 74 | instantiated with same arguments. When strategy is `default`, 75 | imp_kwgs is ignored. 76 | seed (int, optional): seed setting for reproducible results. 77 | Defualt is None. No validation, but values should be integer. 78 | return_list (bool, optional): return m as list or generator. 79 | Default is False. m imputations returned as generator. More 80 | memory efficient. return as list if return_list=True 81 | """ 82 | self.k = k 83 | MultipleImputer.__init__( 84 | self, 85 | n=n, 86 | strategy=strategy, 87 | predictors=predictors, 88 | imp_kwgs=imp_kwgs, 89 | seed=seed, 90 | visit=visit, 91 | return_list=return_list 92 | ) 93 | 94 | @property 95 | def k(self): 96 | """Property getter to return the value of the k property.""" 97 | return self._k 98 | 99 | @k.setter 100 | def k(self, k_): 101 | """Validate the k property to ensure it's Type and Value. 102 | 103 | Args: 104 | k_ (int): k passed as arg to class instance. 105 | 106 | Raises: 107 | TypeError: k must be an integer. 108 | ValueError: k must be greater than zero. 109 | """ 110 | 111 | # deal with type first 112 | if not isinstance(k_, int): 113 | err = """ 114 | k must be an int specifying number of repeated fits 115 | and transformations in a series of imputations. 116 | """ 117 | raise TypeError(err) 118 | 119 | # then check the value is greater than zero 120 | if k_ < 1: 121 | err = "k > 0. Cannot perform fewer than 1 imputation." 
122 | raise ValueError(err) 123 | 124 | # otherwise set the property value for k 125 | self._k = k_ 126 | 127 | def _iterate_imputation(self, X, imp): 128 | """Helper function that iterates self.k times to create a stable imputation 129 | by repeated application and retraining of the imputation models. 130 | Used by transform(). 131 | 132 | Args: 133 | X (pd.DataFrame): fit DataFrame to impute 134 | imp (Imputer): Imputer to apply to X 135 | """ 136 | X2 = imp.transform(X, imp_ixs=self.imputed_) 137 | for k in range(self.k - 1): 138 | imp.fit(X2, imp_ixs=self.imputed_) 139 | X2 = imp.transform(X, imp_ixs=self.imputed_, k=k) 140 | return X2 141 | 142 | @check_nan_columns 143 | def transform(self, X): 144 | """Impute each column within a DataFrame using fit imputation methods. 145 | 146 | The transform step performs the actual imputations. Given a dataset 147 | previously fit, `transform` imputes each column with its respective 148 | imputed values from fit (in the case of inductive) or performs new fit 149 | and transform in one sweep (in the case of transductive). 150 | The transformations and fits are repeatedly applied and refitted self.k 151 | times to reach a stable imputation. 152 | 153 | Args: 154 | X (pd.DataFrame): fit DataFrame to impute. 155 | 156 | Returns: 157 | X (pd.DataFrame): imputed in place or copy of original. 158 | 159 | Raises: 160 | ValueError: same columns must appear in fit and transform. 161 | """ 162 | 163 | # call transform strategy validator before applying transform 164 | self._transform_strategy_validator() 165 | 166 | # make it easy to access the location of the imputed values 167 | self.imputed_ = {} 168 | for column in self._strats.keys(): 169 | imp_ix = X[column][X[column].isnull()].index 170 | self.imputed_[column] = imp_ix.tolist() 171 | 172 | # right now, return a generator by default 173 | # sequential only for now 174 | imputed = ((i[0], self._iterate_imputation(X, i[1])) 175 | for i in self.statistics_.items()) 176 | 177 | if self.return_list: 178 | imputed = list(imputed) 179 | return imputed 180 | -------------------------------------------------------------------------------- /autoimpute/imputations/dataframe/multiple_imputer.py: -------------------------------------------------------------------------------- 1 | """This module performs multiple imputations of missing features in a dataset. 2 | 3 | This module contains one class - the MultipleImputer. Use this class to 4 | impute each Series within a DataFrame multiple times. This class makes 5 | numerous imputation methods available - both univariate and multivariate. Each 6 | method runs `n` times on its specified column. Once all `n` passes through the 7 | columns are complete, the MultipleImputer returns the `n` imputed datasets. 
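A minimal usage sketch (assumes a hypothetical pandas DataFrame `df` with
missing values). Each returned element pairs an imputation number with one
completed copy of the data:

    from autoimpute.imputations import MultipleImputer
    mi = MultipleImputer(n=5, return_list=True)
    imputations = mi.fit_transform(df)  # roughly [(1, df_1), ..., (5, df_5)]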
8 | """ 9 | 10 | from sklearn.base import BaseEstimator, TransformerMixin 11 | from sklearn.utils.validation import check_is_fitted 12 | from autoimpute.imputations import method_names 13 | from autoimpute.utils import check_nan_columns, check_predictors_fit 14 | from autoimpute.utils import check_strategy_fit 15 | from .base_imputer import BaseImputer 16 | from .single_imputer import SingleImputer 17 | methods = method_names 18 | 19 | # pylint:disable=attribute-defined-outside-init 20 | # pylint:disable=protected-access 21 | # pylint:disable=too-many-arguments 22 | # pylint:disable=unused-argument 23 | # pylint:disable=too-many-instance-attributes 24 | # pylint:disable=arguments-differ 25 | 26 | 27 | class MultipleImputer(BaseImputer, BaseEstimator, TransformerMixin): 28 | """Techniques to impute Series with missing values multiple times. 29 | 30 | The MultipleImputer class applies imputation multiple times. It leverages the 31 | methods found in the BaseImputer. This imputer passes all the work for 32 | each imputation to the SingleImputer, but it controls the arguments 33 | each imputer receives. The args are flexible depending on what the user 34 | specifies for each imputation. 35 | 36 | Note that the Imputer allows for one imputation method per column only. 37 | Therefore, the behavior of `strategy` is the same as the SingleImputer, 38 | but the predictors are allowed to change for each imputation. 39 | """ 40 | 41 | def __init__(self, n=5, strategy="default predictive", predictors="all", 42 | imp_kwgs=None, seed=None, visit="default", return_list=False): 43 | """Create an instance of the MultipleImputer class. 44 | 45 | As with sklearn classes, all arguments take default values. Therefore, 46 | MultipleImputer() creates a valid class instance. The instance is 47 | used to set up an imputer and perform checks on arguments. 48 | 49 | Args: 50 | n (int, optional): number of imputations to perform. Default is 5. 51 | Value must be greater than or equal to 1. 52 | strategy (str, iter, dict; optional): strategy for single imputer. 53 | Default value is str --> `predictive default`. 54 | See BaseImputer for all available strategies. 55 | If str, single strategy broadcast to all series in DataFrame. 56 | If iter, must provide 1 strategy per column. Each method w/in 57 | iterator applies to column with same index value in DataFrame. 58 | If dict, must provide key = column name, value = imputer. 59 | Dict the most flexible and PREFERRED way to create custom 60 | imputation strategies if not using the default. Dict does not 61 | require method for every column; just those specified as keys. 62 | predictors (str, iter, dict, optional): defaults to all, i.e. 63 | use all predictors. If all, every column will be used for 64 | every class prediction. If a list, subset of columns used for 65 | all predictions. If a dict, specify which columns to use as 66 | predictors for each imputation. Columns not specified in dict 67 | but present in `strategy` receive `all` other cols as preds. 68 | imp_kwgs (dict, optional): keyword arguments for each imputer. 69 | Default is None, which means default imputer created to match 70 | specific strategy. imp_kwgs keys can be either columns or 71 | strategies. If strategies, each column given that strategy is 72 | instantiated with same arguments. When strategy is `default`, 73 | imp_kwgs is ignored. 74 | seed (int, optional): seed setting for reproducible results. 75 | Defualt is None. No validation, but values should be integer. 
76 | return_list (bool, optional): return m as list or generator. 77 | Default is False. m imputations returned as generator. More 78 | memory efficient. Return as list if return_list=True. 79 | """ 80 | BaseImputer.__init__( 81 | self, 82 | strategy=strategy, 83 | imp_kwgs=imp_kwgs, 84 | visit=visit 85 | ) 86 | self.n = n 87 | self.predictors = predictors 88 | self.seed = seed 89 | self.return_list = return_list 90 | self.copy = True 91 | 92 | @property 93 | def n(self): 94 | """Property getter to return the value of the n property.""" 95 | return self._n 96 | 97 | @n.setter 98 | def n(self, n_): 99 | """Validate the n property to ensure its type and value. 100 | 101 | Args: 102 | n_ (int): n passed as arg to class instance. 103 | 104 | Raises: 105 | TypeError: n must be an integer. 106 | ValueError: n must be greater than zero. 107 | """ 108 | 109 | # deal with type first 110 | if not isinstance(n_, int): 111 | err = "n must be an integer specifying number of imputations." 112 | raise TypeError(err) 113 | 114 | # then check the value is greater than zero 115 | if n_ < 1: 116 | err = "n > 0. Cannot perform fewer than 1 imputation." 117 | raise ValueError(err) 118 | 119 | # otherwise set the property value for n 120 | self._n = n_ 121 | 122 | def _fit_strategy_validator(self, X): 123 | """Internal helper method to validate strategies appropriate for fit. 124 | 125 | Checks whether strategies match with type of column they are applied 126 | to. If not, error is raised through `check_strategy_fit` method. 127 | """ 128 | 129 | # remove nan columns and store colnames 130 | cols = X.columns.tolist() 131 | self._strats = check_strategy_fit(self.strategy, cols) 132 | 133 | # if predictors is a list... 134 | if isinstance(self.predictors, (tuple, list)): 135 | # and it is not the same list of predictors for every iteration... 136 | if not all([isinstance(x, str) for x in self.predictors]): 137 | len_pred = len(self.predictors) 138 | # raise error if not the correct length 139 | if len_pred != self.n: 140 | err = f"Predictors has {len_pred} items. Need {self.n}" 141 | raise ValueError(err) 142 | # check predictors for each in list 143 | self._preds = [ 144 | check_predictors_fit(p, cols) 145 | for p in self.predictors 146 | ] 147 | # if it is a list, but not a list of objects... 148 | else: 149 | # broadcast predictors 150 | self._preds = check_predictors_fit(self.predictors, cols) 151 | self._preds = [self._preds]*self.n 152 | # if string or dictionary... 153 | else: 154 | # broadcast predictors 155 | self._preds = check_predictors_fit(self.predictors, cols) 156 | self._preds = [self._preds]*self.n 157 | 158 | def _transform_strategy_validator(self): 159 | """Private method to prep for prediction.""" 160 | check_is_fitted(self, "statistics_") 161 | 162 | @check_nan_columns 163 | def fit(self, X, y=None): 164 | """Fit imputation methods to each column within a DataFrame. 165 | 166 | The fit method calculates the `statistics` necessary to later 167 | transform a dataset (i.e. perform actual imputations). Inductive 168 | methods calculate statistics on the fit data, then impute new missing 169 | data with that value. All currently supported methods are inductive. 170 | 171 | Args: 172 | X (pd.DataFrame): pandas DataFrame on which imputer is fit. 173 | 174 | Returns: 175 | self: instance of the MultipleImputer class.
176 | """ 177 | 178 | # first, run the fit strategy validator, then create statistics 179 | self._fit_strategy_validator(X) 180 | self.statistics_ = {} 181 | 182 | # deal with potentially setting seed for each individual predictor 183 | if self.seed is not None: 184 | self._seeds = [self.seed + i for i in range(1, self.n*13, 13)] 185 | else: 186 | self._seeds = [None]*self.n 187 | 188 | # create PredictiveImputers. sequentially only right now 189 | for i in range(1, self.n+1): 190 | imputer = SingleImputer( 191 | strategy=self.strategy, 192 | predictors=self._preds[i-1], 193 | imp_kwgs=self.imp_kwgs, 194 | copy=self.copy, 195 | seed=self._seeds[i-1], 196 | visit=self.visit 197 | ) 198 | imputer.fit(X) 199 | self.statistics_[i] = imputer 200 | 201 | return self 202 | 203 | @check_nan_columns 204 | def transform(self, X, **trans_kwargs): 205 | """Impute each column within a DataFrame using fit imputation methods. 206 | 207 | The transform step performs the actual imputations. Given a dataset 208 | previously fit, `transform` imputes each column with it's respective 209 | imputed values from fit (in the case of inductive) or performs new fit 210 | and transform in one sweep (in the case of transductive). 211 | 212 | Args: 213 | X (pd.DataFrame): fit DataFrame to impute. 214 | **trans_kwargs: dict, optional args for bayesian. 215 | 216 | Returns: 217 | X (pd.DataFrame): imputed in place or copy of original. 218 | 219 | Raises: 220 | ValueError: same columns must appear in fit and transform. 221 | """ 222 | 223 | # call transform strategy validator before applying transform 224 | self._transform_strategy_validator() 225 | 226 | # make it easy to access the location of the imputed values 227 | self.imputed_ = {} 228 | for column in self._strats.keys(): 229 | imp_ix = X[column][X[column].isnull()].index 230 | self.imputed_[column] = imp_ix.tolist() 231 | 232 | # right now, return a generator by default 233 | # sequential only for now 234 | imputed = ((i[0], i[1].transform(X, **trans_kwargs)) 235 | for i in self.statistics_.items()) 236 | if self.return_list: 237 | imputed = list(imputed) 238 | return imputed 239 | 240 | def fit_transform(self, X, y=None, **trans_kwargs): 241 | """Convenience method to fit then transform the same dataset.""" 242 | return self.fit(X, y).transform(X, **trans_kwargs) 243 | -------------------------------------------------------------------------------- /autoimpute/imputations/deletion.py: -------------------------------------------------------------------------------- 1 | """Deletion strategies to handle the missing data in pandas DataFrame.""" 2 | 3 | from autoimpute.utils import check_nan_columns 4 | 5 | @check_nan_columns 6 | def listwise_delete(data, inplace=False, verbose=False): 7 | """Delete all rows from a DataFrame where any missing values exist. 8 | 9 | Deletion is one way to handle missing values. This method removes any 10 | records that have a missing value in any of the features. This package 11 | focuses on imputation, not deletion. That being said, listwise deletion 12 | is a necessary component of any imputation package, as its the default 13 | method most people (and software) use to handle missing data. 14 | 15 | Args: 16 | data (pd.DataFrame): DataFrame used to delete missing rows. 17 | inplace (boolean, optional): perform operation inplace. 18 | Defaults to False. 19 | verbose (boolean, optional): print information to console. 20 | Defaults to False. 21 | 22 | Returns: 23 | pd.DataFrame: rows with missing values removed. 
-------------------------------------------------------------------------------- /autoimpute/imputations/deletion.py: -------------------------------------------------------------------------------- 1 | """Deletion strategies to handle the missing data in pandas DataFrame.""" 2 | 3 | from autoimpute.utils import check_nan_columns 4 | 5 | @check_nan_columns 6 | def listwise_delete(data, inplace=False, verbose=False): 7 | """Delete all rows from a DataFrame where any missing values exist. 8 | 9 | Deletion is one way to handle missing values. This method removes any 10 | records that have a missing value in any of the features. This package 11 | focuses on imputation, not deletion. That being said, listwise deletion 12 | is a necessary component of any imputation package, as it's the default 13 | method most people (and software) use to handle missing data. 14 | 15 | Args: 16 | data (pd.DataFrame): DataFrame used to delete missing rows. 17 | inplace (boolean, optional): perform operation inplace. 18 | Defaults to False. 19 | verbose (boolean, optional): print information to console. 20 | Defaults to False. 21 | 22 | Returns: 23 | pd.DataFrame: rows with missing values removed. 24 | 25 | Raises: 26 | ValueError: columns with all data missing. Raised through decorator. 27 | """ 28 | num_records_before = len(data.index) 29 | if inplace: 30 | data.dropna(inplace=True) 31 | else: 32 | data = data.dropna(inplace=False) 33 | num_records_after = len(data.index) 34 | if verbose: 35 | print(f"Number of records before delete: {num_records_before}") 36 | print(f"Number of records after delete: {num_records_after}") 37 | return data 38 | -------------------------------------------------------------------------------- /autoimpute/imputations/errors.py: -------------------------------------------------------------------------------- 1 | """Private methods for handling errors throughout imputation analysis.""" 2 | 3 | from pandas.api.types import is_string_dtype 4 | from pandas.api.types import is_numeric_dtype 5 | 6 | # ERROR HANDLING 7 | # -------------- 8 | 9 | def _not_num_series(m, s): 10 | """Private method to detect Series that are not numerical.""" 11 | if not is_numeric_dtype(s): 12 | t = s.dtype 13 | err = f"{m} not appropriate for Series {s.name} of type {t}." 14 | raise TypeError(err) 15 | 16 | def _not_num_matrix(m, mat): 17 | """Private method to detect columns of Matrix that are not numerical.""" 18 | try: 19 | for each_col in mat: 20 | c = mat[each_col] 21 | _not_num_series(m, c) 22 | except TypeError as te: 23 | err = f"{m} not appropriate for Matrix with non-numerical columns." 24 | raise TypeError(err) from te 25 | 26 | def _not_cat_series(m, s): 27 | """Private method to detect Series that are not categorical.""" 28 | t = s.dtype 29 | if not is_string_dtype(s) and t != "object": 30 | err = f"{m} not appropriate for Series {s.name} of type {t}." 31 | raise TypeError(err) 32 | 33 | def _not_cat_matrix(m, mat): 34 | """Private method to detect columns of Matrix that are not categorical.""" 35 | try: 36 | for each_col in mat: 37 | c = mat[each_col] 38 | _not_cat_series(m, c) 39 | except TypeError as te: 40 | err = f"{m} not appropriate for Matrix with non-categorical columns." 41 | raise TypeError(err) from te 42 | -------------------------------------------------------------------------------- /autoimpute/imputations/helpers.py: -------------------------------------------------------------------------------- 1 | """Private helper methods for the imputations folder.""" 2 | 3 | import logging 4 | import numpy as np 5 | import pandas as pd 6 | from autoimpute.imputations.deletion import listwise_delete 7 | 8 | def _get_observed(predictors, series, verbose=False): 9 | """Private method to test datasets and get observed data.""" 10 | conc = pd.concat([predictors, series], axis=1) 11 | 12 | # perform listwise delete on predictors and series 13 | # resulting data serves as the `observed` data for fit modeling 14 | predictors = listwise_delete(conc, verbose=verbose) 15 | series = predictors.pop(series.name) 16 | return predictors, series 17 | 18 | def _neighbors(x, n, df, choose):  # choose among observed y of n nearest preds 19 | al = len(df.index) 20 | if n > al: 21 | err = "# neighbors greater than # predictions. Reduce neighbor count." 22 | raise ValueError(err) 23 | indexarr = np.argpartition(abs(df["y_pred"] - x), n)[:n] 24 | neighbs = df.loc[indexarr, "y"].values 25 | return choose(neighbs) 26 | 27 | def _local_residuals(x, n, df, choose):  # neighbors' y shifted by local residuals 28 | al = len(df.index) 29 | if n > al: 30 | err = "# neighbors greater than # predictions. Reduce neighbor count."
31 | raise ValueError(err) 32 | indexarr = np.argpartition(abs(df["y_pred"] - x), n)[:n] 33 | neighbs = df.loc[indexarr, "y"].values 34 | distances = df.loc[indexarr, "y_pred"].values - x 35 | resids = neighbs + distances 36 | return choose(resids) 37 | 38 | def _pymc_logger(verbose=False): 39 | """Private method to handle pymc logging.""" 40 | progress = 1 41 | if not verbose: 42 | progress = 0 43 | logger = logging.getLogger('pymc') 44 | logger.setLevel(logging.ERROR) 45 | return progress 46 | -------------------------------------------------------------------------------- /autoimpute/imputations/method_names.py: -------------------------------------------------------------------------------- 1 | """Module contains names of methods used in Imputer Classes.""" 2 | 3 | DEFAULT = "default" 4 | DEFAULT_PRED = "default predictive" 5 | DEFAULT_UNIVAR = "default univariate" 6 | DEFAULT_TIME = "default time" 7 | MEAN = "mean" 8 | MEDIAN = "median" 9 | MODE = "mode" 10 | RANDOM = "random" 11 | NORM = "norm" 12 | CATEGORICAL = "categorical" 13 | INTERPOLATE = "interpolate" 14 | LOCF = "locf" 15 | NOCB = "nocb" 16 | LS = "least squares" 17 | BINARY_LOGISTIC = "binary logistic" 18 | MULTI_LOGISTIC = "multinomial logistic" 19 | STOCHASTIC = "stochastic" 20 | BAYESIAN_LS = "bayesian least squares" 21 | BAYESIAN_BINARY_LOGISTIC = "bayesian binary logistic" 22 | PMM = "pmm" 23 | LRD = "lrd" 24 | NONE = "none" 25 | NORM_UNIT_VARIANCE = 'normal unit variance' 26 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/__init__.py: -------------------------------------------------------------------------------- 1 | """Manage the series imputations folder from the autoimpute package. 2 | 3 | This module handles imports from the series imputations folder that should be 4 | accessible whenever someone imports autoimpute.imputations.series. Although 5 | these imputers are stand-alone classes, their direct use is discouraged. 6 | More robust imputers from the dataframe folder delegate work to these imputers 7 | whenever their respective strategies are requested. 8 | 9 | This module handles `from autoimpute.imputations.series import *` with the 10 | __all__ variable below. This command imports the main public classes and 11 | methods from autoimpute.imputations.series. 
12 | """ 13 | 14 | from .default import DefaultUnivarImputer 15 | from .default import DefaultTimeSeriesImputer 16 | from .default import DefaultPredictiveImputer 17 | from .mean import MeanImputer 18 | from .median import MedianImputer 19 | from .mode import ModeImputer 20 | from .random import RandomImputer 21 | from .norm import NormImputer 22 | from .categorical import CategoricalImputer 23 | from .ffill import NOCBImputer, LOCFImputer 24 | from .interpolation import InterpolateImputer 25 | from .linear_regression import LeastSquaresImputer, StochasticImputer 26 | from .logistic_regression import BinaryLogisticImputer 27 | from .logistic_regression import MultinomialLogisticImputer 28 | from .bayesian_regression import BayesianLeastSquaresImputer 29 | from .bayesian_regression import BayesianBinaryLogisticImputer 30 | from .pmm import PMMImputer 31 | from .lrd import LRDImputer 32 | from .norm_unit_variance import NormUnitVarianceImputer 33 | 34 | __all__ = [ 35 | "DefaultUnivarImputer", 36 | "DefaultTimeSeriesImputer", 37 | "DefaultPredictiveImputer", 38 | "MeanImputer", 39 | "MedianImputer", 40 | "ModeImputer", 41 | "RandomImputer", 42 | "NormImputer", 43 | "CategoricalImputer", 44 | "NOCBImputer", 45 | "LOCFImputer", 46 | "InterpolateImputer", 47 | "LeastSquaresImputer", 48 | "StochasticImputer", 49 | "BinaryLogisticImputer", 50 | "MultinomialLogisticImputer", 51 | "BayesianLeastSquaresImputer", 52 | "BayesianBinaryLogisticImputer", 53 | "PMMImputer", 54 | "LRDImputer", 55 | "NormUnitVarianceImputer", 56 | ] 57 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/base.py: -------------------------------------------------------------------------------- 1 | """This module implements the abstract base class used for series-imputers. 2 | 3 | The base class is quite simple but important. It specifies the contract for 4 | how series-imputers should behave. All series-imputers should inherit from 5 | this base class to be considered valid Imputers. 6 | """ 7 | 8 | import abc 9 | from sklearn.base import BaseEstimator 10 | 11 | class ISeriesImputer(BaseEstimator, metaclass=abc.ABCMeta): 12 | """ISeriesImputer implements the abstract base class for series-imputers. 13 | 14 | All series imputers should have a fit, impute, and fit_impute method to be 15 | considered valid to build imputation models. The ISeriesImputer is the 16 | contract series-imputers must adhere to.""" 17 | 18 | @abc.abstractmethod 19 | def fit(self, X, y): 20 | """Contract to fit an imputation model. 21 | 22 | Args: 23 | X (pd.Series, pd.Dataframe): data used to build imputation 24 | model. pd.Series if univariate, pd.DataFrame if multivariate. 25 | y (pd.Series, None): column to impute. None if univariate, 26 | pd.Series if multivariate. 27 | 28 | Returns: 29 | self: instance of a class 30 | """ 31 | 32 | @abc.abstractmethod 33 | def impute(self, X): 34 | """Contract to impute using a fit imputation model. 35 | 36 | Args: 37 | X (pd.Series, pd.DataFrame): data to use to generate imputations. 38 | pd.Series if univariate, pd.DataFrame if multivariate. 39 | 40 | Returns: 41 | imputations, scalar or array depending on imputation model. 42 | """ 43 | 44 | @abc.abstractmethod 45 | def fit_impute(self, X, y): 46 | """Convenience method that implements fit & impute in one go. 47 | 48 | Args: 49 | X (pd.Series, pd.Dataframe): data used to build imputation 50 | model. pd.Series if univariate, pd.DataFrame if multivariate. 51 | y (pd.Series, None): column to impute. 
-------------------------------------------------------------------------------- /autoimpute/imputations/series/base.py: -------------------------------------------------------------------------------- 1 | """This module implements the abstract base class used for series-imputers. 2 | 3 | The base class is quite simple but important. It specifies the contract for 4 | how series-imputers should behave. All series-imputers should inherit from 5 | this base class to be considered valid Imputers. 6 | """ 7 | 8 | import abc 9 | from sklearn.base import BaseEstimator 10 | 11 | class ISeriesImputer(BaseEstimator, metaclass=abc.ABCMeta): 12 | """ISeriesImputer implements the abstract base class for series-imputers. 13 | 14 | All series imputers should have a fit, impute, and fit_impute method to be 15 | considered valid to build imputation models. The ISeriesImputer is the 16 | contract series-imputers must adhere to.""" 17 | 18 | @abc.abstractmethod 19 | def fit(self, X, y): 20 | """Contract to fit an imputation model. 21 | 22 | Args: 23 | X (pd.Series, pd.DataFrame): data used to build imputation 24 | model. pd.Series if univariate, pd.DataFrame if multivariate. 25 | y (pd.Series, None): column to impute. None if univariate, 26 | pd.Series if multivariate. 27 | 28 | Returns: 29 | self: instance of a class 30 | """ 31 | 32 | @abc.abstractmethod 33 | def impute(self, X): 34 | """Contract to impute using a fit imputation model. 35 | 36 | Args: 37 | X (pd.Series, pd.DataFrame): data to use to generate imputations. 38 | pd.Series if univariate, pd.DataFrame if multivariate. 39 | 40 | Returns: 41 | imputations, scalar or array depending on imputation model. 42 | """ 43 | 44 | @abc.abstractmethod 45 | def fit_impute(self, X, y): 46 | """Convenience method that implements fit & impute in one go. 47 | 48 | Args: 49 | X (pd.Series, pd.DataFrame): data used to build imputation 50 | model. pd.Series if univariate, pd.DataFrame if multivariate. 51 | y (pd.Series, None): column to impute. None if univariate, 52 | pd.Series if multivariate. 53 | 54 | Returns: 55 | imputations, scalar or array depending on imputation model. 56 | """ 57 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/categorical.py: -------------------------------------------------------------------------------- 1 | """This module implements categorical imputation via the CategoricalImputer. 2 | 3 | The CategoricalImputer determines the proportions of discrete features within 4 | observed data. It then samples this distribution to impute missing values. 5 | Dataframe imputers utilize this class when its strategy is requested. Use 6 | SingleImputer or MultipleImputer with strategy = `categorical` to broadcast 7 | the strategy across all the columns in a dataframe, or specify this strategy 8 | for a given column. 9 | """ 10 | 11 | import numpy as np 12 | from sklearn.utils.validation import check_is_fitted 13 | from autoimpute.imputations import method_names 14 | from autoimpute.imputations.errors import _not_cat_series 15 | from .base import ISeriesImputer 16 | methods = method_names 17 | # pylint:disable=attribute-defined-outside-init 18 | # pylint:disable=unnecessary-pass 19 | 20 | class CategoricalImputer(ISeriesImputer): 21 | """Impute missing data w/ draw from dataset's categorical distribution. 22 | 23 | The categorical imputer computes the proportion of observed values for 24 | each category within a discrete dataset. The imputer then samples the 25 | distribution to impute missing values with a respective random draw. The 26 | imputer can be used directly, but such behavior is discouraged. 27 | CategoricalImputer does not have the flexibility / robustness of dataframe 28 | imputers, nor is its behavior identical. Preferred use is 29 | MultipleImputer(strategy="categorical"). 30 | """ 31 | # class variables 32 | strategy = methods.CATEGORICAL 33 | 34 | def __init__(self): 35 | """Create an instance of the CategoricalImputer class.""" 36 | pass 37 | 38 | def fit(self, X, y=None): 39 | """Fit the Imputer to the dataset and calculate proportions. 40 | 41 | Args: 42 | X (pd.Series): Dataset to fit the imputer. 43 | y (None): ignored, None to meet requirements of base class 44 | 45 | Returns: 46 | self. Instance of the class. 47 | """ 48 | _not_cat_series(self.strategy, X) 49 | # get proportions of discrete observed values to sample from 50 | proportions = X.value_counts() / np.sum(~X.isnull()) 51 | self.statistics_ = {"param": proportions, "strategy": self.strategy} 52 | return self 53 | 54 | def impute(self, X): 55 | """Perform imputations using the statistics generated from fit. 56 | 57 | The impute method handles the actual imputation. It 58 | constructs a categorical distribution for each feature using the 59 | proportions of observed values from fit. It then imputes missing 60 | values with a random draw from the respective distribution. 61 | 62 | Args: 63 | X (pd.Series): Dataset to impute missing data from fit. 64 | 65 | Returns: 66 | np.array -- imputed dataset.
67 | """ 68 | # check if fitted and identify location of missingness 69 | check_is_fitted(self, "statistics_") 70 | _not_cat_series(self.strategy, X) 71 | ind = X[X.isnull()].index 72 | 73 | # get observed weighted by count of total and sample 74 | param = self.statistics_["param"] 75 | cats = param.index 76 | proportions = param.tolist() 77 | imp = np.random.choice(cats, size=len(ind), p=proportions) 78 | return imp 79 | 80 | def fit_impute(self, X, y=None): 81 | """Convenience method to perform fit and imputation in one go.""" 82 | return self.fit(X, y).impute(X) 83 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/ffill.py: -------------------------------------------------------------------------------- 1 | """This module implements forward & backward imputation via two Imputers. 2 | 3 | The LOCFImputer carries the last observation forward (locf) to impute missing 4 | data in a time series. NOCBImputer carries the next observation backward (nocb) 5 | to impute missing data in a time series. Dataframe imputers utilize these 6 | classes when each's strategy is requested. Use SingleImputer or MultipleImputer 7 | with strategy = `locf` or `nocb` to broadcast either strategy across all the 8 | columns in a dataframe, or specify either strategy for a given column. 9 | """ 10 | 11 | import pandas as pd 12 | from sklearn.utils.validation import check_is_fitted 13 | from autoimpute.imputations import method_names 14 | from .base import ISeriesImputer 15 | methods = method_names 16 | # pylint:disable=attribute-defined-outside-init 17 | # pylint:disable=unnecessary-pass 18 | # pylint:disable=unused-argument 19 | 20 | class LOCFImputer(ISeriesImputer): 21 | """Impute missing values by carrying the last observation forward. 22 | 23 | LOCFImputer carries the last observation forward to impute missing data. 24 | The imputer can be used directly, but such behavior is discouraged. 25 | LOCFImputer does not have the flexibility / robustness of dataframe 26 | imputers, nor is its behavior identical. Preferred use is 27 | MultipleImputer(strategy="locf"). 28 | """ 29 | # class variables 30 | strategy = methods.LOCF 31 | 32 | def __init__(self, start=None): 33 | """Create an instance of the LOCFImputer class. 34 | 35 | Args: 36 | start (any, optional): can be any value to impute first if first 37 | is missing. Default is None, which ends up taking first 38 | observed value found. Can also use "mean" to start with mean 39 | of the series. 40 | 41 | Returns: 42 | self. Instance of class. 43 | """ 44 | self.start = start 45 | 46 | def _handle_start(self, v, X): 47 | "private method to handle start values." 48 | if v is None: 49 | v = X.loc[X.first_valid_index()] 50 | if v == "mean": 51 | v = X.mean() 52 | return v 53 | 54 | def fit(self, X, y=None): 55 | """Fit the Imputer to the dataset. 56 | 57 | Args: 58 | X (pd.Series): Dataset to fit the imputer. 59 | y (None): ignored, None to meet requirements of base class 60 | 61 | 62 | Returns: 63 | self. Instance of the class. 64 | """ 65 | self.statistics_ = {"param": None, "strategy": self.strategy} 66 | return self 67 | 68 | def impute(self, X): 69 | """Perform imputations using the statistics generated from fit. 70 | 71 | The impute method handles the actual imputation. Missing values 72 | in a given dataset are replaced with the last observation carried 73 | forward. 74 | 75 | Args: 76 | X (pd.Series): Dataset to impute missing data from fit. 77 | 78 | Returns: 79 | np.array -- imputed dataset. 
80 | """ 81 | # check if fitted then impute with mean if first value 82 | # or impute with observation carried forward otherwise 83 | check_is_fitted(self, "statistics_") 84 | 85 | # handle start... 86 | if pd.isnull(X.iloc[0]): 87 | ix = X.head(1).index[0] 88 | X.fillna( 89 | {ix: self._handle_start(self.start, X)}, inplace=True 90 | ) 91 | return X.fillna(method="ffill", inplace=False) 92 | 93 | def fit_impute(self, X, y=None): 94 | """Convenience method to perform fit and imputation in one go.""" 95 | return self.fit(X, y).impute(X) 96 | 97 | class NOCBImputer(ISeriesImputer): 98 | """Impute missing data by carrying the next observation backward. 99 | 100 | NOCBImputer carries the next observation backward to impute missing data. 101 | The imputer can be used directly, but such behavior is discouraged. 102 | NOCBImputer does not have the flexibility / robustness of dataframe 103 | imputers, nor is its behavior identical. Preferred use is 104 | MultipleImputer(strategy="nocb"). 105 | """ 106 | # class variables 107 | strategy = methods.NOCB 108 | 109 | def __init__(self, end=None): 110 | """Create an instance of the NOCBImputer class. 111 | 112 | Args: 113 | end (any, optional): can be any value to impute end if end 114 | is missing. Default is None, which ends up taking last 115 | observed value found. Can also use "mean" to end with 116 | mean of the series. 117 | 118 | Returns: 119 | self. Instance of class. 120 | """ 121 | self.end = end 122 | 123 | def _handle_end(self, v, X): 124 | "private method to handle end values." 125 | if v is None: 126 | v = X.loc[X.last_valid_index()] 127 | if v == "mean": 128 | v = X.mean() 129 | return v 130 | 131 | def fit(self, X, y=None): 132 | """Fit the Imputer to the dataset and calculate the mean. 133 | 134 | Args: 135 | X (pd.Series): Dataset to fit the imputer 136 | y (None): ignored, None to meet requirements of base class 137 | 138 | 139 | Returns: 140 | self. Instance of the class. 141 | """ 142 | self.statistics_ = {"param": None, "strategy": self.strategy} 143 | return self 144 | 145 | def impute(self, X): 146 | """Perform imputations using the statistics generated from fit. 147 | 148 | The impute method handles the actual imputation. Missing values 149 | in a given dataset are replaced with the next observation carried 150 | backward. 151 | 152 | Args: 153 | X (pd.Series): Dataset to impute missing data from fit. 154 | 155 | Returns: 156 | np.array -- imputed dataset. 157 | """ 158 | # check if fitted then impute with mean if first value 159 | # or impute with observation carried backward otherwise 160 | check_is_fitted(self, "statistics_") 161 | 162 | # handle end... 163 | if pd.isnull(X.iloc[-1]): 164 | ix = X.tail(1).index[0] 165 | X.fillna( 166 | {ix: self._handle_end(self.end, X)}, inplace=True 167 | ) 168 | return X.fillna(method="bfill", inplace=False) 169 | 170 | def fit_impute(self, X, y=None): 171 | """Convenience method to perform fit and imputation in one go.""" 172 | return self.fit(X, y).impute(X) 173 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/interpolation.py: -------------------------------------------------------------------------------- 1 | """This module implements interpolation methods via the InterpolateImputer. 2 | 3 | InterpolateImputer imputes missing data using some interpolation strategies 4 | suppoted by pd.Series.interpolate. Linear is the default strategy, although a 5 | number of additional strategies exist. 
Dataframe imputers utilize this class 6 | when its strategy is requested. Use SingleImputer or MultipleImputer with 7 | strategy = `interpolate` to broadcast the strategy across all the columns in a 8 | dataframe, or specify this strategy for a given column. 9 | """ 10 | 11 | import pandas as pd 12 | from sklearn.utils.validation import check_is_fitted 13 | from autoimpute.imputations import method_names 14 | from .base import ISeriesImputer 15 | methods = method_names 16 | # pylint:disable=attribute-defined-outside-init 17 | # pylint:disable=unnecessary-pass 18 | # pylint:disable=unused-argument 19 | 20 | class InterpolateImputer(ISeriesImputer): 21 | """Impute missing values using interpolation techniques. 22 | 23 | The InterpolateImputer imputes missing values using a valid pd.Series 24 | interpolation strategy. See __init__ method docs for supported strategies. 25 | The imputer can be used directly, but such behavior is discouraged. 26 | InterpolateImputer does not have the flexibility / robustness of dataframe 27 | imputers, nor is its behavior identical. Preferred use is 28 | MultipleImputer(strategy="interpolate"). 29 | """ 30 | # class variables 31 | strategy = methods.INTERPOLATE 32 | fill_strategies = ( 33 | "linear", "time", "quadratic", "cubic", 34 | "spline", "barycentric", "polynomial" 35 | ) 36 | 37 | def __init__(self, fill_strategy="linear", 38 | start=None, end=None, order=None): 39 | """Create an instance of the InterpolateImputer class. 40 | 41 | Args: 42 | fill_strategy (str, Optional): type of interpolation to perform. 43 | Default is linear. Other strategies supported include: 44 | `time`, `quadratic`, `cubic`, `spline`, `barycentric`, 45 | `polynomial`. 46 | start (int, Optional): value to impute if first number in 47 | Series is missing. Default is None, but first valid used 48 | when required for quadratic, cubic, polynomial. 49 | end (int, Optional): value to impute if last number in 50 | Series is missing. Default is None, but last valid used 51 | when required for quadratic, cubic, polynomial. 52 | order (int, Optional): if strategy is spline or polynomial, 53 | order must be number. Otherwise not considered. 54 | 55 | Returns: 56 | self. Instance of the class. 57 | """ 58 | self.fill_strategy = fill_strategy 59 | self.start = start 60 | self.end = end 61 | self.order = order 62 | 63 | @property 64 | def fill_strategy(self): 65 | """Property getter to return the value of fill_strategy property.""" 66 | return self._fill_strategy 67 | 68 | @fill_strategy.setter 69 | def fill_strategy(self, fs): 70 | """Validate the fill_strategy property and set default parameters. 71 | 72 | Args: 73 | fs (str, Optional): if None, use linear. 74 | 75 | Raises: 76 | ValueError: not a valid fill strategy for InterpolateImputer 77 | """ 78 | if fs not in self.fill_strategies: 79 | err = f"{fs} not a valid fill strategy for InterpolateImputer" 80 | raise ValueError(err) 81 | self._fill_strategy = fs 82 | 83 | def _handle_start(self, v, X): 84 | "private method to handle start values." 85 | if v is None: 86 | v = X.loc[X.first_valid_index()] 87 | if v == "mean": 88 | v = X.mean() 89 | return v 90 | 91 | def _handle_end(self, v, X): 92 | "private method to handle end values." 93 | if v is None: 94 | v = X.loc[X.last_valid_index()] 95 | if v == "mean": 96 | v = X.mean() 97 | return v 98 | 99 | def fit(self, X, y=None): 100 | """Fit the Imputer to the dataset. Nothing to calculate. 101 | 102 | Args: 103 | X (pd.Series): Dataset to fit the imputer.
104 | y (None): ignored, None to meet requirements of base class 105 | 106 | Returns: 107 | self. Instance of the class. 108 | """ 109 | self.statistics_ = {"param": self.fill_strategy, 110 | "strategy": self.strategy} 111 | return self 112 | 113 | def impute(self, X): 114 | """Perform imputations using the statistics generated from fit. 115 | 116 | The impute method handles the actual imputation. Missing values 117 | in a given dataset are replaced with results from interpolation. 118 | 119 | Args: 120 | X (pd.Series): Dataset to impute missing data from fit. 121 | 122 | Returns: 123 | np.array -- imputed dataset. 124 | """ 125 | # check if fitted then impute with interpolation strategy 126 | check_is_fitted(self, "statistics_") 127 | imp = self.statistics_["param"] 128 | 129 | # setting defaults if no value passed for start and end 130 | # quadratic, cubic, and polynomial require first and last 131 | if imp in ("quadratic", "cubic", "polynomial"): 132 | # handle start and end... 133 | if pd.isnull(X.iloc[0]): 134 | ix = X.head(1).index[0] 135 | X.fillna( 136 | {ix: self._handle_start(self.start, X)}, inplace=True 137 | ) 138 | if pd.isnull(X.iloc[-1]): 139 | ix = X.tail(1).index[0] 140 | X.fillna( 141 | {ix: self._handle_end(self.end, X)}, inplace=True 142 | ) 143 | 144 | # handling for methods that need order 145 | num_observed = min(6, X.count()) 146 | if imp in ("polynomial", "spline"): 147 | if self.order is None or self.order >= num_observed: 148 | err = f"Order must be between 1 and {num_observed-1}" 149 | raise ValueError(err) 150 | 151 | # finally, perform interpolation 152 | return X.interpolate(method=imp, 153 | limit=None, 154 | limit_direction="both", 155 | inplace=False, 156 | order=self.order) 157 | 158 | def fit_impute(self, X, y=None): 159 | """Convenience method to perform fit and imputation in one go.""" 160 | return self.fit(X, y).impute(X) 161 |
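A minimal sketch of InterpolateImputer on an illustrative Series; the default linear strategy fills gaps along the line between observed points:

import numpy as np
import pandas as pd
from autoimpute.imputations.series import InterpolateImputer

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
imp = InterpolateImputer(fill_strategy="linear")
print(imp.fit_impute(s).tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]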
-------------------------------------------------------------------------------- /autoimpute/imputations/series/linear_regression.py: -------------------------------------------------------------------------------- 1 | """This module implements least squares and stochastic imputation. 2 | 3 | This module contains the LeastSquaresImputer and the StochasticImputer. Both 4 | use least squares to find a line of best fit and fill imputations with the 5 | predictions from the line. Stochastic adds random error to each prediction. 6 | Dataframe imputers utilize this class when its strategy is requested. Use 7 | SingleImputer or MultipleImputer with strategy = `least squares` to broadcast 8 | the strategy across all the columns in a dataframe, or specify this strategy 9 | for a given column. 10 | """ 11 | 12 | from numpy import sqrt 13 | from scipy.stats import norm 14 | from sklearn.utils.validation import check_is_fitted 15 | from sklearn.linear_model import LinearRegression 16 | from sklearn.metrics import mean_squared_error 17 | from autoimpute.imputations import method_names 18 | from autoimpute.imputations.errors import _not_num_series 19 | from .base import ISeriesImputer 20 | methods = method_names 21 | # pylint:disable=attribute-defined-outside-init 22 | # pylint: 23 | 24 | class LeastSquaresImputer(ISeriesImputer): 25 | """Impute missing values using predictions from least squares regression. 26 | 27 | The LeastSquaresImputer produces predictions using the least squares 28 | methodology. The predictions from the line of best fit given a set of 29 | predictors become the imputations. To implement least squares, the imputer 30 | wraps the sklearn LinearRegression class. The imputer can be used 31 | directly, but such behavior is discouraged. LeastSquaresImputer does not 32 | have the flexibility / robustness of dataframe imputers, nor is its 33 | behavior identical. Preferred use is 34 | MultipleImputer(strategy="least squares"). 35 | """ 36 | # class variables 37 | strategy = methods.LS 38 | 39 | def __init__(self, **kwargs): 40 | """Create an instance of the LeastSquaresImputer class. 41 | 42 | Args: 43 | **kwargs: keyword arguments passed to LinearRegression 44 | 45 | """ 46 | self.lm = LinearRegression(**kwargs) 47 | 48 | def fit(self, X, y): 49 | """Fit the Imputer to the dataset by fitting linear model. 50 | 51 | Args: 52 | X (pd.DataFrame): dataset to fit the imputer. 53 | y (pd.Series): response, which is eventually imputed. 54 | 55 | Returns: 56 | self. Instance of the class. 57 | """ 58 | _not_num_series(self.strategy, y) 59 | self.lm.fit(X, y) 60 | self.statistics_ = {"strategy": self.strategy} 61 | return self 62 | 63 | def impute(self, X): 64 | """Generate imputations using predictions from the fit linear model. 65 | 66 | The impute method returns the values for imputation. Missing values 67 | in a given dataset are replaced with the predictions from the least 68 | squares regression line of best fit. This impute method returns 69 | those predictions. 70 | 71 | Args: 72 | X (pd.DataFrame): predictors to determine imputed values. 73 | 74 | Returns: 75 | np.array: imputed dataset. 76 | """ 77 | # check if fitted then predict with least squares 78 | check_is_fitted(self, "statistics_") 79 | imp = self.lm.predict(X) 80 | return imp 81 | 82 | def fit_impute(self, X, y): 83 | """Fit impute method to generate imputations where y is missing. 84 | 85 | Args: 86 | X (pd.DataFrame): predictors in the dataset. 87 | y (pd.Series): response w/ missing values to impute. 88 | 89 | Returns: 90 | np.array: imputed dataset. 91 | """ 92 | # transform occurs with records from X where y is missing 93 | miss_y_ix = y[y.isnull()].index 94 | return self.fit(X, y).impute(X.loc[miss_y_ix]) 95 | 96 | class StochasticImputer(ISeriesImputer): 97 | """Impute missing values adding error to least squares regression preds. 98 | 99 | The StochasticImputer predicts using the least squares methodology. The 100 | imputer then samples from the regression's error distribution and adds the 101 | random draw to the prediction. This draw adds the stochastic element to 102 | the imputations. The imputer can be used directly, but such behavior is 103 | discouraged. StochasticImputer does not have the flexibility / robustness 104 | of dataframe imputers, nor is its behavior identical. Preferred use is 105 | MultipleImputer(strategy="stochastic"). 106 | """ 107 | # class variables 108 | strategy = methods.STOCHASTIC 109 | 110 | def __init__(self, **kwargs): 111 | """Create an instance of the StochasticImputer class. 112 | 113 | Args: 114 | **kwargs: keyword arguments passed to LinearRegression. 115 | 116 | """ 117 | self.lm = LinearRegression(**kwargs) 118 | 119 | def fit(self, X, y): 120 | """Fit the Imputer to the dataset by fitting linear model. 121 | 122 | The fit step also generates predictions on the observed data. These 123 | predictions are necessary to derive the mean_squared_error, which is 124 | passed as a parameter to the impute phase. The MSE is used to create 125 | the normal error distribution from which the imputer draws.
126 | 127 | Args: 128 | X (pd.DataFrame): dataset to fit the imputer. 129 | y (pd.Series): response, which is eventually imputed. 130 | 131 | Returns: 132 | self. Instance of the class. 133 | """ 134 | _not_num_series(self.strategy, y) 135 | self.lm.fit(X, y) 136 | preds = self.lm.predict(X) 137 | mse = mean_squared_error(y, preds) 138 | self.statistics_ = {"param": mse, "strategy": self.strategy} 139 | return self 140 | 141 | def impute(self, X): 142 | """Generate imputations using predictions from the fit linear model. 143 | 144 | The impute method returns the values for imputation. Missing values 145 | in a given dataset are replaced with the predictions from the least 146 | squares regression line of best fit plus a random draw from the normal 147 | error distribution. 148 | 149 | Args: 150 | X (pd.DataFrame): predictors to determine imputed values. 151 | 152 | Returns: 153 | np.array: imputed dataset. 154 | """ 155 | # check if fitted then predict with least squares 156 | check_is_fitted(self, "statistics_") 157 | mse = self.statistics_["param"] 158 | preds = self.lm.predict(X) 159 | 160 | # add random draw from normal dist w/ mean squared error 161 | # from observed model. This makes lm stochastic 162 | mse_dist = norm.rvs(loc=0, scale=sqrt(mse), size=len(preds)) 163 | imp = preds + mse_dist 164 | return imp 165 | 166 | def fit_impute(self, X, y): 167 | """Fit impute method to generate imputations where y is missing. 168 | 169 | Args: 170 | X (pd.DataFrame): predictors in the dataset. 171 | y (pd.Series): response w/ missing values to impute 172 | 173 | Returns: 174 | np.array: imputed dataset. 175 | """ 176 | # transform occurs with records from X where y is missing 177 | miss_y_ix = y[y.isnull()].index 178 | return self.fit(X, y).impute(X.loc[miss_y_ix]) 179 |
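A minimal sketch of StochasticImputer with illustrative data. The dataframe imputers normally split observed from missing rows; used directly, that split is done by hand:

import numpy as np
import pandas as pd
from autoimpute.imputations.series import StochasticImputer

X = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0, 5.0]})
y = pd.Series([2.0, 4.1, 6.0, 7.9, np.nan], name="y")

obs = y.notnull()
imp = StochasticImputer().fit(X[obs], y[obs])  # fit on observed rows only
print(imp.impute(X[~obs]))  # least squares prediction plus a random error draw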
-------------------------------------------------------------------------------- /autoimpute/imputations/series/logistic_regression.py: -------------------------------------------------------------------------------- 1 | """This module implements logistic regression imputation. 2 | 3 | This module contains the BinaryLogisticImputer and the MultinomialLogisticImputer. 4 | Both use logistic regression to generate class predictions that become values 5 | for imputations of missing data. Binary is optimized to deal with two classes, 6 | while Multi is optimized to deal with multiple classes. Dataframe imputers 7 | utilize these classes when each's strategy is requested. Use SingleImputer or 8 | MultipleImputer with strategy = `binary logistic` or `multinomial logistic` 9 | to broadcast either strategy across all the columns in a dataframe, or specify 10 | either strategy for a given column. 11 | """ 12 | 13 | import warnings 14 | from pandas import Series 15 | from sklearn.utils.validation import check_is_fitted 16 | from sklearn.linear_model import LogisticRegression 17 | from autoimpute.imputations import method_names 18 | from .base import ISeriesImputer 19 | methods = method_names 20 | # pylint:disable=attribute-defined-outside-init 21 | # pylint: 22 | 23 | class BinaryLogisticImputer(ISeriesImputer): 24 | """Impute missing values w/ predictions from binary logistic regression. 25 | 26 | The BinaryLogisticImputer produces predictions using logistic regression 27 | with two classes. The class predictions given a set of predictors become 28 | the imputations. To implement logistic regression, the imputer wraps the 29 | sklearn LogisticRegression class with a default solver (liblinear). The 30 | imputer can be used directly, but such behavior is discouraged. 31 | BinaryLogisticImputer does not have the flexibility / robustness of 32 | dataframe imputers, nor is its behavior identical. Preferred use is 33 | MultipleImputer(strategy="binary logistic"). 34 | """ 35 | # class variables 36 | strategy = methods.BINARY_LOGISTIC 37 | 38 | def __init__(self, **kwargs): 39 | """Create an instance of the BinaryLogisticImputer class. 40 | 41 | Args: 42 | **kwargs: keyword arguments passed to LogisticRegression. 43 | 44 | """ 45 | self.solver = kwargs.pop("solver", "liblinear") 46 | self.glm = LogisticRegression(solver=self.solver, **kwargs) 47 | 48 | def fit(self, X, y): 49 | """Fit the Imputer to the dataset by fitting logistic model. 50 | 51 | Args: 52 | X (pd.DataFrame): dataset to fit the imputer. 53 | y (pd.Series): response, which is eventually imputed. 54 | 55 | Returns: 56 | self. Instance of the class. 57 | """ 58 | y = y.astype("category").cat 59 | y_cat_l = len(y.codes.unique()) 60 | if y_cat_l > 2: 61 | err = "Binary requires 2 categories. Use multinomial instead." 62 | raise ValueError(err) 63 | self.glm.fit(X, y.codes) 64 | self.statistics_ = {"param": y.categories, "strategy": self.strategy} 65 | return self 66 | 67 | def impute(self, X): 68 | """Generate imputations using predictions from the fit logistic model. 69 | 70 | The impute method returns the values for imputation. Missing values 71 | in a given dataset are replaced with the predictions from the logistic 72 | regression class specification. 73 | 74 | Args: 75 | X (pd.DataFrame): predictors to determine imputed values. 76 | 77 | Returns: 78 | np.array: imputed dataset. 79 | """ 80 | # check if fitted then predict with logistic 81 | check_is_fitted(self, "statistics_") 82 | labels = self.statistics_["param"] 83 | preds = self.glm.predict(X) 84 | 85 | # map category codes back to actual labels 86 | # then impute the actual labels to keep categories intact 87 | label_dict = {i:j for i, j in enumerate(labels.values)} 88 | imp = Series(preds).replace(label_dict, inplace=False) 89 | return imp.values 90 | 91 | def fit_impute(self, X, y): 92 | """Fit impute method to generate imputations where y is missing. 93 | 94 | Args: 95 | X (pd.DataFrame): predictors in the dataset. 96 | y (pd.Series): response w/ missing values to impute. 97 | 98 | Returns: 99 | np.array: imputed dataset. 100 | """ 101 | # transform occurs with records from X where y is missing 102 | miss_y_ix = y[y.isnull()].index 103 | return self.fit(X, y).impute(X.loc[miss_y_ix]) 104 | 105 | class MultinomialLogisticImputer(ISeriesImputer): 106 | """Impute missing values w/ preds from multinomial logistic regression. 107 | 108 | The MultinomialLogisticImputer produces predictions w/ logistic regression 109 | with more than two classes. Class predictions given a set of predictors 110 | become the imputations. To implement logistic regression, the imputer 111 | wraps the sklearn LogisticRegression class with a default solver (saga) 112 | and default `multi_class` set to multinomial. The imputer can be used 113 | directly, but such behavior is discouraged. MultinomialLogisticImputer 114 | does not have the flexibility / robustness of dataframe imputers, nor is 115 | its behavior identical. Preferred use is 116 | MultipleImputer(strategy="multinomial logistic"). 117 | """ 118 | # class variables 119 | strategy = methods.MULTI_LOGISTIC 120 | 121 | def __init__(self, **kwargs): 122 | """Create an instance of the MultinomialLogisticImputer class.
123 | 124 | Args: 125 | **kwargs: keyword arguments passed to LogisticRegression. 126 | 127 | """ 128 | self.solver = kwargs.pop("solver", "saga") 129 | self.multiclass = kwargs.pop("multi_class", "multinomial") 130 | self.glm = LogisticRegression( 131 | solver=self.solver, 132 | multi_class=self.multiclass, 133 | **kwargs 134 | ) 135 | 136 | def fit(self, X, y): 137 | """Fit the Imputer to the dataset by fitting logistic model. 138 | 139 | Args: 140 | X (pd.DataFrame): dataset to fit the imputer. 141 | y (pd.Series): response, which is eventually imputed. 142 | 143 | Returns: 144 | self. Instance of the class. 145 | """ 146 | y = y.astype("category").cat 147 | y_cat_l = len(y.codes.unique()) 148 | if y_cat_l == 2: 149 | w = "Multiple categories (c) expected. Use binary instead if c=2." 150 | warnings.warn(w) 151 | self.glm.fit(X, y.codes) 152 | self.statistics_ = {"param": y.categories, "strategy": self.strategy} 153 | return self 154 | 155 | def impute(self, X): 156 | """Generate imputations using predictions from the fit logistic model. 157 | 158 | The impute method returns the values for imputation. Missing values 159 | in a given dataset are replaced with the predictions from the logistic 160 | regression class specification. 161 | 162 | Args: 163 | X (pd.DataFrame): predictors to determine imputed values. 164 | 165 | Returns: 166 | np.array: imputed dataset. 167 | """ 168 | # check if fitted then predict with logistic 169 | check_is_fitted(self, "statistics_") 170 | labels = self.statistics_["param"] 171 | preds = self.glm.predict(X) 172 | 173 | # map category codes back to actual labels 174 | # then impute the actual labels to keep categories intact 175 | label_dict = {i:j for i, j in enumerate(labels.values)} 176 | imp = Series(preds).replace(label_dict, inplace=False) 177 | return imp.values 178 | 179 | def fit_impute(self, X, y): 180 | """Fit impute method to generate imputations where y is missing. 181 | 182 | Args: 183 | X (pd.DataFrame): predictors in the dataset. 184 | y (pd.Series): response w/ missing values to impute. 185 | 186 | Returns: 187 | np.array: imputed dataset. 188 | """ 189 | # transform occurs with records from X where y is missing 190 | miss_y_ix = y[y.isnull()].index 191 | return self.fit(X, y).impute(X.loc[miss_y_ix]) 192 |
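A minimal sketch of BinaryLogisticImputer with illustrative data, again splitting observed from missing rows by hand:

import numpy as np
import pandas as pd
from autoimpute.imputations.series import BinaryLogisticImputer

X = pd.DataFrame({"x1": [0.2, 0.9, 1.8, 2.7, 3.1, 0.4]})
y = pd.Series(["no", "yes", "yes", "yes", "yes", np.nan], name="y")

obs = y.notnull()
imp = BinaryLogisticImputer().fit(X[obs], y[obs])  # fit on observed rows only
print(imp.impute(X[~obs]))  # predicted class labels mapped back from category codes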
9 | """ 10 | 11 | import numpy as np 12 | import pymc as pm 13 | from pandas import DataFrame 14 | from scipy.stats import multivariate_normal 15 | from sklearn.linear_model import LinearRegression 16 | from sklearn.utils.validation import check_is_fitted 17 | from autoimpute.imputations import method_names 18 | from autoimpute.imputations.errors import _not_num_series 19 | from autoimpute.imputations.helpers import _local_residuals 20 | from .base import ISeriesImputer 21 | methods = method_names 22 | # pylint:disable=attribute-defined-outside-init 23 | # pylint:disable=no-member 24 | # pylint:disable=unused-variable 25 | 26 | class LRDImputer(ISeriesImputer): 27 | """Impute missing values using local residual draws. 28 | 29 | The LRDImputer produces predictions using a combination of bayesian 30 | approach to least squares and least squares itself. For each missing value 31 | LRD finds the `n` closest neighbors from a least squares regression 32 | prediction set, and samples from the corresponding true values for y of 33 | each of those `n` predictions. The imputation is the resulting sample plus 34 | the residual, or the distance between the prediction and the neighbor. 35 | To implement bayesian least squares, the imputer utlilizes the pymc 36 | library. The imputer can be used directly, but such behavior is 37 | discouraged. LRDImputer does not have the flexibility / robustness of 38 | dataframe imputers, nor is its behavior identical. Preferred use is 39 | MultipleImputer(strategy="lrd"). 40 | """ 41 | # class variables 42 | strategy = methods.LRD 43 | 44 | def __init__(self, **kwargs): 45 | """Create an instance of the LRDImputer class. 46 | 47 | The class requires multiple arguments necessary to create priors for 48 | a bayesian linear regression equation and least squares itself. 49 | Therefore, LRD arguments include all of those seen in bayesian least 50 | squares and least squares itself. New parameters include `neighbors`, 51 | or the number of neighbors that LRD uses to sample observed. 52 | 53 | Args: 54 | **kwargs: default keyword arguments for lm & bayesian analysis. 55 | Note - kwargs popped for default arguments defined below. 56 | Next set of kwargs popped and sent to linear regression. 57 | Rest of kwargs passed as params to sampling (see pymc). 58 | am (float, Optional): mean of alpha prior. Default 0. 59 | asd (float, Optional): std. deviation of alpha prior. Default 10. 60 | bm (float, Optional): mean of beta priors. Default 0. 61 | bsd (float, Optional): std. deviation of beta priors. Default 10. 62 | sig (float, Optional): parameter of sigma prior. Default 1. 63 | sample (int, Optional): number of posterior samples per chain. 64 | Default = 1000. More samples, longer to run, but better 65 | approximation of the posterior & chance of convergence. 66 | tune (int, Optional): parameter for tuning. Draws done in addition 67 | to sample. Default = 1000. 68 | init (str, Optional): MCMC algo to use for posterior sampling. 69 | Default = 'auto'. See pymc docs for more info on choices. 70 | fill_value (str, Optional): How to draw from the posterior to 71 | create imputations. Default is "random". 'random' and 'mean' 72 | supported for explicit options. 73 | neighbors (int, Optional): number of neighbors. Default is 5. 74 | Value should be greater than 0 and less than # observed, 75 | although anything greater than 10-20 generally too high 76 | unless dataset is massive. 77 | fit_intercept (bool, Optional): sklearn LinearRegression param. 
78 | copy_x (bool, Optional): sklearn LinearRegression param. 79 | n_jobs (int, Optional): sklearn LinearRegression param. 80 | """ 81 | self.am = kwargs.pop("am", None) 82 | self.asd = kwargs.pop("asd", 10) 83 | self.bm = kwargs.pop("bm", None) 84 | self.bsd = kwargs.pop("bsd", 10) 85 | self.sig = kwargs.pop("sig", 1) 86 | self.sample = kwargs.pop("sample", 1000) 87 | self.tune = kwargs.pop("tune", 1000) 88 | self.init = kwargs.pop("init", "auto") 89 | self.fill_value = kwargs.pop("fill_value", "random") 90 | self.neighbors = kwargs.pop("neighbors", 5) 91 | self.fit_intercept = kwargs.pop("fit_intercept", True) 92 | self.copy_x = kwargs.pop("copy_x", True) 93 | self.n_jobs = kwargs.pop("n_jobs", None) 94 | self.lm = LinearRegression( 95 | fit_intercept=self.fit_intercept, 96 | copy_X=self.copy_x, 97 | n_jobs=self.n_jobs 98 | ) 99 | self.sample_kwargs = kwargs 100 | 101 | def fit(self, X, y): 102 | """Fit the Imputer to the dataset by fitting bayesian and LS model. 103 | 104 | Args: 105 | X (pd.DataFrame): dataset to fit the imputer. 106 | y (pd.Series): response, which is eventually imputed. 107 | 108 | Returns: 109 | self. Instance of the class. 110 | """ 111 | _not_num_series(self.strategy, y) 112 | nc = len(X.columns) 113 | 114 | # get predictions for the data, which will be used for "closest" vals 115 | y_pred = self.lm.fit(X, y).predict(X) 116 | y_df = DataFrame({"y": y, "y_pred": y_pred}) 117 | 118 | # calculate bayes and use appropriate means for alpha and beta priors 119 | # here we specify the point estimates from the linear regression as the 120 | # means for the priors. This will greatly speed up posterior sampling 121 | # and help ensure that convergence occurs 122 | if self.am is None: 123 | self.am = self.lm.intercept_ 124 | if self.bm is None: 125 | self.bm = self.lm.coef_ 126 | 127 | # initialize model for bayesian linear reg. Default vals for priors 128 | # assume data is scaled and centered. Convergence can struggle or fail 129 | # if not the case and proper values for the priors are not specified 130 | # separately, also assumes each beta is normal and "independent" 131 | # while betas likely not independent, this is technically a rule of OLS 132 | with pm.Model() as fit_model: 133 | alpha = pm.Normal("alpha", self.am, self.asd) 134 | beta = pm.Normal("beta", self.bm, self.bsd, shape=nc) 135 | sigma = pm.HalfCauchy("σ", self.sig) 136 | mu = alpha+beta.dot(X.T) 137 | score = pm.Normal("score", mu, sigma, observed=y) 138 | params = {"model": fit_model, "y_obs": y_df} 139 | self.statistics_ = {"param": params, "strategy": self.strategy} 140 | return self 141 | 142 | def impute(self, X): 143 | """Generate imputations using predictions from the fit bayesian model. 144 | 145 | The impute method returns the values for imputation. Missing values 146 | in a given dataset are replaced with the random selection from the LRD 147 | process. Again, LRD imputes actually observed values, and the observed 148 | values are selected by finding the closest least squares predictions 149 | to a given prediction from the bayesian model. 150 | 151 | Args: 152 | X (pd.DataFrame): predictors to determine imputed values. 153 | 154 | Returns: 155 | np.array: imputed dataset.
156 | """ 157 | # check if fitted then predict with least squares 158 | check_is_fitted(self, "statistics_") 159 | model = self.statistics_["param"]["model"] 160 | df = self.statistics_["param"]["y_obs"] 161 | df = df.reset_index(drop=True) 162 | 163 | # generate posterior distribution for alpha, beta coefficients 164 | with model: 165 | tr = pm.sample( 166 | self.sample, 167 | tune=self.tune, 168 | init=self.init, 169 | **self.sample_kwargs 170 | ) 171 | self.trace_ = tr 172 | 173 | # support for pymc - handling InferenceData obj instead of MultiTrace 174 | # we have to compress chains ourselves w/ InferenceData obj (xarray) 175 | post = tr.posterior 176 | alpha_, beta_ = post.alpha.values, post.beta.values 177 | chain, draws, beta_dim = beta_.shape 178 | beta_ = beta_.reshape(chain*draws, beta_dim) 179 | 180 | # sample random alpha from alpha posterior distribution 181 | alpha_bayes = np.random.choice(alpha_.ravel()) 182 | 183 | # get the mean and covariance of the multivariate betas 184 | # betas assumed multivariate normal by linear reg rules 185 | # sample beta w/ cov structure to create realistic variability 186 | beta_means, beta_cov = beta_.mean(0), np.cov(beta_.T) 187 | beta_bayes = np.array(multivariate_normal(beta_means, beta_cov).rvs()) 188 | 189 | # predictions for missing y, using bayes alpha + coeff samples 190 | # use these preds for nearest neighbor search from reg results 191 | # neighbors are nearest from prediction model fit on observed 192 | # imputed values are actual y vals corresponding to nearest neighbors 193 | # therefore, this is a form of "hot-deck" imputation 194 | y_pred_bayes = alpha_bayes + beta_bayes.dot(X.T) 195 | n_ = self.neighbors 196 | if X.columns.size == 1: 197 | y_pred_bayes = y_pred_bayes[0] 198 | if self.fill_value == "mean": 199 | imp = [_local_residuals(x, n_, df, np.mean) for x in y_pred_bayes] 200 | elif self.fill_value == "random": 201 | choice = np.random.choice 202 | imp = [_local_residuals(x, n_, df, choice) for x in y_pred_bayes] 203 | else: 204 | err = f"{self.fill_value} must be `mean` or `random`." 205 | raise ValueError(err) 206 | 207 | # finally, set last class values and return imputations 208 | self.y_pred = y_pred_bayes 209 | self.alphas = alpha_bayes 210 | self.betas = beta_bayes 211 | return imp 212 | 213 | def fit_impute(self, X, y): 214 | """Fit impute method to generate imputations where y is missing. 215 | 216 | Args: 217 | X (pd.Dataframe): predictors in the dataset. 218 | y (pd.Series): response w/ missing values to impute. 219 | 220 | Returns: 221 | np.array: imputed dataset. 222 | """ 223 | # transform occurs with records from X where y is missing 224 | miss_y_ix = y[y.isnull()].index 225 | return self.fit(X, y).impute(X.loc[miss_y_ix]) 226 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/mean.py: -------------------------------------------------------------------------------- 1 | """This module implements mean imputation via the MeanImputer. 2 | 3 | The MeanImputer imputes missing data with the mean of observed data. 4 | Dataframe imputers utilize this class when its strategy is requested. Use 5 | SingleImputer or MultipleImputer with strategy = `mean` to broadcast the 6 | strategy across all the columns in a dataframe, or specify this strategy 7 | for a given column. 
8 | """ 9 | 10 | from sklearn.utils.validation import check_is_fitted 11 | from autoimpute.imputations import method_names 12 | from autoimpute.imputations.errors import _not_num_series 13 | from .base import ISeriesImputer 14 | methods = method_names 15 | # pylint:disable=attribute-defined-outside-init 16 | # pylint:disable=unnecessary-pass 17 | 18 | class MeanImputer(ISeriesImputer): 19 | """Impute missing values with the mean of the observed data. 20 | 21 | This imputer imputes missing values with the mean of observed data. 22 | The imputer can be used directly, but such behavior is discouraged. 23 | MeanImputer does not have the flexibility / robustness of dataframe 24 | imputers, nor is its behavior identical. Preferred use is 25 | MultipleImputer(strategy="mean"). 26 | """ 27 | # class variables 28 | strategy = methods.MEAN 29 | 30 | def __init__(self): 31 | """Create an instance of the MeanImputer class.""" 32 | pass 33 | 34 | def fit(self, X, y): 35 | """Fit the Imputer to the dataset and calculate the mean. 36 | 37 | Args: 38 | X (pd.Series): Dataset to fit the imputer. 39 | y (None): ignored, None to meet requirements of base class 40 | 41 | Returns: 42 | self. Instance of the class. 43 | """ 44 | _not_num_series(self.strategy, X) 45 | mu = X.mean() 46 | self.statistics_ = {"param": mu, "strategy": self.strategy} 47 | return self 48 | 49 | def impute(self, X): 50 | """Perform imputations using the statistics generated from fit. 51 | 52 | The impute method handles the actual imputation. Missing values 53 | in a given dataset are replaced with the respective mean from fit. 54 | 55 | Args: 56 | X (pd.Series): Dataset to impute missing data from fit. 57 | 58 | Returns: 59 | float -- imputed dataset. 60 | """ 61 | # check if fitted then impute with mean 62 | check_is_fitted(self, "statistics_") 63 | _not_num_series(self.strategy, X) 64 | imp = self.statistics_["param"] 65 | return imp 66 | 67 | def fit_impute(self, X, y=None): 68 | """Convenience method to perform fit and imputation in one go.""" 69 | return self.fit(X, y).impute(X) 70 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/median.py: -------------------------------------------------------------------------------- 1 | """This module implements median imputation via the MedianImputer. 2 | 3 | The MedianImputer imputes missing data with the median of observed data. 4 | Dataframe imputers utilize this class when its strategy is requested. Use 5 | SingleImputer or MultipleImputer with strategy = `median` to broadcast the 6 | strategy across all the columns in a dataframe, or specify this strategy 7 | for a given column. 8 | """ 9 | 10 | from sklearn.utils.validation import check_is_fitted 11 | from autoimpute.imputations import method_names 12 | from autoimpute.imputations.errors import _not_num_series 13 | from .base import ISeriesImputer 14 | methods = method_names 15 | # pylint:disable=attribute-defined-outside-init 16 | # pylint:disable=unnecessary-pass 17 | 18 | class MedianImputer(ISeriesImputer): 19 | """Impute missing values with the median of the observed data. 20 | 21 | This imputer imputes missing values with the median of observed data. 22 | The imputer can be used directly, but such behavior is discouraged. 23 | MedianImputer does not have the flexibility / robustness of dataframe 24 | imputers, nor is its behavior identical. Preferred use is 25 | MultipleImputer(strategy="median"). 
26 | """ 27 | # class variables 28 | strategy = methods.MEDIAN 29 | 30 | def __init__(self): 31 | """Create an instance of the MedianImputer class.""" 32 | pass 33 | 34 | def fit(self, X, y=None): 35 | """Fit the Imputer to the dataset and calculate the median. 36 | 37 | Args: 38 | X (pd.Series): Dataset to fit the imputer. 39 | y (None): ignored, None to meet requirements of base class 40 | 41 | Returns: 42 | self. Instance of the class. 43 | """ 44 | _not_num_series(self.strategy, X) 45 | median = X.median() 46 | self.statistics_ = {"param": median, "strategy": self.strategy} 47 | return self 48 | 49 | def impute(self, X): 50 | """Perform imputations using the statistics generated from fit. 51 | 52 | The impute method handles the actual imputation. Missing values 53 | in a given dataset are replaced with the respective median from fit. 54 | 55 | Args: 56 | X (pd.Series): Dataset to impute missing data from fit. 57 | 58 | Returns: 59 | float -- imputed dataset. 60 | """ 61 | # check is fitted then impute with median 62 | check_is_fitted(self, "statistics_") 63 | _not_num_series(self.strategy, X) 64 | imp = self.statistics_["param"] 65 | return imp 66 | 67 | def fit_impute(self, X, y=None): 68 | """Convenience method to perform fit and imputation in one go.""" 69 | return self.fit(X, y).impute(X) 70 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/mode.py: -------------------------------------------------------------------------------- 1 | """This module implements mode imputation via the ModeImputer. 2 | 3 | The ModeImputer uses the mode of observed data to impute missing values. 4 | Dataframe imputers utilize this class when its strategy is requested. Use 5 | SingleImputer or MultipleImputer with strategy = `mode` to broadcast the 6 | strategy across all the columns in a dataframe, or specify this strategy 7 | for a given column. 8 | """ 9 | 10 | import numpy as np 11 | import pandas as pd 12 | from sklearn.utils.validation import check_is_fitted 13 | from autoimpute.imputations import method_names 14 | from .base import ISeriesImputer 15 | methods = method_names 16 | # pylint:disable=attribute-defined-outside-init 17 | 18 | class ModeImputer(ISeriesImputer): 19 | """Impute missing values with the mode of the observed data. 20 | 21 | The mode imputer calculates the mode of the observed dataset and uses 22 | it to impute missing observations. In the case where there are more than 23 | one mode, the user can supply a `fill_strategy` to choose the mode. 24 | The imputer can be used directly, but such behavior is discouraged. 25 | ModeImputer does not have the flexibility / robustness of dataframe 26 | imputers, nor is its behavior identical. Preferred use is 27 | MultipleImputer(strategy="mode"). 28 | """ 29 | # class variables 30 | strategy = methods.MODE 31 | fill_strategies = (None, "first", "last", "random") 32 | 33 | def __init__(self, fill_strategy=None): 34 | """Create an instance of the ModeImputer class. 35 | 36 | Args: 37 | fill_strategy (str, Optional): strategy to pick mode, if multiple. 38 | Default is None, which means first mode taken. 39 | Options include None, first, last, random. 40 | First, None -> select first of modes. 41 | Last -> select the last of modes. 42 | Random -> randomly sample from modes with replacement. 
43 | """ 44 | self.fill_strategy = fill_strategy 45 | 46 | @property 47 | def fill_strategy(self): 48 | """Property getter to return the value of fill_strategy property.""" 49 | return self._fill_strategy 50 | 51 | @fill_strategy.setter 52 | def fill_strategy(self, fs): 53 | """Validate the fill_strategy property and set default parameters. 54 | 55 | Args: 56 | fs (str, None): if None, use first mode. 57 | 58 | Raises: 59 | ValueError: not a valid fill strategy for ModeImputer. 60 | """ 61 | if fs not in self.fill_strategies: 62 | err = f"{fs} not a valid fill strategy for ModeImputer" 63 | raise ValueError(err) 64 | self._fill_strategy = fs 65 | 66 | def fit(self, X, y=None): 67 | """Fit the Imputer to the dataset and calculate the mode. 68 | 69 | Args: 70 | X (pd.Series): Dataset to fit the imputer. 71 | y (None): ignored, None to meet requirements of base class 72 | 73 | Returns: 74 | self. Instance of the class. 75 | """ 76 | mode = X.mode().values 77 | self.statistics_ = {"param": mode, "strategy": self.strategy} 78 | return self 79 | 80 | def impute(self, X): 81 | """Perform imputations using the statistics generated from fit. 82 | 83 | This method handles the actual imputation. Missing values in a given 84 | dataset are replaced with the mode observed from fit. Note that there 85 | can be more than one mode. If more than one mode, use the 86 | fill_strategy to determine how to use the modes. 87 | 88 | Args: 89 | X (pd.Series): Dataset to impute missing data from fit. 90 | 91 | Returns: 92 | float or np.array -- imputed dataset. 93 | """ 94 | # check is fitted and identify locations of missingness 95 | check_is_fitted(self, "statistics_") 96 | ind = X[X.isnull()].index 97 | 98 | # get the number of modes 99 | imp = self.statistics_["param"] 100 | 101 | # default imputation is to pick first, such as scipy does 102 | if self.fill_strategy is None: 103 | imp = imp[0] 104 | 105 | # picking the first of the modes when fill_strategy = first 106 | if self.fill_strategy == "first": 107 | imp = imp[0] 108 | 109 | # picking the last of the modes when fill_strategy = last 110 | if self.fill_strategy == "last": 111 | imp = imp[-1] 112 | 113 | # sampling when strategy is random 114 | if self.fill_strategy == "random": 115 | num_modes = len(imp) 116 | # check if more modes 117 | if num_modes == 1: 118 | imp = imp[0] 119 | else: 120 | samples = np.random.choice(imp, len(ind)) 121 | imp = pd.Series(samples, index=ind).values 122 | 123 | # finally, fill in the right fill values for missing X 124 | return imp 125 | 126 | def fit_impute(self, X, y=None): 127 | """Convenience method to perform fit and imputation in one go.""" 128 | return self.fit(X, y).impute(X) 129 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/norm.py: -------------------------------------------------------------------------------- 1 | """This module implements norm imputation via the NormImputer. 2 | 3 | The NormImputer imputes missing data with random draws from a construted 4 | normal distribution. Dataframe imputers utilize this class when its strategy 5 | is requested. Use SingleImputer or MultipleImputer with strategy = `norm` to 6 | broadcast the strategy across all the columns in a dataframe, or specify this 7 | strategy for a given column. 
8 | """ 9 | 10 | from scipy.stats import norm 11 | from sklearn.utils.validation import check_is_fitted 12 | from autoimpute.imputations import method_names 13 | from autoimpute.imputations.errors import _not_num_series 14 | from .base import ISeriesImputer 15 | methods = method_names 16 | # pylint:disable=attribute-defined-outside-init 17 | # pylint:disable=unnecessary-pass 18 | 19 | class NormImputer(ISeriesImputer): 20 | """Impute missing data with draws from normal distribution. 21 | 22 | The NormImputer constructs a normal distribution using the sample mean and 23 | variance of the observed data. The imputer then randomly samples from this 24 | distribution to impute missing data. The imputer can be used directly, but 25 | such behavior is discouraged. NormImputer does not have the flexibility / 26 | robustness of dataframe imputers, nor is its behavior identical. 27 | Preferred use is MultipleImputer(strategy="norm"). 28 | """ 29 | # class variables 30 | strategy = methods.NORM 31 | 32 | def __init__(self): 33 | """Create an instance of the NormImputer class.""" 34 | pass 35 | 36 | def fit(self, X, y=None): 37 | """Fit Imputer to dataset and calculate mean and sample variance. 38 | 39 | Args: 40 | X (pd.Series): Dataset to fit the imputer. 41 | y (None): ignored, None to meet requirements of base class 42 | 43 | Returns: 44 | self. Instance of the class. 45 | """ 46 | 47 | # get the moments for the normal distribution of feature X 48 | _not_num_series(self.strategy, X) 49 | moments = (X.mean(), X.std()) 50 | self.statistics_ = {"param": moments, "strategy": self.strategy} 51 | return self 52 | 53 | def impute(self, X): 54 | """Perform imputations using the statistics generated from fit. 55 | 56 | The transform method handles the actual imputation. It constructs a 57 | normal distribution using the sample mean and variance from fit. 58 | It then imputes missing values with a random draw from the respective 59 | distribution. 60 | 61 | Args: 62 | X (pd.Series): Dataset to impute missing data from fit. 63 | 64 | Returns: 65 | np.array -- imputed dataset. 66 | """ 67 | 68 | # check if fitted and identify location of missingness 69 | check_is_fitted(self, "statistics_") 70 | _not_num_series(self.strategy, X) 71 | ind = X[X.isnull()].index 72 | 73 | # create normal distribution and sample from it 74 | imp_mean, imp_std = self.statistics_["param"] 75 | imp = norm(imp_mean, imp_std).rvs(size=len(ind)) 76 | return imp 77 | 78 | def fit_impute(self, X, y): 79 | """Convenience method to perform fit and imputation in one go.""" 80 | return self.fit(X, y=None).impute(X) 81 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/norm_unit_variance.py: -------------------------------------------------------------------------------- 1 | """This module implements normal imputation with constant unit variance single imputation 2 | via the NormUnitVarianceImputer. 3 | 4 | The NormUnitVarianceImputer imputes missing data assuming that the 5 | single column is normally distributed with a-priori known constant unit 6 | variance. Use SingleImputer or MultipleImputer with strategy=`norm_const_variance` 7 | to broadcast the strategy across all the columns in a dataframe, 8 | or specify this strategy for a given column. 
9 | """ 10 | 11 | from scipy import stats 12 | import pandas as pd 13 | import numpy as np 14 | from sklearn.utils.validation import check_is_fitted 15 | from autoimpute.imputations import method_names 16 | from autoimpute.imputations.errors import _not_num_series 17 | from .base import ISeriesImputer 18 | methods = method_names 19 | # pylint:disable=attribute-defined-outside-init 20 | # pylint:disable=unnecessary-pass 21 | 22 | class NormUnitVarianceImputer(ISeriesImputer): 23 | """Impute missing values assuming normally distributed 24 | data with unknown mean and *known* variance. 25 | """ 26 | # class variables 27 | strategy = methods.NORM_UNIT_VARIANCE 28 | 29 | def __init__(self): 30 | """Create an instance of the NormUnitVarianceImputer class.""" 31 | pass 32 | 33 | def fit(self, X, y): 34 | """Fit the Imputer to the dataset and calculate the mean. 35 | 36 | Args: 37 | X (pd.Series): Dataset to fit the imputer. 38 | y (None): ignored, None to meet requirements of base class 39 | 40 | Returns: 41 | self. Instance of the class. 42 | """ 43 | _not_num_series(self.strategy, X) 44 | mu = X.mean() # mean of observed data 45 | self.statistics_ = {"param": mu, "strategy": self.strategy} 46 | return self 47 | 48 | def impute(self, X): 49 | """Perform imputations using the statistics generated from fit. 50 | 51 | The impute method handles the actual imputation. Missing values 52 | in a given dataset are replaced with the respective mean from fit. 53 | 54 | Args: 55 | X (pd.Series): Dataset to impute missing data from fit. 56 | 57 | Returns: 58 | np.array -- imputed dataset. 59 | """ 60 | # check if fitted then impute with mean 61 | check_is_fitted(self, "statistics_") 62 | _not_num_series(self.strategy, X) 63 | omu = self.statistics_["param"] # mean of observed data 64 | idx = X.isnull() # missing data 65 | nO = sum(~idx) # number of observed 66 | m = sum(idx) # number to impute 67 | muhatk = stats.norm(omu,np.sqrt(1/nO)) 68 | # imputation cross-terms *NOT* uncorrelated 69 | Ymi=stats.multivariate_normal(np.ones(m)*muhatk.rvs(), 70 | np.ones((m,m))/nO+np.eye(m)).rvs() 71 | out = X.copy() 72 | out[idx] = Ymi 73 | return out 74 | 75 | def fit_impute(self, X, y=None): 76 | """Convenience method to perform fit and imputation in one go.""" 77 | return self.fit(X, y).impute(X) 78 | 79 | if __name__ == '__main__': 80 | from autoimpute.imputations import SingleImputer 81 | si=SingleImputer('normal unit variance') 82 | Yo=stats.norm(0,1).rvs(100) 83 | df = pd.DataFrame(columns=['Yo'],index=range(200),dtype=float) 84 | df.loc[range(100),'Yo'] = Yo 85 | si.fit_transform(df) 86 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/pmm.py: -------------------------------------------------------------------------------- 1 | """This module implements predictive mean matching via the PMMImputer. 2 | 3 | This module contains the PMMImputer, which implements predictive mean matching 4 | to impute missing values. Predictive mean matching is a semi-supervised, 5 | hot-deck technique to impute missing values. Dataframe imputers utilize this 6 | class when its strategy is requested. Use SingleImputer or MultipleImputer 7 | with strategy = `pmm` to broadcast the strategy across all the columns in a 8 | dataframe, or specify this strategy for a given column. 
9 | """ 10 | 11 | import numpy as np 12 | import pymc as pm 13 | from pandas import DataFrame 14 | from scipy.stats import multivariate_normal 15 | from sklearn.linear_model import LinearRegression 16 | from sklearn.utils.validation import check_is_fitted 17 | from autoimpute.imputations import method_names 18 | from autoimpute.imputations.helpers import _neighbors 19 | from autoimpute.imputations.errors import _not_num_series 20 | from .base import ISeriesImputer 21 | methods = method_names 22 | # pylint:disable=attribute-defined-outside-init 23 | # pylint:disable=no-member 24 | # pylint:disable=unused-variable 25 | 26 | class PMMImputer(ISeriesImputer): 27 | """Impute missing values using predictive mean matching. 28 | 29 | The PMMIMputer produces predictions using a combination of bayesian 30 | approach to least squares and least squares itself. For each missing value 31 | PMM finds the `n` closest neighbors from a least squares regression 32 | prediction set, and samples from the corresponding true values for y of 33 | each of those `n` predictions. The imputation is the resulting sample. 34 | To implement bayesian least squares, the imputer utlilizes the pymc 35 | library. The imputer can be used directly, but such behavior is 36 | discouraged. PmmImputer does not have the flexibility / robustness of 37 | dataframe imputers, nor is its behavior identical. Preferred use is 38 | MultipleImputer(strategy="pmm"). 39 | """ 40 | # class variables 41 | strategy = methods.PMM 42 | 43 | def __init__(self, **kwargs): 44 | """Create an instance of the PMMImputer class. 45 | 46 | The class requires multiple arguments necessary to create priors for 47 | a bayesian linear regression equation and least squares itself. 48 | Therefore, PMM arguments include all of those seen in bayesian least 49 | squares and least squares itself. New parameters include `neighbors`, 50 | or the number of neighbors that PMM uses to sample observed. 51 | 52 | Args: 53 | **kwargs: default keyword arguments for lm & bayesian analysis. 54 | Note - kwargs popped for default arguments defined below. 55 | Next set of kwargs popped and sent to linear regression. 56 | Rest of kwargs passed as params to sampling (see pymc). 57 | am (float, Optional): mean of alpha prior. Default 0. 58 | asd (float, Optional): std. deviation of alpha prior. Default 10. 59 | bm (float, Optional): mean of beta priors. Default 0. 60 | bsd (float, Optional): std. deviation of beta priors. Default 10. 61 | sig (float, Optional): parameter of sigma prior. Default 1. 62 | sample (int, Optional): number of posterior samples per chain. 63 | Default = 1000. More samples, longer to run, but better 64 | approximation of the posterior & chance of convergence. 65 | tune (int, Optional): parameter for tuning. Draws done in addition 66 | to sample. Default = 1000. 67 | init (str, Optional): MCMC algo to use for posterior sampling. 68 | Default = 'auto'. See pymc docs for more info on choices. 69 | fill_value (str, Optional): How to draw from the posterior to 70 | create imputations. Default is "random". 'random' and 'mean' 71 | supported for explicit options. 72 | neighbors (int, Optional): number of neighbors. Default is 5. 73 | Value should be greater than 0 and less than # observed, 74 | although anything greater than 10-20 generally too high 75 | unless dataset is massive. 76 | fit_intercept (bool, Optional): sklearn LinearRegression param. 77 | normalize (bool, Optional): sklearn LinearRegression param. 
78 | copy_x (bool, Optional): sklearn LinearRegression param. 79 | n_jobs (int, Optional): sklearn LinearRegression param. 80 | """ 81 | self.am = kwargs.pop("am", None) 82 | self.asd = kwargs.pop("asd", 10) 83 | self.bm = kwargs.pop("bm", None) 84 | self.bsd = kwargs.pop("bsd", 10) 85 | self.sig = kwargs.pop("sig", 1) 86 | self.sample = kwargs.pop("sample", 1000) 87 | self.tune = kwargs.pop("tune", 1000) 88 | self.init = kwargs.pop("init", "auto") 89 | self.fill_value = kwargs.pop("fill_value", "random") 90 | self.neighbors = kwargs.pop("neighbors", 5) 91 | self.fit_intercept = kwargs.pop("fit_intercept", True) 92 | self.copy_x = kwargs.pop("copy_x", True) 93 | self.n_jobs = kwargs.pop("n_jobs", None) 94 | self.lm = LinearRegression( 95 | fit_intercept=self.fit_intercept, 96 | copy_X=self.copy_x, 97 | n_jobs=self.n_jobs 98 | ) 99 | self.sample_kwargs = kwargs 100 | 101 | def fit(self, X, y): 102 | """Fit the Imputer to the dataset by fitting bayesian and LS model. 103 | 104 | Args: 105 | X (pd.DataFrame): dataset to fit the imputer. 106 | y (pd.Series): response, which is eventually imputed. 107 | 108 | Returns: 109 | self. Instance of the class. 110 | """ 111 | _not_num_series(self.strategy, y) 112 | nc = len(X.columns) 113 | 114 | # get predictions for the data, which will be used for "closest" vals 115 | y_pred = self.lm.fit(X, y).predict(X) 116 | y_df = DataFrame({"y": y, "y_pred": y_pred}) 117 | 118 | # calculate bayes and use appropriate means for alpha and beta priors 119 | # here we specify the point estimates from the linear regression as the 120 | # means for the priors. This will greatly speed up posterior sampling 121 | # and help ensure that convergence occurs 122 | if self.am is None: 123 | self.am = self.lm.intercept_ 124 | if self.bm is None: 125 | self.bm = self.lm.coef_ 126 | 127 | # initialize model for bayesian linear reg. Default vals for priors 128 | # assume data is scaled and centered. Convergence can struggle or fail 129 | # if that is not the case and proper values for the priors are not 130 | # specified separately. The model also assumes each beta is normal and 131 | # "independent"; while betas are likely not independent, this mirrors standard OLS assumptions 132 | with pm.Model() as fit_model: 133 | alpha = pm.Normal("alpha", self.am, self.asd) 134 | beta = pm.Normal("beta", self.bm, self.bsd, shape=nc) 135 | sigma = pm.HalfCauchy("σ", self.sig) 136 | mu = alpha+beta.dot(X.T) 137 | score = pm.Normal("score", mu, sigma, observed=y) 138 | params = {"model": fit_model, "y_obs": y_df} 139 | self.statistics_ = {"param": params, "strategy": self.strategy} 140 | return self 141 | 142 | def impute(self, X): 143 | """Generate imputations using predictions from the fit bayesian model. 144 | 145 | The impute method returns the values for imputation. Missing values 146 | in a given dataset are replaced with a random selection from the PMM 147 | process. Again, PMM imputes values actually observed in the data, and the observed 148 | values are selected by finding the closest least squares predictions 149 | to a given prediction from the bayesian model. 150 | 151 | Args: 152 | X (pd.DataFrame): predictors to determine imputed values. 153 | 154 | Returns: 155 | np.array: imputed dataset.
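Example:
    A minimal sketch of direct use (discouraged in practice) with a
    custom number of neighbors; ``X_obs`` and ``y_obs`` are hypothetical
    observed predictors and response, and ``X_mis`` holds the rows
    where the response is missing:

        imputer = PMMImputer(neighbors=10)
        imputed = imputer.fit(X_obs, y_obs).impute(X_mis)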
156 | """ 157 | # check if fitted then predict with least squares 158 | check_is_fitted(self, "statistics_") 159 | model = self.statistics_["param"]["model"] 160 | df = self.statistics_["param"]["y_obs"] 161 | df = df.reset_index(drop=True) 162 | 163 | # generate posterior distribution for alpha, beta coefficients 164 | with model: 165 | tr = pm.sample( 166 | self.sample, 167 | tune=self.tune, 168 | init=self.init, 169 | **self.sample_kwargs 170 | ) 171 | self.trace_ = tr 172 | 173 | # support for pymc - handling InferenceData obj instead of MultiTrace 174 | # we have to compress chains ourselves w/ InferenceData obj (xarray) 175 | post = tr.posterior 176 | alpha_, beta_ = post.alpha.values, post.beta.values 177 | chain, draws, beta_dim = beta_.shape 178 | beta_ = beta_.reshape(chain*draws, beta_dim) 179 | 180 | # sample random alpha from alpha posterior distribution 181 | alpha_bayes = np.random.choice(alpha_.ravel()) 182 | 183 | # get the mean and covariance of the multivariate betas 184 | # betas assumed multivariate normal by linear reg rules 185 | # sample beta w/ cov structure to create realistic variability 186 | beta_means, beta_cov = beta_.mean(0), np.cov(beta_.T) 187 | beta_bayes = np.array(multivariate_normal(beta_means, beta_cov).rvs()) 188 | 189 | # predictions for missing y, using bayes alpha + coeff samples 190 | # use these preds for nearest neighbor search from reg results 191 | # neighbors are nearest from prediction model fit on observed 192 | # imputed values are actual y vals corresponding to nearest neighbors 193 | # therefore, this is a form of "hot-deck" imputation 194 | y_pred_bayes = alpha_bayes + beta_bayes.dot(X.T) 195 | n_ = self.neighbors 196 | if X.columns.size == 1: 197 | y_pred_bayes = y_pred_bayes[0] 198 | if self.fill_value == "mean": 199 | imp = [_neighbors(x, n_, df, np.mean) for x in y_pred_bayes] 200 | elif self.fill_value == "random": 201 | choice = np.random.choice 202 | imp = [_neighbors(x, n_, df, choice) for x in y_pred_bayes] 203 | else: 204 | err = f"{self.fill_value} must be `mean` or `random`." 205 | raise ValueError(err) 206 | 207 | # finally, set last class values and return imputations 208 | self.y_pred = y_pred_bayes 209 | self.alphas = alpha_bayes 210 | self.betas = beta_bayes 211 | return imp 212 | 213 | def fit_impute(self, X, y): 214 | """Fit impute method to generate imputations where y is missing. 215 | 216 | Args: 217 | X (pd.Dataframe): predictors in the dataset. 218 | y (pd.Series): response w/ missing values to impute. 219 | 220 | Returns: 221 | np.array: imputed dataset. 222 | """ 223 | # transform occurs with records from X where y is missing 224 | miss_y_ix = y[y.isnull()].index 225 | return self.fit(X, y).impute(X.loc[miss_y_ix]) 226 | -------------------------------------------------------------------------------- /autoimpute/imputations/series/random.py: -------------------------------------------------------------------------------- 1 | """This module implements random imputation via the RandomImputer. 2 | 3 | The RandomImputer imputes missing data using a random draw with replacement 4 | from the observed data. Dataframe imputers utilize this class when its 5 | strategy is requested. Use SingleImputer or MultipleImputer with 6 | strategy = `random` to broadcast the strategy across all the columns in a 7 | dataframe, or specify this strategy for a given column. 
8 | """ 9 | 10 | import numpy as np 11 | from sklearn.utils.validation import check_is_fitted 12 | from autoimpute.imputations import method_names 13 | from .base import ISeriesImputer 14 | methods = method_names 15 | # pylint:disable=attribute-defined-outside-init 16 | # pylint:disable=unnecessary-pass 17 | 18 | class RandomImputer(ISeriesImputer): 19 | """Impute missing data using random draws from observed data. 20 | 21 | The RandomImputer samples with replacement from observed data. The imputer 22 | can be used directly, but such behavior is discouraged. RandomImputer does 23 | not have the flexibility / robustness of dataframe imputers, nor is its 24 | behavior identical. Preferred use is MultipleImputer(strategy="random"). 25 | """ 26 | # class variables 27 | strategy = methods.RANDOM 28 | 29 | def __init__(self): 30 | """Create an instance of the RandomImputer class.""" 31 | pass 32 | 33 | def fit(self, X, y=None): 34 | """Fit the Imputer to the dataset and get unique observed to sample. 35 | 36 | Args: 37 | X (pd.Series): Dataset to fit the imputer. 38 | y (None): ignored, None to meet requirements of base class 39 | 40 | Returns: 41 | self. Instance of the class. 42 | """ 43 | 44 | # determine set of observed values to sample from 45 | random = list(set(X[~X.isnull()])) 46 | self.statistics_ = {"param": random, "strategy": self.strategy} 47 | return self 48 | 49 | def impute(self, X): 50 | """Perform imputations using the statistics generated from fit. 51 | 52 | The transform method handles the actual imputation. Each missing value 53 | in a given dataset is replaced with a random draw from unique set of 54 | observed values determined during the fit stage. 55 | 56 | Args: 57 | X (pd.Series): Dataset to impute missing data from fit. 58 | 59 | Returns: 60 | np.array -- imputed dataset 61 | """ 62 | # check if fitted and identify location of missingness 63 | check_is_fitted(self, "statistics_") 64 | ind = X[X.isnull()].index 65 | 66 | # get the observed values and sample from them 67 | param = self.statistics_["param"] 68 | imp = np.random.choice(param, len(ind)) 69 | return imp 70 | 71 | def fit_impute(self, X, y=None): 72 | """Convenience method to perform fit and imputation in one go.""" 73 | return self.fit(X, y).impute(X) 74 | -------------------------------------------------------------------------------- /autoimpute/utils/__init__.py: -------------------------------------------------------------------------------- 1 | """Manage the utils lib from the autoimpute package. 2 | 3 | This module handles imports from the utils directory that should be accessible 4 | whenever someone imports autoimpute.utils. The imports include methods for 5 | checks & validations as well as functions to explore patterns in missing data. 6 | 7 | This module handles `from autoimpute.utils import *` with the __all__ variable 8 | below. This command imports the main public methods from autoimpute.utils. 
9 | """ 10 | 11 | from .checks import check_data_structure, check_missingness 12 | from .checks import check_nan_columns, check_strategy_allowed 13 | from .checks import check_strategy_fit, check_predictors_fit 14 | from .patterns import md_pairs, md_pattern, md_locations 15 | from .patterns import inbound, outbound, influx, outflux, flux 16 | from .patterns import proportions, nullility_cov, nullility_corr 17 | 18 | __all__ = [ 19 | "check_data_structure", 20 | "check_missingness", 21 | "check_nan_columns", 22 | "check_strategy_allowed", 23 | "check_strategy_fit", 24 | "check_predictors_fit", 25 | "md_pairs", 26 | "md_pattern", 27 | "md_locations", 28 | "inbound", 29 | "outbound", 30 | "influx", 31 | "outflux", 32 | "flux", 33 | "proportions", 34 | "nullility_cov", 35 | "nullility_corr" 36 | ] 37 | -------------------------------------------------------------------------------- /autoimpute/utils/dataframes.py: -------------------------------------------------------------------------------- 1 | """Sample DataFrames utilized for examples and tests.""" 2 | 3 | import numpy as np 4 | import pandas as pd 5 | from sklearn.datasets import make_regression 6 | # pylint:disable=missing-docstring 7 | 8 | # missingness lambdas 9 | eq_miss = lambda x: np.random.choice([x, np.nan], 1)[0] 10 | val_miss = lambda x: np.random.choice([x, np.nan], 1, p=[x/100, 1-x/100])[0] 11 | 12 | # strategies used for imputation 13 | num_strategies = ["mean", "median", "mode", "random", "norm", "interpolate",'normal unit variance'] 14 | cat_strategies = ["mode", "categorical"] 15 | time_strategies = ["interpolate", "locf", "nocb"] 16 | 17 | # Numerical DataFrame with different % missingness per column 18 | df_num = pd.DataFrame() 19 | df_num["A"] = np.random.choice(np.arange(90, 100), 1000) 20 | df_num["B"] = np.random.choice(np.arange(50, 100), 1000) 21 | df_num["C"] = np.random.choice(np.arange(1, 100), 1000) 22 | df_num["A"] = df_num["A"].apply(eq_miss) 23 | df_num["B"] = df_num["B"].apply(val_miss) 24 | 25 | # Numerical DataFrame with column names for missingness classifier 26 | df_mis_classifier = pd.DataFrame() 27 | df_mis_classifier["a"] = np.random.choice(np.arange(90, 100), 1000) 28 | df_mis_classifier["k"] = np.random.choice(np.arange(50, 100), 1000) 29 | df_mis_classifier["c"] = np.random.choice(np.arange(1, 100), 1000) 30 | df_mis_classifier["a"] = df_mis_classifier["a"].apply(eq_miss) 31 | df_mis_classifier["k"] = df_mis_classifier["k"].apply(val_miss) 32 | 33 | # Mixed DataFrame with different % missingness per column & some dependence 34 | df_mix = pd.DataFrame() 35 | df_mix["gender"] = np.random.choice(["Male", "Female", None], 500) 36 | df_mix["salary"] = np.random.choice(np.arange(20, 100), 500) 37 | df_mix["age"] = np.random.choice( 38 | [10, 20, 30, 40, 50, 60, 70], 500, 39 | p=[0.1, 0.1, 0.2, 0.2, 0.2, 0.1, 0.1] 40 | ) 41 | df_mix["amm"] = np.random.choice(np.arange(0, 100), 500) 42 | 43 | for each in df_mix.index: 44 | s = df_mix.loc[each, "salary"] 45 | if df_mix.loc[each, "age"] > 50: 46 | s = np.random.choice([s, np.nan], p=[0.3, 0.7]) 47 | elif df_mix.loc[each, "age"] < 30: 48 | s = np.random.choice([s, np.nan], p=[0.4, 0.6]) 49 | 50 | # DataFrame with numerical feature, `np.nan` column, `None` column 51 | df_col_miss = pd.DataFrame() 52 | df_col_miss["A"] = np.random.choice(np.arange(90, 100), 1000) 53 | df_col_miss["A"] = df_col_miss["A"].apply(eq_miss) 54 | df_col_miss["B"] = np.random.choice([np.nan], 1000) 55 | df_col_miss["C"] = np.random.choice([None], 1000) 56 | 57 | # DataFrame to test 
all missing 58 | df_all_miss = pd.DataFrame({ 59 | "A":[None, None], 60 | "B": [np.nan, np.nan] 61 | }) 62 | 63 | # DataFrame to test time series 64 | df_ts_num = pd.DataFrame({ 65 | "date": pd.to_datetime(['2018-01-04', '2018-01-05', '2018-01-06', 66 | '2018-01-07', '2018-01-08', '2018-01-09']), 67 | "date_tm1": pd.to_datetime(['2017-01-04', '2017-01-05', '2017-01-06', 68 | '2017-01-07', '2017-01-08', '2017-01-09']), 69 | "values": [np.nan, 20, np.nan, 30, 40, np.nan], 70 | "values_tm1": [np.nan, 15, np.nan, 25, 50, 70] 71 | }) 72 | 73 | # DataFrame to test default with added time column and cat column 74 | df_ts_mixed = pd.DataFrame({ 75 | "date": pd.to_datetime(['2018-01-04', '2018-01-05', '2018-01-06', 76 | '2018-01-07', '2018-01-08', '2018-01-09']), 77 | "values": [271238, 329285, np.nan, 260260, 263711, np.nan], 78 | "cats": ["red", None, "green", "green", "red", "green"] 79 | }) 80 | 81 | # bayesian regression testing 82 | sc = lambda x, d: (x-d.mean())/d.std() 83 | mis = lambda x: np.random.choice([x, np.nan], 1, p=[0.8, 0.2])[0] 84 | X, y = make_regression(n_samples=1000, n_features=3, noise=0.50) 85 | df_bayes_reg = pd.DataFrame({"x1": X[:, 0], "x2": X[:, 1], 86 | "x3": X[:, 2], "y": y}) 87 | df_bayes_reg.x1 = df_bayes_reg.x1.apply(lambda x: sc(x, df_bayes_reg.x1)) 88 | df_bayes_reg.x2 = df_bayes_reg.x2.apply(lambda x: sc(x, df_bayes_reg.x2)) 89 | df_bayes_reg.x3 = df_bayes_reg.x3.apply(lambda x: sc(x, df_bayes_reg.x3)) 90 | df_bayes_reg.y = df_bayes_reg.y.apply(mis) 91 | 92 | # bayesian logistic testing 93 | def trans_binary(c): 94 | m = df_bayes_log.y.mean() 95 | if not pd.isnull(c): 96 | if c > m: 97 | return "male" 98 | return "female" 99 | 100 | df_bayes_log = df_bayes_reg.copy() 101 | df_bayes_log.y = df_bayes_log.y.apply(trans_binary) 102 | 103 | # partial dependence test 104 | df_partial_dependence = pd.DataFrame( 105 | {'A':np.random.uniform(0,1,100), 106 | 'B':np.random.uniform(0,1,100)} 107 | ) 108 | df_partial_dependence.loc[df_partial_dependence['B'] < 0.25, 'B'] = np.nan 109 | df_partial_dependence['C'] = df_partial_dependence['B'] * 2 110 | df_partial_dependence.loc[df_partial_dependence['C'] < 0.7, 'C'] = np.nan 111 | -------------------------------------------------------------------------------- /autoimpute/utils/helpers.py: -------------------------------------------------------------------------------- 1 | """Helper functions used throughout other methods in autoimpute.utils.""" 2 | 3 | import warnings 4 | import numpy as np 5 | import pandas as pd 6 | 7 | def _sq_output(data, cols, square=False): 8 | """Private method to turn unlabeled data into a DataFrame.""" 9 | if not isinstance(data, pd.DataFrame): 10 | data = pd.DataFrame(data, columns=cols) 11 | if square: 12 | data.index = data.columns 13 | return data 14 | 15 | def _index_output(data, index): 16 | """Private method to transform data to DataFrame and set the index.""" 17 | if not isinstance(data, pd.DataFrame): 18 | data = pd.DataFrame(data, index=index) 19 | return data 20 | 21 | def _nan_col_dropper(data): 22 | """Private method to drop columns w/ all missing values from DataFrame.""" 23 | cb = set(data.columns.tolist()) 24 | data.dropna(axis=1, how='all', inplace=True) 25 | ca = set(data.columns.tolist()) 26 | cdiff = cb.difference(ca) 27 | if cdiff: 28 | wrn = f"{cdiff} dropped from DataFrame because all values missing."
29 | warnings.warn(wrn) 30 | return data, cdiff 31 | 32 | def _one_hot_encode(X, used_columns=None): 33 | """Private method to handle one hot encoding for categoricals.""" 34 | cats = X.select_dtypes(include=("object",)).columns.size 35 | if cats > 0: 36 | X_temp = pd.get_dummies(X, drop_first=True, dtype=float) 37 | if used_columns is None: 38 | used_columns = X_temp.columns 39 | if len(X_temp.columns) != len(used_columns): 40 | one_hot = pd.get_dummies(X, dtype=float) 41 | # if wasn't in `used_columns`, then it's the first category 42 | to_drop = set(one_hot.columns).difference(used_columns) 43 | one_hot.drop(to_drop, axis=1, inplace=True) 44 | # if wasn't in `one_hot`, there were no instances of this category 45 | to_add = set(used_columns).difference(one_hot.columns) 46 | X_temp = one_hot.assign(**{col:0 for col in to_add}) 47 | X_temp = X_temp.reindex(columns=used_columns, copy=False) 48 | X = X_temp 49 | return X 50 | -------------------------------------------------------------------------------- /autoimpute/visuals/__init__.py: -------------------------------------------------------------------------------- 1 | """Manage the visuals lib from the autoimpute package. 2 | 3 | This module handles imports from the visuals directory that should be 4 | accessible whenever someone imports autoimpute.visuals. The imports include 5 | methods for visual analysis of missing data, from exploration to analysis. 6 | 7 | This module handles `from autoimpute.visuals import *` with the __all__ 8 | variable below. This command imports the main public methods from 9 | autoimpute.visuals. 10 | """ 11 | 12 | from .utils import plot_md_locations, plot_md_percent 13 | from .utils import plot_nullility_corr, plot_nullility_dendogram 14 | from .imputations import plot_imp_scatter, plot_imp_dists, plot_imp_boxplots 15 | from .imputations import plot_imp_swarm, plot_imp_strip 16 | 17 | __all__ = [ 18 | "plot_md_locations", 19 | "plot_md_percent", 20 | "plot_nullility_corr", 21 | "plot_nullility_dendogram", 22 | "plot_imp_scatter", 23 | "plot_imp_dists", 24 | "plot_imp_boxplots", 25 | "plot_imp_swarm", 26 | "plot_imp_strip" 27 | ] 28 | -------------------------------------------------------------------------------- /autoimpute/visuals/helpers.py: -------------------------------------------------------------------------------- 1 | """Helper functions used throughout other methods in autoimpute.visuals.""" 2 | 3 | import pandas as pd 4 | import seaborn as sns 5 | from autoimpute.imputations import MultipleImputer 6 | 7 | #pylint:disable=unnecessary-lambda 8 | 9 | def _fully_complete(data): 10 | """Private method to exit plotting and raise error if no data missing.""" 11 | if not pd.isnull(data).sum().any(): 12 | err = "No data is missing in any column. Cannot generate plot." 13 | raise ValueError(err) 14 | 15 | def _default_plot_args(**kwargs): 16 | """Private method to set up the default plot style arguments.""" 17 | rc = {} 18 | rc["figure.figsize"] = kwargs.pop("figsize", (12, 8)) 19 | context = kwargs.pop("context", "talk") 20 | sns.set(context=context, rc=rc) 21 | 22 | def _validate_data(d, mi, imp_col=None): 23 | """Private helper method to validate data vs multiple imputations. 24 | 25 | Args: 26 | d (list): dataset returned from multiple imputation. 27 | mi (MultipleImputer): multiple imputer used to generate d. 28 | imp_col (str): column to plot. Should be a column with imputations. 29 | 30 | Raises: 31 | ValueError: d should be list of tuples returned from mi transform.
32 | ValueError: mi should be instance of MultipleImputer used to produce d. 33 | ValueError: mi should have imputed_ attribute after transformation. 34 | ValueError: Number of imputations should equal length of the dataset. 35 | ValueError: Columns in each imputed dataset should be the same. 36 | ValueError: Columns in each imputed dataset should be the same as mi.imputed_. 37 | ValueError: imp_col must be in both datasets and mi.imputed_ keys. 38 | """ 39 | 40 | if not isinstance(d, list): 41 | err = "d should be list of tuples returned from mi transform." 42 | raise ValueError(err) 43 | 44 | if not isinstance(mi, MultipleImputer): 45 | err = "mi should be instance of MultipleImputer used to produce d." 46 | raise ValueError(err) 47 | 48 | if not hasattr(mi, "imputed_"): 49 | err = "mi should have imputed_ attribute after transformation." 50 | raise ValueError(err) 51 | 52 | # names of values from expressions we need to test 53 | num_imps = len(d) 54 | num_mi = mi.n 55 | imp_cols = set(mi.imputed_.keys()) 56 | sets_ = [set(d[i][1].columns.tolist()) for i in range(len(d))] 57 | diff_d = any(list(map(lambda i: sets_[0].difference(i), sets_[1:]))) 58 | diff_m = len(sets_[0].difference(imp_cols)) 59 | 60 | if num_imps != num_mi: 61 | err = "Number of imputations should equal length of the dataset." 62 | raise ValueError(err) 63 | 64 | if diff_d: 65 | err = "Columns in each imputed dataset should be the same." 66 | raise ValueError(err) 67 | 68 | if diff_m > 0: 69 | err = "Columns w/in each imputed dataset should be the same as mi.imputed_" 70 | raise ValueError(err) 71 | 72 | if imp_col is not None: 73 | imp_in_d = imp_col in sets_[0] 74 | imp_in_mi = imp_col in imp_cols 75 | if not all([imp_in_d, imp_in_mi]): 76 | err = "imp_col must be in both datasets and mi.imputed_ keys." 77 | raise ValueError(err) 78 | 79 | def _get_observed(d, mi, imp_col): 80 | """Private helper method to get observed data after imputation.""" 81 | _validate_data(d, mi, imp_col) 82 | obs = set(d[0][1].index).difference(mi.imputed_[imp_col]) 83 | return list(obs) 84 | 85 | def _plot_imp_dists_helper(d, hist_imputed, imp_col, ax=None, l="Imputed"): 86 | """Private helper method to plot distribution of imputed data.""" 87 | for each in d: 88 | sns.distplot( 89 | each[1][imp_col], hist=hist_imputed, ax=ax, 90 | label=f"{imp_col} Imp {each[0]}" 91 | ).set(xlabel=l) 92 | 93 | def _validate_kwgs(kwgs): 94 | """Private helper method to validate kwargs arguments.""" 95 | if not isinstance(kwgs, (type(None), dict)): 96 | err = "kwgs must be None or a dictionary of keyword arguments" 97 | raise ValueError(err) 98 | 99 | def _melt_df(d, mi, imp_col): 100 | """Private helper method to melt dataframe vertically.""" 101 | datasets_added = [] 102 | for each in d: 103 | e = each[1].copy() 104 | e["imp_num"] = f"Imp {each[0]}" 105 | e["imputed"] = "no" 106 | imputed = mi.imputed_[imp_col] 107 | e.loc[imputed, "imputed"] = "yes" 108 | datasets_added.append(e) 109 | datasets_merged = pd.concat(datasets_added) 110 | return datasets_merged 111 | -------------------------------------------------------------------------------- /autoimpute/visuals/utils.py: -------------------------------------------------------------------------------- 1 | """Visualizations to explore missingness within a dataset. 2 | 3 | This module is a wrapper around the excellent missingno library, which 4 | provides a number of plots to explore missingness within a dataset.
This 5 | wrapper handles some basic plot style setting and error handling for the user 6 | that missingno handles differently. The reason we wrap missingno is to fine 7 | tune the package and apply it directly to autoimpute. 8 | """ 9 | 10 | import missingno as msno 11 | from autoimpute.utils import check_data_structure 12 | from .helpers import _fully_complete, _default_plot_args 13 | 14 | @check_data_structure 15 | def plot_md_locations(data, **kwargs): 16 | """Plot the locations where data is missing within a DataFrame. 17 | 18 | Args: 19 | data (pd.DataFrame): DataFrame to plot. 20 | **kwargs: Keyword arguments for plot. Passed to missingno.matrix. 21 | 22 | Returns: 23 | matplotlib.axes._subplots.AxesSubplot: missingness location plot. 24 | 25 | Raises: 26 | TypeError: if data is not a DataFrame. Error raised through decorator. 27 | """ 28 | _default_plot_args(**kwargs) 29 | msno.matrix(data, **kwargs) 30 | 31 | @check_data_structure 32 | def plot_md_percent(data, **kwargs): 33 | """Plot the percentage of missing data by column within a DataFrame. 34 | 35 | Args: 36 | data (pd.DataFrame): DataFrame to plot. 37 | **kwargs: Keyword arguments for plot. Passed to missingno.bar. 38 | 39 | Returns: 40 | matplotlib.axes._subplots.AxesSubplot: missingness percent plot. 41 | 42 | Raises: 43 | TypeError: if data is not a DataFrame. Error raised through decorator. 44 | """ 45 | _default_plot_args(**kwargs) 46 | msno.bar(data, **kwargs) 47 | 48 | @check_data_structure 49 | def plot_nullility_corr(data, **kwargs): 50 | """Plot the nullility correlation of missing data within a DataFrame. 51 | 52 | Args: 53 | data (pd.DataFrame): DataFrame to plot. 54 | **kwargs: Keyword arguments for plot. Passed to missingno.heatmap. 55 | 56 | Returns: 57 | matplotlib.axes._subplots.AxesSubplot: nullility correlation plot. 58 | 59 | Raises: 60 | TypeError: if data is not a DataFrame. Error raised through decorator. 61 | ValueError: dataset fully observed. Raised through helper method. 62 | """ 63 | _fully_complete(data) 64 | _default_plot_args(**kwargs) 65 | msno.heatmap(data, **kwargs) 66 | 67 | @check_data_structure 68 | def plot_nullility_dendogram(data, **kwargs): 69 | """Plot the nullility dendogram of missing data within a DataFrame. 70 | 71 | Args: 72 | data (pd.DataFrame): DataFrame to plot. 73 | **kwargs: Keyword arguments for plot. Passed to missingno.dendogram. 74 | 75 | Returns: 76 | matplotlib.axes._subplots.AxesSubplot: nullility dendogram plot. 77 | 78 | Raises: 79 | TypeError: if data is not a DataFrame. Error raised through decorator. 80 | ValueError: dataset fully observed. Raised through helper method. 81 | """ 82 | _fully_complete(data) 83 | _default_plot_args(**kwargs) 84 | msno.dendrogram(data, **kwargs) 85 | -------------------------------------------------------------------------------- /docs/Makefile: -------------------------------------------------------------------------------- 1 | # Minimal makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | SOURCEDIR = source 8 | BUILDDIR = build 9 | 10 | # Put it first so that "make" without argument is like "make help". 11 | help: 12 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 13 | 14 | .PHONY: help Makefile 15 | 16 | # Catch-all target: route all unknown targets to Sphinx using the new 17 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 
18 | %: Makefile 19 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) -------------------------------------------------------------------------------- /docs/source/conf.py: -------------------------------------------------------------------------------- 1 | # pylint:disable=missing-docstring 2 | # pylint:disable=redefined-builtin 3 | # pylint:disable=too-few-public-methods 4 | # -*- coding: utf-8 -*- 5 | # 6 | # Configuration file for the Sphinx documentation builder. 7 | # 8 | # This file does only contain a selection of the most common options. For a 9 | # full list see the documentation: 10 | # http://www.sphinx-doc.org/en/master/config 11 | 12 | # -- Path setup -------------------------------------------------------------- 13 | 14 | # If extensions (or modules to document with autodoc) are in another directory, 15 | # add these directories to sys.path here. If the directory is relative to the 16 | # documentation root, use os.path.abspath to make it absolute, like shown here. 17 | # 18 | import os 19 | import sys 20 | import mock 21 | sys.path.insert(0, os.path.abspath("../..")) 22 | 23 | # must mock modules that contain C code or autodoc won't install libs 24 | MOCK_MODULES = [ 25 | "numpy", 26 | "numpy.ma", 27 | "pandas", 28 | "pandas.api.types", 29 | "matplotlib", 30 | "matplotlib.pyplot", 31 | "matplotlib.pylab", 32 | "seaborn", 33 | "missingno", 34 | "scipy", 35 | "scipy.stats", 36 | "scipy.cluster", 37 | "sklearn", 38 | "sklearn.base", 39 | "sklearn.metrics", 40 | "sklearn.utils.validation", 41 | "sklearn.linear_model", 42 | "statsmodels", 43 | "statsmodels.api", 44 | "statsmodels.discrete", 45 | "statsmodels.discrete.discrete_model", 46 | "pymc", 47 | "xgboost" 48 | ] 49 | for mod_name in MOCK_MODULES: 50 | sys.modules[mod_name] = mock.Mock() 51 | 52 | # dealing with mock metaclass issue 53 | class BaseEstimatorClass: 54 | pass 55 | 56 | class TransformerMixinClass: 57 | pass 58 | 59 | class ClassifierMixinClass: 60 | pass 61 | 62 | setattr(sys.modules["sklearn.base"], "BaseEstimator", BaseEstimatorClass) 63 | setattr(sys.modules["sklearn.base"], "TransformerMixin", TransformerMixinClass) 64 | setattr(sys.modules["sklearn.base"], "ClassifierMixin", ClassifierMixinClass) 65 | setattr(sys.modules["numpy"], "__version__", "1.15.4") 66 | 67 | 68 | # -- Project information ----------------------------------------------------- 69 | 70 | project = "Autoimpute" 71 | copyright = "2022, Joseph Kearney, Shahid Barkat" 72 | author = "Joseph Kearney, Shahid Barkat" 73 | 74 | # The short X.Y version 75 | version = "0.13.0" 76 | # The full version, including alpha/beta/rc tags 77 | release = "" 78 | 79 | 80 | # -- General configuration --------------------------------------------------- 81 | 82 | # If your documentation needs a minimal Sphinx version, state it here. 83 | # 84 | needs_sphinx = '1.8' 85 | 86 | # Add any Sphinx extension module names here, as strings. They can be 87 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 88 | # ones. 89 | extensions = [ 90 | "sphinx.ext.autodoc", 91 | "sphinx.ext.intersphinx", 92 | "sphinx.ext.todo", 93 | "sphinx.ext.mathjax", 94 | "sphinx.ext.ifconfig", 95 | "sphinx.ext.viewcode", 96 | "sphinx.ext.githubpages", 97 | "sphinx.ext.napoleon", 98 | "m2r" 99 | ] 100 | 101 | # Add any paths that contain templates here, relative to this directory. 102 | templates_path = ["_templates"] 103 | 104 | # The suffix(es) of source filenames. 
105 | # You can specify multiple suffixes as a list of strings: 106 | # 107 | source_suffix = [".rst", ".md"] 108 | #source_suffix = '.rst' 109 | 110 | # The master toctree document. 111 | master_doc = "index" 112 | 113 | # The language for content autogenerated by Sphinx. Refer to documentation 114 | # for a list of supported languages. 115 | # 116 | # This is also used if you do content translation via gettext catalogs. 117 | # Usually you set "language" from the command line for these cases. 118 | language = None 119 | 120 | # List of patterns, relative to source directory, that match files and 121 | # directories to ignore when looking for source files. 122 | # This pattern also affects html_static_path and html_extra_path. 123 | exclude_patterns = [] 124 | 125 | # The name of the Pygments (syntax highlighting) style to use. 126 | pygments_style = None 127 | 128 | 129 | # -- Options for HTML output ------------------------------------------------- 130 | 131 | # The theme to use for HTML and HTML Help pages. See the documentation for 132 | # a list of builtin themes. 133 | # 134 | html_theme = "sphinx_rtd_theme" 135 | 136 | # Theme options are theme-specific and customize the look and feel of a theme 137 | # further. For a list of options available for each theme, see the 138 | # documentation. 139 | # 140 | # html_theme_options = {} 141 | 142 | # Add any paths that contain custom static files (such as style sheets) here, 143 | # relative to this directory. They are copied after the builtin static files, 144 | # so a file named "default.css" will overwrite the builtin "default.css". 145 | html_static_path = ["_static"] 146 | 147 | # Custom sidebar templates, must be a dictionary that maps document names 148 | # to template names. 149 | # 150 | # The default sidebars (for documents that don't match any pattern) are 151 | # defined by theme itself. Builtin themes are using these templates by 152 | # default: ``['localtoc.html', 'relations.html', 'sourcelink.html', 153 | # 'searchbox.html']``. 154 | # 155 | # html_sidebars = {} 156 | 157 | 158 | # -- Options for HTMLHelp output --------------------------------------------- 159 | 160 | # Output file base name for HTML help builder. 161 | htmlhelp_basename = "Autoimputedoc" 162 | 163 | 164 | # -- Options for LaTeX output ------------------------------------------------ 165 | 166 | latex_elements = { 167 | # The paper size ('letterpaper' or 'a4paper'). 168 | # 169 | # 'papersize': 'letterpaper', 170 | 171 | # The font size ('10pt', '11pt' or '12pt'). 172 | # 173 | # 'pointsize': '10pt', 174 | 175 | # Additional stuff for the LaTeX preamble. 176 | # 177 | # 'preamble': '', 178 | 179 | # Latex figure (float) alignment 180 | # 181 | # 'figure_align': 'htbp', 182 | } 183 | 184 | # Grouping the document tree into LaTeX files. List of tuples 185 | # (source start file, target name, title, 186 | # author, documentclass [howto, manual, or own class]). 187 | latex_documents = [ 188 | (master_doc, "Autoimpute.tex", "Autoimpute Documentation", 189 | "Joseph Kearney, Shahid Barkat", "manual"), 190 | ] 191 | 192 | 193 | # -- Options for manual page output ------------------------------------------ 194 | 195 | # One entry per manual page. List of tuples 196 | # (source start file, name, description, authors, manual section).
197 | man_pages = [ 198 | (master_doc, "autoimpute", "Autoimpute Documentation", 199 | [author], 1) 200 | ] 201 | 202 | 203 | # -- Options for Texinfo output ---------------------------------------------- 204 | 205 | # Grouping the document tree into Texinfo files. List of tuples 206 | # (source start file, target name, title, author, 207 | # dir menu entry, description, category) 208 | texinfo_documents = [ 209 | (master_doc, "Autoimpute", "Autoimpute Documentation", 210 | author, "Autoimpute", "One line description of project.", 211 | "Miscellaneous"), 212 | ] 213 | 214 | 215 | # -- Options for Epub output ------------------------------------------------- 216 | 217 | # Bibliographic Dublin Core info. 218 | epub_title = project 219 | 220 | # The unique identifier of the text. This can be a ISBN number 221 | # or the project homepage. 222 | # 223 | # epub_identifier = '' 224 | 225 | # A unique identification for the text. 226 | # 227 | # epub_uid = '' 228 | 229 | # A list of files that should not be packed into the epub file. 230 | epub_exclude_files = ["search.html"] 231 | 232 | 233 | # -- Extension configuration ------------------------------------------------- 234 | 235 | # -- Options for intersphinx extension --------------------------------------- 236 | 237 | # Example configuration for intersphinx: refer to the Python standard library. 238 | intersphinx_mapping = {"https://docs.python.org/": None} 239 | 240 | # -- Options for todo extension ---------------------------------------------- 241 | 242 | # If true, `todo` and `todoList` produce output, else they produce nothing. 243 | todo_include_todos = True 244 | -------------------------------------------------------------------------------- /docs/source/index.rst: -------------------------------------------------------------------------------- 1 | Welcome to Autoimpute! 2 | ====================== 3 | 4 | .. image:: https://badge.fury.io/py/autoimpute.svg 5 | :target: https://badge.fury.io/py/autoimpute 6 | :alt: PyPI version 7 | 8 | .. image:: https://github.com/kearnz/autoimpute/actions/workflows/build.yml/badge.svg 9 | :target: https://github.com/kearnz/autoimpute/actions 10 | :alt: Build Status 11 | 12 | .. image:: https://readthedocs.org/projects/autoimpute/badge/?version=latest 13 | :target: https://autoimpute.readthedocs.io/en/latest/?badge=latest 14 | :alt: Documentation Status 15 | 16 | .. image:: https://img.shields.io/badge/License-MIT-blue.svg 17 | :target: https://lbesson.mit-license.org/ 18 | :alt: MIT license 19 | 20 | .. image:: https://img.shields.io/badge/python-3.8+-blue.svg 21 | :target: https://www.python.org/downloads/release/python-380/ 22 | :alt: Python 3.8+ 23 | 24 | Autoimpute_ is a Python package for analysis and implementation of Imputation Methods. 25 | 26 | .. _Autoimpute: https://pypi.org/project/autoimpute/ 27 | 28 | This page introduces users to the package and documents its features. 29 | 30 | Check out the package on github_, or head to our website_ to walk through some tutorials. 31 | 32 | .. _github: https://github.com/kearnz/autoimpute 33 | .. _website: https://kearnz.github.io/autoimpute-tutorials/ 34 | 35 | .. toctree:: 36 | :caption: User Guide 37 | :titlesonly: 38 | 39 | Getting Started 40 | Utility Methods 41 | Deletion and Imputation Strategies 42 | Single, Multiple, and Mice Imputers 43 | Missingness Classifier 44 | Analysis Models 45 | Visualization Methods 46 | 47 | We are actively developing `Autoimpute`, so sometimes the docs fall behind. 
The `README` is always up to date, as is the master branch. Therefore, consult those first. If you find discrepancies between the docs and the package, please still let us know! 48 | -------------------------------------------------------------------------------- /docs/source/user_guide/analysis.rst: -------------------------------------------------------------------------------- 1 | Analysis Models 2 | =============== 3 | 4 | This section documents analysis models within ``Autoimpute`` and their respective diagnostics. 5 | 6 | The ``MiLinearRegression`` and ``MiLogisticRegression`` extend linear and logistic regression to multiply imputed datasets. Under the hood, each regression class uses a ``MiceImputer`` to handle missing data prior to supervised analysis. Users of each regression class can tweak the underlying ``MiceImputer`` through the ``mi_kwgs`` argument or pass a pre-configured instance to the ``mi`` argument (recommended). 7 | 8 | Users can also specify whether the classes should use ``sklearn`` or ``statsmodels`` to implement linear or logistic regression. The default is ``statsmodels``, which gives end users more detailed parameter diagnostics for regression on multiply imputed data. 9 | 10 | Finally, this section provides diagnostic helper methods to assess bias of parameters from a regression model. 11 | 12 | Linear Regression for Multiply Imputed Data 13 | ------------------------------------------- 14 | 15 | .. autoclass:: autoimpute.analysis.MiLinearRegression 16 | :special-members: 17 | :members: 18 | 19 | 20 | Logistic Regression for Multiply Imputed Data 21 | --------------------------------------------- 22 | 23 | .. autoclass:: autoimpute.analysis.MiLogisticRegression 24 | :special-members: 25 | :members: 26 | 27 | 28 | Diagnostics 29 | ----------- 30 | 31 | .. autofunction:: autoimpute.analysis.raw_bias 32 | 33 | .. autofunction:: autoimpute.analysis.percent_bias 34 | -------------------------------------------------------------------------------- /docs/source/user_guide/getting_started.rst: -------------------------------------------------------------------------------- 1 | Getting Started 2 | =============== 3 | 4 | The sections below provide a high level overview of the ``Autoimpute`` package. This page takes you through installation, dependencies, main features, imputation methods supported, and basic usage of the package. It also provides links to get in touch with the authors, review our license, and learn how to contribute. 5 | 6 | Installation 7 | ------------ 8 | 9 | 10 | * Download ``Autoimpute`` with ``pip install autoimpute``. 11 | * If ``pip`` cached an older version, try ``pip install --no-cache-dir --upgrade autoimpute``. 12 | * If you want to work with the development branch, use the script below: 13 | 14 | *Development* 15 | 16 | .. code-block:: sh 17 | 18 | git clone -b dev --single-branch https://github.com/kearnz/autoimpute.git 19 | cd autoimpute 20 | python setup.py install 21 | 22 | Versions and Dependencies 23 | ------------------------- 24 | 25 | 26 | * Python 3.8+ 27 | * Dependencies: 28 | 29 | * ``numpy`` 30 | * ``scipy`` 31 | * ``pandas`` 32 | * ``statsmodels`` 33 | * ``scikit-learn`` 34 | * ``xgboost`` 35 | * ``pymc`` 36 | * ``seaborn`` 37 | * ``missingno`` 38 | 39 | *A note for Windows Users*\ : 40 | 41 | * Autoimpute 0.13.0+ has not been tested on Windows, so we cannot verify support for pymc there. Historically we've had some issues with pymc on Windows. 42 | * Autoimpute works on Windows but users may have trouble with pymc for bayesian methods. `(See discourse) `_ 43 | * Users may receive a runtime error ``‘can’t pickle fortran objects’`` when sampling using multiple chains. 44 | * There are a couple of things to do to try to overcome this error: 45 | 46 | * Reinstall theano and pymc. Make sure to delete the .theano cache in your home folder. 47 | * Upgrade joblib in the process, which is responsible for generating the error (pymc uses joblib under the hood). 48 | * Set ``cores=1`` in ``pm.sample``. This should be a last resort, as it means posterior sampling will use 1 core only (see the sketch after this list). Not using multiprocessing will slow down bayesian imputation methods significantly. 49 | 50 | * Reach out and let us know if you've worked through this issue successfully on Windows and have a better solution! 51 | 
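For example, here is a minimal sketch of the single-core workaround (the model itself is a stand-in for whatever pymc model gets built during imputation; it is not code from ``Autoimpute``):

.. code-block:: python

    import pymc as pm

    with pm.Model():
        mu = pm.Normal("mu", mu=0, sigma=1)
        # single-core sampling sidesteps the pickling error, at the cost of speed
        trace = pm.sample(1000, chains=2, cores=1)
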
52 | Main Features 53 | ------------- 54 | 55 | 56 | * Utility functions and basic visualizations to explore missingness patterns 57 | * Missingness classifier and automatic missing data test set generator 58 | * Native handling for categorical variables (as predictors and targets of imputation) 59 | * Single and multiple imputation classes for ``pandas`` ``DataFrames`` 60 | * Analysis methods and pooled parameter inference using multiply imputed datasets 61 | * Numerous imputation methods, as specified in the table below: 62 | 63 | Imputation Methods Supported 64 | ---------------------------- 65 | 66 | .. list-table:: 67 | :header-rows: 1 68 | 69 | * - Univariate 70 | - Multivariate 71 | - Time Series / Interpolation 72 | * - Mean 73 | - Linear Regression 74 | - Linear 75 | * - Median 76 | - Binomial Logistic Regression 77 | - Quadratic 78 | * - Mode 79 | - Multinomial Logistic Regression 80 | - Cubic 81 | * - Random 82 | - Stochastic Regression 83 | - Polynomial 84 | * - Norm 85 | - Bayesian Linear Regression 86 | - Spline 87 | * - Categorical 88 | - Bayesian Binary Logistic Regression 89 | - Time-weighted 90 | * - 91 | - Predictive Mean Matching 92 | - Next Obs Carried Backward 93 | * - 94 | - Local Residual Draws 95 | - Last Obs Carried Forward 96 | 97 | Example Usage 98 | ------------- 99 | 100 | Autoimpute is designed to be user friendly and flexible. When performing imputation, Autoimpute fits directly into ``scikit-learn`` machine learning projects. Imputers inherit from sklearn's ``BaseEstimator`` and ``TransformerMixin`` and implement ``fit`` and ``transform`` methods, making them valid Transformers in an sklearn pipeline. A short pipeline sketch appears below. 101 | 102 | Right now, there are three ``Imputer`` classes we'll work with: 103 | 104 | .. code-block:: python 105 | 106 | from autoimpute.imputations import SingleImputer, MultipleImputer, MiceImputer 107 | si = SingleImputer() # pass through data once 108 | mi = MultipleImputer() # pass through data multiple times 109 | mice = MiceImputer() # pass through data multiple times and iteratively optimize imputations in each column 110 | 111 | Imputations can be as simple as: 112 | 113 | .. code-block:: python 114 | 115 | # simple example using default instance of MiceImputer 116 | imp = MiceImputer() 117 | 118 | # fit transform returns a generator by default, calculating each imputation method lazily 119 | imp.fit_transform(data) 120 | 
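Because imputers are valid transformers, they also drop straight into an sklearn ``Pipeline``. A minimal sketch (the training data and the downstream estimator here are illustrative placeholders, not part of the package):

.. code-block:: python

    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import Pipeline
    from autoimpute.imputations import SingleImputer

    # impute missing values first, then fit a downstream model on the completed data
    pipe = Pipeline([
        ("imputer", SingleImputer(strategy="pmm")),
        ("model", LinearRegression()),
    ])
    pipe.fit(X_train, y_train)  # X_train: a pandas DataFrame with missing values
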
121 | Imputer configuration can also be quite complex, such as: 122 | 123 | .. code-block:: python 124 | 125 | # create a complex instance of the MiceImputer 126 | # Here, we specify strategies by column and predictors for each column 127 | # We also specify what additional arguments any `pmm` strategies should take 128 | imp = MiceImputer( 129 | n=10, 130 | strategy={"salary": "pmm", "gender": "bayesian binary logistic", "age": "norm"}, 131 | predictors={"salary": "all", "gender": ["salary", "education", "weight"]}, 132 | imp_kwgs={"pmm": {"fill_value": "random"}}, 133 | visit="left-to-right", 134 | return_list=True 135 | ) 136 | 137 | # Because we set return_list=True, imputations are done all at once, not evaluated lazily. 138 | # This returns M imputed datasets, where M is the number of imputations; each has the same shape as the original dataframe. 139 | imp.fit_transform(data) 140 | 141 | Autoimpute also extends supervised machine learning methods from ``scikit-learn`` and ``statsmodels`` to apply them to multiply imputed datasets (using the ``MiceImputer`` under the hood). Right now, Autoimpute supports linear regression and binary logistic regression. Additional supervised methods are currently under development. 142 | 143 | As with Imputers, Autoimpute's analysis methods can be simple or complex: 144 | 145 | .. code-block:: python 146 | 147 | from autoimpute.analysis import MiLinearRegression from sklearn.preprocessing import StandardScaler # needed for the scaler argument below 148 | 149 | # By default, use statsmodels OLS and MiceImputer() 150 | simple_lm = MiLinearRegression() 151 | 152 | # fit the model on each multiply imputed dataset and pool parameters 153 | simple_lm.fit(X_train, y_train) 154 | 155 | # get summary of fit, which includes pooled parameters under Rubin's rules 156 | # also provides diagnostics related to analysis after multiple imputation 157 | simple_lm.summary() 158 | 159 | # make predictions on a new dataset using pooled parameters 160 | predictions = simple_lm.predict(X_test) 161 | 162 | # Control both the regression used and the MiceImputer itself 163 | multiple_imputer_arguments = dict( 164 | n=3, 165 | strategy={"salary": "pmm", "gender": "bayesian binary logistic", "age": "norm"}, 166 | predictors={"salary": "all", "gender": ["salary", "education", "weight"]}, 167 | imp_kwgs={"pmm": {"fill_value": "random"}}, 168 | scaler=StandardScaler(), 169 | visit="left-to-right", 170 | verbose=True 171 | ) 172 | complex_lm = MiLinearRegression( 173 | model_lib="sklearn", # use sklearn linear regression 174 | mi_kwgs=multiple_imputer_arguments # control the multiple imputer 175 | ) 176 | 177 | # fit the model on each multiply imputed dataset 178 | complex_lm.fit(X_train, y_train) 179 | 180 | # get summary of fit, which includes pooled parameters under Rubin's rules 181 | # also provides diagnostics related to analysis after multiple imputation 182 | complex_lm.summary() 183 | 184 | # make predictions on new dataset using pooled parameters 185 | predictions = complex_lm.predict(X_test) 186 | 
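For reference, the pooled parameters that ``summary()`` reports follow Rubin's rules. A standard statement of the rules (included for orientation; not taken verbatim from the package source): given :math:`m` imputed datasets with point estimates :math:`\hat{Q}_i` and variances :math:`U_i`,

.. math::

    \bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad
    B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{Q}_i - \bar{Q}\right)^2, \qquad
    T = \bar{U} + \left(1 + \frac{1}{m}\right)B,

where :math:`\bar{U}` is the average within-imputation variance and :math:`T` is the total variance attached to the pooled estimate :math:`\bar{Q}`.
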
187 | Note that we can also pass a pre-specified ``MiceImputer`` to either analysis model instead of using ``mi_kwgs``. The option is ours, and it's a matter of preference. If we pass a pre-specified ``MiceImputer``\ , anything in ``mi_kwgs`` is ignored, although the ``mi_kwgs`` argument is still validated. 188 | 189 | .. code-block:: python 190 | 191 | from autoimpute.imputations import MiceImputer 192 | from autoimpute.analysis import MiLinearRegression 193 | 194 | # create a multiple imputer first 195 | custom_imputer = MiceImputer(n=3, strategy="pmm", return_list=True) 196 | 197 | # pass the imputer to a linear regression model 198 | complex_lm = MiLinearRegression(mi=custom_imputer, model_lib="statsmodels") 199 | 200 | # proceed the same as the previous examples 201 | complex_lm.fit(X_train, y_train).predict(X_test) 202 | complex_lm.summary() 203 | 204 | For a deeper understanding of how the package works and its features, see our `tutorials website `_. 205 | 206 | Creators and Maintainers 207 | ------------------------ 208 | 209 | 210 | * Joseph Kearney – `@kearnz `_ 211 | * Shahid Barkat - `@shabarka `_ 212 | 213 | See the `Authors `_ page to get in touch! 214 | 215 | License 216 | ------- 217 | 218 | Distributed under the MIT license. See `LICENSE `_ for more information. 219 | 220 | Contributing 221 | ------------ 222 | 223 | Guidelines for contributing to our project. See `CONTRIBUTING `_ for more information. 224 | 225 | Contributor Code of Conduct 226 | --------------------------- 227 | 228 | Adapted from the Contributor Covenant, version 1.0.0. See `Code of Conduct `_ for more information. 229 | -------------------------------------------------------------------------------- /docs/source/user_guide/imputers.rst: -------------------------------------------------------------------------------- 1 | DataFrame Imputers 2 | ================== 3 | 4 | This section documents the DataFrame Imputers within ``Autoimpute``. 5 | 6 | DataFrame Imputers are the primary feature of the package. The ``SingleImputer`` imputes each column within a DataFrame one time, while the ``MultipleImputer`` imputes each column within a DataFrame multiple times using independent runs. Under the hood, the ``MultipleImputer`` actually creates separate instances of the ``SingleImputer`` to handle each run. The ``MiceImputer`` takes the ``MultipleImputer`` one step further, iteratively improving imputations in each column ``k`` times for each of the ``n`` runs the 7 | ``MultipleImputer`` performs. A short usage sketch appears at the end of this page. 8 | 9 | The base class for all imputers is the ``BaseImputer``. While you should not use the ``BaseImputer`` directly unless you're creating your own imputer class, you should understand what it provides the other imputers. The ``BaseImputer`` also contains the strategy "key-value store", or the methods that ``autoimpute`` currently supports. 10 | 11 | Base Imputer 12 | ------------ 13 | 14 | .. autoclass:: autoimpute.imputations.BaseImputer 15 | :special-members: 16 | :members: 17 | 18 | Single Imputer 19 | -------------- 20 | 21 | .. autoclass:: autoimpute.imputations.SingleImputer 22 | :special-members: 23 | :members: 24 | 25 | Multiple Imputer 26 | ---------------- 27 | 28 | .. autoclass:: autoimpute.imputations.MultipleImputer 29 | :special-members: 30 | :members: 31 | 32 | Mice Imputer 33 | ------------- 34 | 35 | .. autoclass:: autoimpute.imputations.MiceImputer 36 | :special-members: 37 | :members: 38 | 
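To tie the classes together, here is a minimal ``MiceImputer`` sketch (``df`` stands in for any ``pandas`` DataFrame with missing values, and the argument values are illustrative; ``n`` and ``k`` are the run and iteration counts described above):

.. code-block:: python

    from autoimpute.imputations import MiceImputer

    # n independent runs; each column iteratively refined k times per run
    mice = MiceImputer(n=5, k=3)

    # fit_transform returns a generator by default; materialize it to get all runs
    imputations = list(mice.fit_transform(df))
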
-------------------------------------------------------------------------------- /docs/source/user_guide/missingness.rst: -------------------------------------------------------------------------------- 1 | Missingness Classifier 2 | ====================== 3 | 4 | .. automodule:: autoimpute.imputations.mis_classifier 5 | :special-members: 6 | :members: 7 | -------------------------------------------------------------------------------- /docs/source/user_guide/strategies.rst: -------------------------------------------------------------------------------- 1 | Deletion and Imputation Strategies 2 | ================================== 3 | 4 | This section documents deletion and imputation strategies within ``Autoimpute``. 5 | 6 | Deletion is implemented through a single function, ``listwise_delete``, documented below. 7 | 8 | Imputation strategies are implemented as classes. The authors of this package refer to these classes as "series-imputers". Each series-imputer maps to an imputation method - either univariate or multivariate - that imputes missing values within a pandas Series. The imputation methods are the workhorses of the DataFrame Imputers, the ``SingleImputer``, ``MultipleImputer``, and ``MiceImputer``. Refer to the :doc:`imputers documentation` for more information on the DataFrame Imputers. 9 | 10 | For more information regarding the relationship between DataFrame Imputers and series-imputers, refer to the following tutorial_. The tutorial covers series-imputers in detail as well as the design patterns behind ``AutoImpute`` Imputers. 11 | 12 | .. _tutorial: https://kearnz.github.io/autoimpute-tutorials/imputer-mechanics-II.html 13 | 14 | Deletion Methods 15 | ---------------- 16 | 17 | .. autofunction:: autoimpute.imputations.listwise_delete 18 | 19 | 20 | Imputation Strategies 21 | --------------------- 22 | 23 | .. automodule:: autoimpute.imputations.series 24 | :special-members: 25 | :members: 26 | -------------------------------------------------------------------------------- /docs/source/user_guide/utils.rst: -------------------------------------------------------------------------------- 1 | Utility Methods 2 | =============== 3 | 4 | .. automodule:: autoimpute.utils.patterns 5 | :members: 6 | -------------------------------------------------------------------------------- /docs/source/user_guide/visuals.rst: -------------------------------------------------------------------------------- 1 | Visualization Methods 2 | ===================== 3 | 4 | This section documents visualization methods within ``Autoimpute``. 5 | 6 | Visualization methods support all functionality within ``Autoimpute``, from missing data exploration to imputation analysis. The documentation below breaks down each visualization method and groups them into their respective categories. The categories represent other modules within ``Autoimpute``. 7 | 8 | NOTE: The visualization module is currently under development. While the functions outlined below are stable in ``0.12.x``, they might change thereafter. 9 | 10 | Utility 11 | ------- 12 | 13 | ``Autoimpute`` comes with a number of :doc:`utility methods` to examine missing data before imputation takes place. This package supports these methods with a number of visualization techniques to explore patterns within missing data. The primary techniques wrap the excellent `missingno package `_. ``Autoimpute`` simply leverages ``missingno`` to make its offerings familiar within this package's API design. The methods appear below, followed by a short usage sketch: 14 | 15 | .. autofunction:: autoimpute.visuals.plot_md_locations 16 | 17 | .. autofunction:: autoimpute.visuals.plot_md_percent 18 | 19 | .. autofunction:: autoimpute.visuals.plot_nullility_corr 20 | 21 | .. autofunction:: autoimpute.visuals.plot_nullility_dendogram 22 | 
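A minimal usage sketch (assuming, per the signatures above, that each function takes a ``pandas`` DataFrame; the CSV file is a hypothetical dataset with missing values):

.. code-block:: python

    import pandas as pd
    from autoimpute.visuals import plot_md_locations, plot_md_percent

    df = pd.read_csv("employees.csv")  # hypothetical dataset with missing values
    plot_md_locations(df)  # matrix view of observed vs. missing cells
    plot_md_percent(df)    # percent of values missing in each column
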
23 | Imputation 24 | ---------- 25 | 26 | Two main classes within ``Autoimpute`` are the :doc:`SingleImputer and MultipleImputer`. The visualization module within this package contains a number of techniques to visually assess the quality and performance of these imputers. The important methods appear below: 27 | 28 | .. autofunction:: autoimpute.visuals.helpers._validate_data 29 | 30 | .. autofunction:: autoimpute.visuals.plot_imp_scatter 31 | 32 | .. autofunction:: autoimpute.visuals.plot_imp_dists 33 | 34 | .. autofunction:: autoimpute.visuals.plot_imp_boxplots 35 | 36 | .. autofunction:: autoimpute.visuals.plot_imp_swarm 37 | 38 | .. autofunction:: autoimpute.visuals.plot_imp_strip 39 | -------------------------------------------------------------------------------- /pytest.ini: -------------------------------------------------------------------------------- 1 | [pytest] 2 | ;command-line options by default 3 | addopts = -v 4 | ;designate tests as prefix only 5 | python_functions = test_* 6 | ;filter deprecation warnings for collections in 3.8, import stuff, and logistic reg UserWarning 7 | filterwarnings = 8 | ignore:.*importing the ABCs*:DeprecationWarning 9 | ignore:Importing from numpy.testing.nosetester*:DeprecationWarning 10 | ignore:the imp module*:DeprecationWarning 11 | ignore:Multiple categories (c)*:UserWarning 12 | ignore::sklearn.exceptions.ConvergenceWarning -------------------------------------------------------------------------------- /requirements.readthedocs.txt: -------------------------------------------------------------------------------- 1 | mock 2 | m2r -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy>=1.22.0 2 | scipy>=1.6.0 3 | pandas>=1.3.0 4 | statsmodels>=0.12.0 5 | xgboost>=0.81 6 | scikit-learn>=1.0.0 7 | arviz>=0.11.0 8 | pymc>=4.1.0 9 | seaborn>=0.9.0 10 | missingno>=0.4.1 11 | pytest 12 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [metadata] 2 | description-file = README.md 3 | license-file = LICENSE 4 | 5 | [pylint] 6 | exclude = 7 | docs/* 8 | tests/* 9 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | """Setup autoimpute package.""" 2 | 3 | import io 4 | import os 5 | from setuptools import find_packages, setup 6 | # pylint:disable=exec-used 7 | 8 | # Package meta-data 9 | NAME = "autoimpute" 10 | DESCRIPTION = "Imputation Methods in Python" 11 | URL = "https://github.com/kearnz/autoimpute" 12 | EMAIL = "josephkearney14@gmail.com, shahidbarkat@gmail.com" 13 | AUTHOR = "Joseph Kearney, Shahid Barkat" 14 | REQUIRES_PYTHON = ">=3.8.0" 15 | INCLUDE_PACKAGE_DATA = True 16 | LICENSE = "MIT" 17 | VERSION = None 18 | REQUIRED = [ 19 | "numpy", 20 | "scipy", 21 | "pandas", 22 | "statsmodels", 23 | "xgboost", 24 | "scikit-learn", 25 | "pymc", 26 | "seaborn", 27 | "missingno" 28 | ] 29 | CLASSIFIERS = [ 30 | "Development Status :: 3 - Alpha", 31 | "Intended Audience :: Science/Research", 32 | "Intended Audience :: Developers", 33 | "License :: OSI Approved :: MIT License", 34 | "Programming Language :: Python", 35 | "Programming Language :: Python :: 3", 36 | "Programming Language :: Python :: 
3.8", 37 | "Programming Language :: Python :: 3.9", 38 | "Operating System :: OS Independent", 39 | "Topic :: Software Development", 40 | "Topic :: Scientific/Engineering" 41 | ] 42 | EXTRAS = {} 43 | 44 | here = os.path.abspath(os.path.dirname(__file__)) 45 | 46 | # Import the README and use it as the long description. 47 | try: 48 | with io.open(os.path.join(here, "README.md"), encoding="utf-8") as f: 49 | long_description = f.read() 50 | except FileNotFoundError: 51 | long_description = DESCRIPTION 52 | 53 | # Load the package's __version__.py module as a dictionary. 54 | about = {} 55 | if not VERSION: 56 | with open(os.path.join(here, NAME, "__version__.py")) as f: 57 | exec(f.read(), about) 58 | else: 59 | about["__version__"] = VERSION 60 | 61 | # Setup specification 62 | setup( 63 | name=NAME, 64 | version=about["__version__"], 65 | description=DESCRIPTION, 66 | long_description=long_description, 67 | long_description_content_type="text/markdown", 68 | author=AUTHOR, 69 | author_email=EMAIL, 70 | python_requires=REQUIRES_PYTHON, 71 | url=URL, 72 | packages=find_packages(exclude=("tests",)), 73 | install_requires=REQUIRED, 74 | extras_require=EXTRAS, 75 | include_package_data=INLCUDE_PACKAGE_DATA, 76 | license=LICENSE, 77 | classifiers=CLASSIFIERS, 78 | ) 79 | -------------------------------------------------------------------------------- /tests/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kearnz/autoimpute/b5ab3f42aabb6c4535c94ca2b0ab9d147cc93393/tests/__init__.py -------------------------------------------------------------------------------- /tests/test_imputations/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kearnz/autoimpute/b5ab3f42aabb6c4535c94ca2b0ab9d147cc93393/tests/test_imputations/__init__.py -------------------------------------------------------------------------------- /tests/test_imputations/test_mice_imputer.py: -------------------------------------------------------------------------------- 1 | """Tests written to ensure the MiceImputer in the imputations package works. 2 | Note that this also tests the MultipleImputer, which really just passes to 3 | the SingleImputer. SingleImputer has tests, some of which are the same as here. 4 | 5 | Tests use the pytest library. The tests in this module ensure the following: 6 | - `test_stochastic_predictive_imputer` test stochastic strategy. 7 | - `test_bayesian_reg_imputer` test bayesian regression strategy. 8 | - `test_bayesian_logistic_imputer` test bayesian logistic strategy. 9 | - `test_pmm_lrd_imputer` test pmm and lrd strategy. 
-------------------------------------------------------------------------------- /tests/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kearnz/autoimpute/b5ab3f42aabb6c4535c94ca2b0ab9d147cc93393/tests/__init__.py -------------------------------------------------------------------------------- /tests/test_imputations/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kearnz/autoimpute/b5ab3f42aabb6c4535c94ca2b0ab9d147cc93393/tests/test_imputations/__init__.py -------------------------------------------------------------------------------- /tests/test_imputations/test_mice_imputer.py: -------------------------------------------------------------------------------- 1 | """Tests written to ensure the MiceImputer in the imputations package works. 2 | Note that this also tests the MultipleImputer, which really just passes to 3 | the SingleImputer. SingleImputer has tests, some of which are the same as here. 4 | 5 | Tests use the pytest library. The tests in this module ensure the following: 6 | - `test_stochastic_predictive_imputer` test stochastic strategy. 7 | - `test_bayesian_reg_imputer` test bayesian regression strategy. 8 | - `test_bayesian_logistic_imputer` test bayesian logistic strategy. 9 | - `test_pmm_lrd_imputer` test pmm and lrd strategy. 10 | - `test_normal_unit_variance_imputer` test unit variance imputer. - `test_partial_dependence_imputer` test partial dependence edge case. 11 | """ 12 | 13 | import pytest 14 | from autoimpute.imputations import MiceImputer 15 | from autoimpute.utils import dataframes 16 | dfs = dataframes 17 | # pylint:disable=len-as-condition 18 | # pylint:disable=pointless-string-statement 19 | 20 | def test_stochastic_predictive_imputer(): 21 | """Test stochastic works for numerical columns of PredictiveImputer.""" 22 | # generate linear, then stochastic 23 | imp_p = MiceImputer(strategy={"A":"least squares"}) 24 | imp_s = MiceImputer(strategy={"A":"stochastic"}) 25 | # make sure both work 26 | _ = imp_p.fit_transform(dfs.df_num) 27 | _ = imp_s.fit_transform(dfs.df_num) 28 | assert imp_p.imputed_["A"] == imp_s.imputed_["A"] 29 | 30 | def test_bayesian_reg_imputer(): 31 | """Test bayesian works for numerical column of PredictiveImputer.""" 32 | # test designed first - test kwargs and params 33 | imp_b = MiceImputer(strategy={"y":"bayesian least squares"}, 34 | imp_kwgs={"y":{"fill_value": "random", 35 | "am": 11, "cores": 2}}) 36 | imp_b.fit_transform(dfs.df_bayes_reg) 37 | # test on numerical in general 38 | imp_n = MiceImputer(strategy="bayesian least squares") 39 | imp_n.fit_transform(dfs.df_num) 40 | 41 | def test_bayesian_logistic_imputer(): 42 | """Test bayesian works for binary column of PredictiveImputer.""" 43 | imp_b = MiceImputer(strategy={"y":"bayesian binary logistic"}, 44 | imp_kwgs={"y":{"fill_value": "random"}}) 45 | imp_b.fit_transform(dfs.df_bayes_log) 46 | 47 | def test_pmm_lrd_imputer(): 48 | """Test pmm and lrd work for numerical column of PredictiveImputer.""" 49 | # test pmm first - test kwargs and params 50 | imp_pmm = MiceImputer(strategy={"y":"pmm"}, 51 | imp_kwgs={"y": {"fill_value": "random", 52 | "copy_x": False}}) 53 | imp_pmm.fit_transform(dfs.df_bayes_reg) 54 | 55 | # test lrd second - test kwargs and params 56 | imp_lrd = MiceImputer(strategy={"y":"lrd"}, 57 | imp_kwgs={"y": {"fill_value": "random", 58 | "copy_x": False}}) 59 | imp_lrd.fit_transform(dfs.df_bayes_reg) 60 | 61 | def test_normal_unit_variance_imputer(): 62 | """Test normal unit variance imputer for numerical column""" 63 | imp_pmm = MiceImputer(strategy={"y":"normal unit variance"},) 64 | imp_pmm.fit_transform(dfs.df_bayes_reg) 65 | 66 | def test_partial_dependence_imputer(): 67 | """Test to ensure that the edge case for partial dependence is handled""" 68 | imp = MiceImputer(strategy='stochastic') 69 | imp.fit_transform(dfs.df_partial_dependence) 70 | -------------------------------------------------------------------------------- /tests/test_imputations/test_missingness_classifier.py: -------------------------------------------------------------------------------- 1 | """Tests written to ensure the MissingnessClassifier in the imputations package works. 2 | 3 | Tests use the pytest library. 
The tests in this module ensure the following: 4 | - `test_single_missing_column` tests that the bug in issue 56 is fixed 5 | """ 6 | 7 | from autoimpute.imputations import MissingnessClassifier 8 | from autoimpute.utils import dataframes 9 | dfs = dataframes 10 | # pylint:disable=len-as-condition 11 | # pylint:disable=pointless-string-statement 12 | 13 | 14 | def test_single_missing_column(): 15 | """Test that the missingness classifier works correctly""" 16 | imp = MissingnessClassifier() 17 | imp.fit_predict(dfs.df_mis_classifier) 18 | imp.fit_predict_proba(dfs.df_mis_classifier) 19 | -------------------------------------------------------------------------------- /tests/test_imputations/test_single_imputer.py: -------------------------------------------------------------------------------- 1 | """Tests written to ensure the SingleImputer in the imputations package works. 2 | 3 | Tests use the pytest library. The tests in this module ensure the following: 4 | - `test_single_missing_column` throw error when any column fully missing. 5 | - `test_bad_strategy` throw error if strategy is not allowed. 6 | - `test_imputer_strategies_not_allowed` test strategies misspecified. 7 | - `test_wrong_numerical_type` test valid strategy but not for numerical. 8 | - `test_wrong_categorical_type` test valid strategy but not for categorical. 9 | - `test_default_single_imputer` tests the simplest implementation: defaults. 10 | - `test_numerical_univar_imputers` test all numerical strategies. 11 | - `test_categorical_univar_imputers` test all categorical strategies. 12 | - `test_stochastic_predictive_imputer` test stochastic strategy. 13 | - `test_bayesian_reg_imputer` test bayesian regression strategy. 14 | - `test_bayesian_logistic_imputer` test bayesian logistic strategy. 15 | - `test_pmm_lrd_imputer` test pmm and lrd strategy. 
16 | - `test_normal_unit_variance_imputer` test unit variance imputer. - `test_partial_dependence_imputer` test partial dependence edge case. 17 | """ 18 | 19 | import pytest 20 | from autoimpute.imputations import SingleImputer 21 | from autoimpute.utils import dataframes 22 | dfs = dataframes 23 | # pylint:disable=len-as-condition 24 | # pylint:disable=pointless-string-statement 25 | 26 | def test_single_missing_column(): 27 | """Test that the imputer raises an error when a column is fully missing.""" 28 | with pytest.raises(ValueError): 29 | imp = SingleImputer() 30 | imp.fit_transform(dfs.df_col_miss) 31 | 32 | def test_bad_strategy(): 33 | """Test that strategies not supported throw a ValueError.""" 34 | with pytest.raises(ValueError): 35 | imp = SingleImputer(strategy="not_a_strategy") 36 | imp.fit_transform(dfs.df_num) 37 | 38 | def bad_imputers(): 39 | """Test supported strategies but improper column specification.""" 40 | # example with too few strategies specified for given DataFrame 41 | bad_list = SingleImputer(strategy=["mean", "median"]) 42 | # example with incorrect keys for given DataFrame 43 | bad_keys = SingleImputer(strategy={"X":"mean", "B":"median", "C":"mode"}) 44 | return [bad_list, bad_keys] 45 | 46 | @pytest.mark.parametrize("imp", bad_imputers()) 47 | def test_imputer_strategies_not_allowed(imp): 48 | """Test bad imputers""" 49 | with pytest.raises(ValueError): 50 | imp.fit_transform(dfs.df_num) 51 | 52 | def test_wrong_numerical_type(): 53 | """Test supported strategies but improper column type for strategy.""" 54 | num_for_cat = SingleImputer(strategy={"cats": "mean"}) 55 | with pytest.raises(TypeError): 56 | num_for_cat.fit_transform(dfs.df_ts_mixed) 57 | 58 | def test_wrong_categorical_type(): 59 | """Test supported strategies but improper column type for strategy.""" 60 | cat_for_num = SingleImputer(strategy="categorical") 61 | with pytest.raises(TypeError): 62 | cat_for_num.fit_transform(dfs.df_num) 63 | 64 | def test_default_single_imputer(): 65 | """Test the _default method and results for SingleImputer().""" 66 | imp = SingleImputer() 67 | # test df_num first 68 | # ----------------- 69 | # all strategies should default to pmm 70 | imp.fit_transform(dfs.df_num) 71 | for imputer in imp.statistics_.values(): 72 | strat = imputer.statistics_["strategy"] 73 | assert strat == "pmm" 74 | 75 | # test df_mix next 76 | # --------------- 77 | # numerical col should default to pmm 78 | # categorical col should default to multinomial logistic 79 | imp.fit_transform(dfs.df_mix) 80 | gender_imputer = imp.statistics_["gender"] 81 | salary_imputer = imp.statistics_["salary"] 82 | assert salary_imputer.statistics_["strategy"] == "pmm" 83 | assert gender_imputer.statistics_["strategy"] == "multinomial logistic" 84 | 85 | def test_numerical_univar_imputers(): 86 | """Test numerical methods when not using the _default.""" 87 | for num_strat in dfs.num_strategies: 88 | imp = SingleImputer(strategy=num_strat) 89 | imp.fit_transform(dfs.df_num) 90 | for imputer in imp.statistics_.values(): 91 | strat = imputer.statistics_["strategy"] 92 | assert strat == num_strat 93 | 94 | def test_categorical_univar_imputers(): 95 | """Test categorical methods when not using the _default.""" 96 | for cat_strat in dfs.cat_strategies: 97 | imp = SingleImputer(strategy={"cats": cat_strat}) 98 | imp.fit_transform(dfs.df_ts_mixed) 99 | for imputer in imp.statistics_.values(): 100 | strat = imputer.statistics_["strategy"] 101 | assert strat == cat_strat 102 | 103 | 104 | def test_stochastic_predictive_imputer(): 105 | """Test stochastic works for 
numerical columns of PredictiveImputer.""" 106 | # generate linear, then stochastic 107 | imp_p = SingleImputer(strategy={"A":"least squares"}) 108 | imp_s = SingleImputer(strategy={"A":"stochastic"}) 109 | # make sure both work 110 | _ = imp_p.fit_transform(dfs.df_num) 111 | _ = imp_s.fit_transform(dfs.df_num) 112 | assert imp_p.imputed_["A"] == imp_s.imputed_["A"] 113 | 114 | def test_bayesian_reg_imputer(): 115 | """Test bayesian works for numerical column of PredictiveImputer.""" 116 | # test designed first - test kwargs and params 117 | imp_b = SingleImputer(strategy={"y":"bayesian least squares"}, 118 | imp_kwgs={"y":{"fill_value": "random", 119 | "am": 11, "cores": 2}}) 120 | imp_b.fit_transform(dfs.df_bayes_reg) 121 | # test on numerical in general 122 | imp_n = SingleImputer(strategy="bayesian least squares") 123 | imp_n.fit_transform(dfs.df_num) 124 | 125 | def test_bayesian_logistic_imputer(): 126 | """Test bayesian works for binary column of PredictiveImputer.""" 127 | imp_b = SingleImputer(strategy={"y":"bayesian binary logistic"}, 128 | imp_kwgs={"y":{"fill_value": "random"}}) 129 | imp_b.fit_transform(dfs.df_bayes_log) 130 | 131 | def test_pmm_lrd_imputer(): 132 | """Test pmm and lrd work for numerical column of PredictiveImputer.""" 133 | # test pmm first - test kwargs and params 134 | imp_pmm = SingleImputer(strategy={"y":"pmm"}, 135 | imp_kwgs={"y": {"fill_value": "random", 136 | "copy_x": False}}) 137 | imp_pmm.fit_transform(dfs.df_bayes_reg) 138 | 139 | # test lrd second - test kwargs and params 140 | imp_lrd = SingleImputer(strategy={"y":"lrd"}, 141 | imp_kwgs={"y": {"fill_value": "random", 142 | "copy_x": False}}) 143 | imp_lrd.fit_transform(dfs.df_bayes_reg) 144 | 145 | def test_normal_unit_variance_imputer(): 146 | """Test normal unit variance imputer for numerical column""" 147 | imp_pmm = SingleImputer(strategy={"y":"normal unit variance"},) 148 | imp_pmm.fit_transform(dfs.df_bayes_reg) 149 | 150 | def test_partial_dependence_imputer(): 151 | """Test to ensure that the edge case for partial dependence is handled""" 152 | imp = SingleImputer(strategy='stochastic') 153 | imp.fit_transform(dfs.df_partial_dependence) 154 | -------------------------------------------------------------------------------- /tests/test_utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kearnz/autoimpute/b5ab3f42aabb6c4535c94ca2b0ab9d147cc93393/tests/test_utils/__init__.py -------------------------------------------------------------------------------- /tests/test_utils/test_checks.py: -------------------------------------------------------------------------------- 1 | """Tests written to ensure the decorators in the utils package work correctly. 2 | 3 | Tests use the pytest library. The tests in this module ensure the following: 4 | - `check_data_structure` requires pandas DataFrame. 5 | - `check_data_structure` raises errors for any other type of data structure. 6 | - `check_missingness` enforces datasets have observed and missing values. 7 | - `check_missingness` raises errors for fully missing datasets. 8 | - `check_missingness` raises errors for time series missing in datasets. 9 | - `check_nan_columns` raises errors when entire columns are missing. 
10 | """ 11 | 12 | import pytest 13 | import numpy as np 14 | import pandas as pd 15 | from autoimpute.utils.checks import check_data_structure, check_missingness 16 | from autoimpute.utils.checks import check_nan_columns 17 | 18 | @check_data_structure 19 | def check_data(data): 20 | """Helper function to test data structure decorator.""" 21 | return data 22 | 23 | @check_missingness 24 | def check_miss(data): 25 | """Helper function to test missingness decorator.""" 26 | return data 27 | 28 | @check_nan_columns 29 | def check_nan_cols(data): 30 | """Helper function to test removal of NaN columns.""" 31 | return data 32 | 33 | def data_structures_not_allowed(): 34 | """Types that should throw an error for `check_data_structure`.""" 35 | str_ = "string" 36 | int_ = 1 37 | float_ = 1.0 38 | set_ = set([1, 2, 3]) 39 | dict_ = dict(a=str_, b=int_, c=float_) 40 | list_ = [str_, int_, float_] 41 | tuple_ = tuple(list_) 42 | arr_ = np.array([1, 2, 3, 4]) 43 | ser_ = pd.Series({"a": arr_}) 44 | return [str_, int_, float_, set_, dict_, list_, tuple_, arr_, ser_] 45 | 46 | def data_stuctures_allowed(): 47 | """Types that should not throw an error and should return a valid array.""" 48 | df_ = pd.DataFrame({ 49 | "A": [1, 2, 3, 4], 50 | "B": ["a", "b", "c", "d"] 51 | }) 52 | return [df_] 53 | 54 | def missingness_not_allowed(): 55 | """Can't impute datasets that are fully complete or incomplete.""" 56 | df_none = pd.DataFrame({ 57 | "A": [np.nan, np.nan, np.nan], 58 | "B": [None, None, None] 59 | }) 60 | #df_ts = pd.DataFrame({ 61 | # "date": ["2018-05-01", "2018-05-02", "2018-05-03", 62 | # "2018-05-04", "2018-05-05", "2018-05-06", 63 | # "2018-05-07", "2018-05-08", "2018-05-09"], 64 | # "stats": [3, 4, np.nan, 15, 7, np.nan, 26, 25, 62] 65 | #}) 66 | #df_ts["date"] = pd.to_datetime(df_ts["date"], utc=True) 67 | #df_ts.loc[[1, 3], "date"] = np.nan 68 | return [df_none] 69 | 70 | @pytest.mark.parametrize("ds", data_structures_not_allowed()) 71 | def test_data_structures_not_allowed(ds): 72 | """Ensure data structure helper raises TypeError for disallowed types. 73 | 74 | Utilizes the pytest.mark.parametize method to run test on numerous data 75 | structures. Those data structures are returned from the helper method 76 | `data_structures_not_allowed()`, which returns a list of data structures. 77 | Each item within this list should cause this method to throw an error. 78 | Each item in the list takes the on the "ds" name in pytest. 79 | 80 | Args: 81 | ds (any -> iterator): any data structure within an iterator. `ds` is 82 | alias each item in the iterator takes when being tested. 83 | 84 | Returns: 85 | None: raises errors when improper types passed. 86 | 87 | Raises: 88 | TypeError: data structure `ds` is not allowed. 89 | """ 90 | with pytest.raises(TypeError): 91 | check_data(ds) 92 | 93 | @pytest.mark.parametrize("ds", data_stuctures_allowed()) 94 | def test_data_structures_allowed(ds): 95 | """Ensure data structure helper returns expected types. 96 | 97 | Utilizes the pytest.mark.parametize method to run test on numerous data 98 | structures. Those data structures are returned from the helper method 99 | `data_structures_allowed()`, which right now returns a DataFrame/Series. 100 | 101 | Args: 102 | ds (any -> iterator): any data structure within an iterator. `ds` is 103 | alias each item in the iterator takes when being tested. 104 | 105 | Returns: 106 | None: asserts that the appropriate type has been passed. 
107 | """ 108 | assert isinstance(ds, pd.DataFrame) 109 | 110 | @pytest.mark.parametrize("ds", missingness_not_allowed()) 111 | def test_missingness_not_allowed(ds): 112 | """Ensure missingness helper raises ValueError for fully missing DataFrame. 113 | 114 | Also utilizes the pytest.mark.parametize method to run test. Tests run on 115 | items in iterator returned from `missingness_not_allowed()`, which right 116 | now returns a fully missing DataFrame and a time series DataFrame with 117 | missingness in the time series itself. 118 | 119 | Args: 120 | ds (any -> iterator): any data structure within an iterator. `ds` is 121 | alias each item in the iterator takes when being tested. 122 | 123 | Returns: 124 | None: raises error because DataFrame is fully missing. 125 | 126 | Raises: 127 | ValueError: if the DataFrame is fully missing. 128 | """ 129 | with pytest.raises(ValueError): 130 | check_miss(ds) 131 | 132 | def test_nan_columns(): 133 | """Check missing columns throw error with check_nan_columns decorator. 134 | 135 | The `check_nan_columns` decorator should throw an error if any columns 136 | in the dataframe have all values missing. This test simulates data in a 137 | DataFrame where two of the columns have all missing values. The DataFrame 138 | is below. In this case, `B` and `C` should generate an error because they 139 | contain all missing values. Error message should capture both columns. 140 | 141 | Args: 142 | None: DataFrame hard-coded internally. 143 | 144 | Returns: 145 | None: asserts that fully missing columns generate error. 146 | """ 147 | df = pd.DataFrame({ 148 | "A": [1, np.nan, 3, 4], 149 | "B": [None, None, None, None], 150 | "C": [np.nan, np.nan, np.nan, np.nan], 151 | "D": ["a", "b", None, "d"] 152 | }) 153 | assert pd.isnull(df["B"]).all() 154 | assert pd.isnull(df["C"]).all() 155 | with pytest.raises(ValueError): 156 | check_nan_cols(df) 157 | -------------------------------------------------------------------------------- /tests/test_utils/test_patterns.py: -------------------------------------------------------------------------------- 1 | """Tests written to ensure the patterns in the utils package work correctly. 2 | 3 | The methods tested below are ports from MICE, an excellent R package for 4 | handling missing data. The author of MICE (Van Buuren) is also the author of 5 | Flexible Imputation of Missing Data (FIMD). The `df_general` variable below 6 | is a simulation of the "general" pattern in section 4.1. The subsequent 7 | dataframes are hard coded results of patterns from FIMD. They are used to 8 | verify that this implementation in Python is working as expected. The methods 9 | being tested (i.e. inbound, outbound, flux, etc.) all use `df_general` to 10 | calculate their respective statistic. This allows for comparison to results 11 | from FIMD, section 4.1. 12 | 13 | Tests use the pytest library. The tests in this module ensure the following: 14 | - `test_md_locations` checks missingness identified properly as 1/0. 15 | - `test_md_pattern` checks against result from MICE md.pattern. 16 | - `test_md_pairs` checks against result from MICE md.pairs 17 | - `test_inbound` checks against inbound calc in 4.1 (no explicit method) 18 | - `test_outbound` checks against outbound calc in 4.1 (no explicit method) 19 | - `test_flux` checks against MICE flux. 
20 | """ 21 | 22 | import numpy as np 23 | import pandas as pd 24 | from autoimpute.utils.patterns import md_locations, md_pairs, md_pattern 25 | from autoimpute.utils.patterns import inbound, outbound, flux 26 | 27 | df_general = pd.DataFrame({ 28 | "A": [1, 5, 9, 6, 12, 11, np.nan, np.nan], 29 | "B": [2, 4, 3, 6, 11, np.nan, np.nan, np.nan], 30 | "C": [-1, 1, np.nan, np.nan, np.nan, -1, 1, 0] 31 | }) 32 | 33 | df_pattern = pd.DataFrame({ 34 | "count": [2, 3, 1, 2], 35 | "A": [1, 1, 1, 0], 36 | "B": [1, 1, 0, 0], 37 | "C": [1, 0, 1, 1], 38 | "nmis": [0, 1, 1, 2] 39 | }) 40 | 41 | dict_pairs = {} 42 | ci = ["A", "B", "C"] 43 | create_df = lambda v: pd.DataFrame(v, columns=ci, index=ci) 44 | dict_pairs["rr"] = create_df([[6, 5, 3], [5, 5, 2], [3, 2, 5]]) 45 | dict_pairs["rm"] = create_df([[0, 1, 3], [0, 0, 3], [2, 3, 0]]) 46 | dict_pairs["mr"] = create_df([[0, 0, 2], [1, 0, 3], [3, 3, 0]]) 47 | dict_pairs["mm"] = create_df([[2, 2, 0], [2, 3, 0], [0, 0, 3]]) 48 | 49 | df_inbound = create_df([[0, 1/3, 1], [0, 0, 1], [1, 1, 0]]).T 50 | df_outbound = create_df([[0, 0, 0.4], [1/6, 0, 0.6], [0.5, 0.6, 0]]).T 51 | 52 | df_flux = pd.DataFrame({ 53 | "pobs": [0.75, 0.625, 0.625], 54 | "influx": [0.125, 0.250, 0.375], 55 | "outflux": [0.5, 0.375, 0.625] 56 | }, index=ci) 57 | 58 | def test_md_locations(): 59 | """Test to ensure that missingness locations are identified. 60 | 61 | Missingness locations should equal np.isnan for each col. 62 | Assert that md_locations returns a DataFrame, and then check 63 | that each column equals what is expected from np.isnan. 64 | 65 | Args: 66 | None: DataFrame for testing created internally. 67 | 68 | Returns: 69 | None: asserts locations for missingness are as expected. 70 | """ 71 | md_loc = md_locations(df_general) 72 | assert isinstance(md_loc, pd.DataFrame) 73 | assert all(md_loc["A"] == np.isnan(df_general["A"])) 74 | assert all(md_loc["B"] == np.isnan(df_general["B"])) 75 | assert all(md_loc["C"] == np.isnan(df_general["C"])) 76 | 77 | def test_md_pattern(): 78 | """Test that missing data pattern equal to expected results. 79 | 80 | `df_pattern` name assigned to DataFrame in module's scope that contains 81 | the expected pattern from VB 4.1 `md.pattern()` example in R. Result 82 | is hard coded, and python version tested with assertions below. 83 | 84 | Args: 85 | None: DataFrame for testing created internally. 86 | 87 | Returns: 88 | None: asserts missingness patterns are as expected. 89 | """ 90 | md_pat = md_pattern(df_general) 91 | assert isinstance(md_pat, pd.DataFrame) 92 | assert all(md_pat["count"] == df_pattern["count"]) 93 | assert all(md_pat[["A", "B", "C"]] == df_pattern[["A", "B", "C"]]) 94 | assert all(md_pat["nmis"] == df_pattern["nmis"]) 95 | 96 | def test_md_pairs(): 97 | """Test that missing data pairs equal to expected results. 98 | 99 | `dict_pairs` contains 4 keys - one for each pair expected. The pairs 100 | are `rr`, `mr`, `rm`, and `mm`. Pair types described in the docstrings 101 | of the md_pairs method in the utils.patterns module. Missing data pairs 102 | should equal expected pairs from VB 4.1 `md.pairs() in R. 103 | 104 | Args: 105 | None: Pairs for testing created internally. 106 | 107 | Returns: 108 | None: asserts pairs are as expected. 
109 | """ 110 | md_pair = md_pairs(df_general) 111 | assert isinstance(md_pair, dict) 112 | assert all(md_pair["rr"] == dict_pairs["rr"]) 113 | assert all(md_pair["mr"] == dict_pairs["mr"]) 114 | assert all(md_pair["rm"] == dict_pairs["rm"]) 115 | assert all(md_pair["mm"] == dict_pairs["mm"]) 116 | 117 | def test_inbound(): 118 | """Test that the inbound statistic equal to expected results. 119 | 120 | `df_inbound` contains hard-coded expected result. Tested against the 121 | inbound function, which takes `df_general` as an input. 122 | 123 | Args: 124 | None: DataFrame for testing created internally. 125 | 126 | Returns: 127 | None: asserts inbound statistic results are as expected. 128 | """ 129 | inbound_ = inbound(df_general) 130 | assert isinstance(inbound_, pd.DataFrame) 131 | assert all(inbound_["A"] == df_inbound["A"]) 132 | assert all(inbound_["B"] == df_inbound["B"]) 133 | assert all(inbound_["C"] == df_inbound["C"]) 134 | 135 | def test_outbound(): 136 | """Test that the outbound statistic equal to expected results. 137 | 138 | `df_outbound` contains hard-coded expected result. Tested against the 139 | outbound function, which takes `df_general` as an input. 140 | 141 | Args: 142 | None: DataFrame for testing created internally. 143 | 144 | Returns: 145 | None: asserts outbound statistic results are as expected. 146 | """ 147 | outbound_ = outbound(df_general) 148 | print(df_inbound) 149 | assert isinstance(outbound_, pd.DataFrame) 150 | assert all(outbound_["A"] == df_outbound["A"]) 151 | assert all(outbound_["B"] == df_outbound["B"]) 152 | assert all(outbound_["C"] == df_outbound["C"]) 153 | 154 | def test_flux(): 155 | """Test that the flux coeffs and proportions equal to expected results. 156 | 157 | `df_flux` contains hard-coded expected result. Tested against the 158 | influx, outflux, and proportions functions, which all take `df_general` 159 | as an input. 160 | 161 | Args: 162 | None: DataFrame for testing created internally. 163 | 164 | Returns: 165 | None: asserts pobs, influx, and outflux results are as expected. 166 | """ 167 | flux_ = flux(df_general) 168 | assert isinstance(flux_, pd.DataFrame) 169 | assert all(flux_["pobs"] == df_flux["pobs"]) 170 | assert all(flux_["influx"] == df_flux["influx"]) 171 | assert all(flux_["outflux"] == df_flux["outflux"]) 172 | -------------------------------------------------------------------------------- /tests/test_visuals/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kearnz/autoimpute/b5ab3f42aabb6c4535c94ca2b0ab9d147cc93393/tests/test_visuals/__init__.py --------------------------------------------------------------------------------