├── CreditScoringToolkit.py ├── requirements-dev.txt ├── pytest.ini ├── images ├── score_kde.png ├── event_range_5.png ├── roc_auc_curve.png ├── event_range_10.png ├── score_histogram.png ├── scoring_method.png ├── Usage Example_22_0.png ├── Usage Example_22_1.png ├── Usage Example_24_1.png ├── Usage Example_24_3.png └── feature_importance.png ├── reports ├── iv_report.png ├── roc_curve.png ├── score_kde.png ├── event_rate_5.png ├── event_rate_10.png └── score_histogram.png ├── requirements.txt ├── CHANGELOG.md ├── woe_credit_scoring ├── __init__.py ├── reporter.py ├── base.py ├── encoder.py ├── scoring.py ├── normalizer.py ├── binning.py └── autocreditscoring.py ├── pyproject.toml ├── tests ├── unit │ ├── test_normalizer.py │ ├── test_feature_selectors.py │ └── test_encoder.py └── integration │ ├── test_autocreditscoring_pipeline.py │ └── test_manual_pipeline.py ├── .gitignore ├── README.md └── LICENSE /CreditScoringToolkit.py: -------------------------------------------------------------------------------- 1 | from woe_credit_scoring import * 2 | -------------------------------------------------------------------------------- /requirements-dev.txt: -------------------------------------------------------------------------------- 1 | pytest 2 | pandas 3 | scikit-learn 4 | -------------------------------------------------------------------------------- /pytest.ini: -------------------------------------------------------------------------------- 1 | [pytest] 2 | minversion = 6.0 3 | testpaths = tests 4 | -------------------------------------------------------------------------------- /images/score_kde.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JGFuentesC/woe_credit_scoring/HEAD/images/score_kde.png -------------------------------------------------------------------------------- /reports/iv_report.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JGFuentesC/woe_credit_scoring/HEAD/reports/iv_report.png -------------------------------------------------------------------------------- /reports/roc_curve.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JGFuentesC/woe_credit_scoring/HEAD/reports/roc_curve.png -------------------------------------------------------------------------------- /reports/score_kde.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JGFuentesC/woe_credit_scoring/HEAD/reports/score_kde.png -------------------------------------------------------------------------------- /images/event_range_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JGFuentesC/woe_credit_scoring/HEAD/images/event_range_5.png -------------------------------------------------------------------------------- /images/roc_auc_curve.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JGFuentesC/woe_credit_scoring/HEAD/images/roc_auc_curve.png -------------------------------------------------------------------------------- /reports/event_rate_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JGFuentesC/woe_credit_scoring/HEAD/reports/event_rate_5.png 
-------------------------------------------------------------------------------- /images/event_range_10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JGFuentesC/woe_credit_scoring/HEAD/images/event_range_10.png -------------------------------------------------------------------------------- /images/score_histogram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JGFuentesC/woe_credit_scoring/HEAD/images/score_histogram.png -------------------------------------------------------------------------------- /images/scoring_method.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JGFuentesC/woe_credit_scoring/HEAD/images/scoring_method.png -------------------------------------------------------------------------------- /reports/event_rate_10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JGFuentesC/woe_credit_scoring/HEAD/reports/event_rate_10.png -------------------------------------------------------------------------------- /images/Usage Example_22_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JGFuentesC/woe_credit_scoring/HEAD/images/Usage Example_22_0.png -------------------------------------------------------------------------------- /images/Usage Example_22_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JGFuentesC/woe_credit_scoring/HEAD/images/Usage Example_22_1.png -------------------------------------------------------------------------------- /images/Usage Example_24_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JGFuentesC/woe_credit_scoring/HEAD/images/Usage Example_24_1.png -------------------------------------------------------------------------------- /images/Usage Example_24_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JGFuentesC/woe_credit_scoring/HEAD/images/Usage Example_24_3.png -------------------------------------------------------------------------------- /images/feature_importance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JGFuentesC/woe_credit_scoring/HEAD/images/feature_importance.png -------------------------------------------------------------------------------- /reports/score_histogram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JGFuentesC/woe_credit_scoring/HEAD/reports/score_histogram.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy>=1.21.0 2 | pandas>=1.3.0 3 | scikit-learn>=1.0.0 4 | seaborn>=0.11.0 5 | matplotlib>=3.4.0 6 | scipy>=1.7.0 7 | ipykernel>=6.0.0 8 | -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | 3 | ## [2.0.4] - 2025-12-20 4 | 5 | ### Features 6 | - Refactored the project into a multi-file package structure for better 
organization and maintainability. 7 | - Added a comprehensive suite of unit and integration tests to ensure code quality and stability. 8 | 9 | ### Fixes 10 | - Resolved several minor bugs identified during the refactoring and testing process. 11 | 12 | ### Chore 13 | - Added development dependencies and pytest configuration for a formal testing process. 14 | -------------------------------------------------------------------------------- /woe_credit_scoring/__init__.py: -------------------------------------------------------------------------------- 1 | # woe_credit_scoring/__init__.py 2 | 3 | from .normalizer import DiscreteNormalizer 4 | from .reporter import frequency_table 5 | from .base import WoeBaseFeatureSelector 6 | from .binning import Discretizer, WoeContinuousFeatureSelector, WoeDiscreteFeatureSelector, IVCalculator 7 | from .encoder import WoeEncoder 8 | from .scoring import CreditScoring 9 | from .autocreditscoring import AutoCreditScoring 10 | 11 | 12 | __all__ = [ 13 | "DiscreteNormalizer", "frequency_table", "WoeBaseFeatureSelector", 14 | "Discretizer", "WoeEncoder", "WoeContinuousFeatureSelector", "WoeDiscreteFeatureSelector", 15 | "CreditScoring", "AutoCreditScoring", "IVCalculator" 16 | ] 17 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [build-system] 2 | requires = ["setuptools>=61.0"] 3 | build-backend = "setuptools.build_meta" 4 | 5 | [project] 6 | name = "woe_credit_scoring" 7 | version = "2.0.4" 8 | description = "A toolkit for Credit Scoring using Weight of Evidence (WoE) and Logistic Regression" 9 | readme = "README.md" 10 | requires-python = ">=3.10" 11 | classifiers = [ 12 | "Programming Language :: Python :: 3", 13 | "License :: OSI Approved :: GNU General Public License v3 (GPLv3)", 14 | "Operating System :: OS Independent", 15 | ] 16 | dependencies = [ 17 | "numpy>=1.21.0", 18 | "pandas>=1.3.0", 19 | "scikit-learn>=1.0.0", 20 | "seaborn>=0.11.0", 21 | "matplotlib>=3.4.0", 22 | "scipy>=1.7.0", 23 | ] 24 | 25 | [project.urls] 26 | "Homepage" = "https://github.com/JGFuentesC/woe_credit_scoring" 27 | "Bug Tracker" = "https://github.com/JGFuentesC/woe_credit_scoring/issues" 28 | 29 | [tool.setuptools.packages.find] 30 | where = ["."] 31 | 32 | [tool.setuptools] 33 | py-modules = ["CreditScoringToolkit"] 34 | -------------------------------------------------------------------------------- /woe_credit_scoring/reporter.py: -------------------------------------------------------------------------------- 1 | from typing import Union, List 2 | import pandas as pd 3 | import logging 4 | 5 | logger = logging.getLogger("CreditScoringToolkit") 6 | 7 | def frequency_table(df: pd.DataFrame, variables: Union[List[str], str]) -> None: 8 | """ 9 | Displays a frequency table for the specified variables in the DataFrame. 10 | 11 | Args: 12 | df (pd.DataFrame): The input DataFrame. 13 | variables (Union[List[str], str]): List of variables (column names) to generate frequency tables for. 
14 | 15 | Returns: 16 | None 17 | """ 18 | if not isinstance(df, pd.DataFrame): 19 | raise TypeError("The first argument must be a pandas DataFrame.") 20 | 21 | if isinstance(variables, str): 22 | variables = [variables] 23 | 24 | if not isinstance(variables, list) or not all(isinstance(var, str) for var in variables): 25 | raise TypeError( 26 | "The second argument must be a string or a list of strings.") 27 | 28 | for variable in variables: 29 | if variable not in df.columns: 30 | logger.warning(f"{variable} not found in DataFrame columns.") 31 | continue 32 | 33 | frequency_df = df[variable].value_counts().to_frame().sort_index() 34 | frequency_df.columns = ['Abs. Freq.'] 35 | frequency_df['Rel. Freq.'] = frequency_df['Abs. Freq.'] / \ 36 | frequency_df['Abs. Freq.'].sum() 37 | frequency_df[['Cum. Abs. Freq.', 'Cum. Rel. Freq.'] 38 | ] = frequency_df.cumsum() 39 | 40 | print(f'**** Frequency Table for {variable} ****\n') 41 | print(frequency_df) 42 | print("\n" * 3) 43 | -------------------------------------------------------------------------------- /tests/unit/test_normalizer.py: -------------------------------------------------------------------------------- 1 | 2 | import pandas as pd 3 | import numpy as np 4 | import pytest 5 | from CreditScoringToolkit import DiscreteNormalizer 6 | 7 | @pytest.fixture 8 | def sample_data(): 9 | 10 | data = { 11 | 'feature1': ['A', 'A', 'B', 'B', 'B', 'C', 'D', 'D', np.nan], 12 | 'feature2': ['X', 'X', 'Y', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z'] 13 | } 14 | return pd.DataFrame(data) 15 | 16 | def test_small_category_aggregation(sample_data): 17 | 18 | dn = DiscreteNormalizer(normalization_threshold=0.3, default_category='SMALL') 19 | dn.fit(sample_data[['feature1']]) 20 | transformed = dn.transform(sample_data[['feature1']]) 21 | expected_values = ['SMALL', 'SMALL', 'B', 'B', 'B', 'SMALL', 'SMALL', 'SMALL', 'SMALL'] 22 | assert transformed['feature1'].tolist() == expected_values 23 | 24 | def test_missing_value_handling(sample_data): 25 | 26 | dn = DiscreteNormalizer(normalization_threshold=0.1) 27 | dn.fit(sample_data[['feature1']]) 28 | transformed = dn.transform(sample_data[['feature1']]) 29 | assert 'MISSING' in transformed['feature1'].unique() 30 | assert transformed['feature1'].iloc[8] == 'MISSING' 31 | 32 | def test_unseen_categories(): 33 | 34 | train_data = pd.DataFrame({'feature1': ['A', 'A', 'B', 'B', 'B']}) 35 | test_data = pd.DataFrame({'feature1': ['A', 'C', 'B']}) 36 | dn = DiscreteNormalizer() 37 | dn.fit(train_data) 38 | transformed = dn.transform(test_data) 39 | 40 | assert transformed['feature1'].tolist() == ['A', 'B', 'B'] 41 | 42 | def test_no_small_categories(sample_data): 43 | 44 | dn = DiscreteNormalizer(normalization_threshold=0.1) 45 | dn.fit(sample_data[['feature2']]) 46 | transformed = dn.transform(sample_data[['feature2']]) 47 | assert 'SMALL CATEGORIES' not in transformed['feature2'].unique() 48 | expected_values = ['X', 'X', 'Y', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z'] 49 | assert transformed['feature2'].tolist() == expected_values 50 | -------------------------------------------------------------------------------- /tests/integration/test_autocreditscoring_pipeline.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import pytest 3 | from sklearn.metrics import roc_auc_score 4 | from woe_credit_scoring.autocreditscoring import AutoCreditScoring 5 | 6 | @pytest.fixture(scope='module') 7 | def data(): 8 | 9 | train = pd.read_csv('example_data/train.csv') 10 | valid = 
pd.read_csv('example_data/valid.csv') 11 | return train, valid 12 | 13 | def test_autocreditscoring_pipeline(data): 14 | 15 | train, valid = data 16 | varc = [v for v in train.columns if v.startswith('C_')] 17 | vard = [v for v in train.columns if v.startswith('D_')] 18 | 19 | # We use the original train data for fitting, the class will split it internally 20 | full_train_for_acs = pd.concat([train, valid]).reset_index(drop=True) 21 | 22 | acs = AutoCreditScoring( 23 | data=full_train_for_acs, # The class will perform its own train/test split 24 | target='TARGET', 25 | continuous_features=varc, 26 | discrete_features=vard 27 | ) 28 | 29 | fit_params = { 30 | 'iv_feature_threshold': 0.1, 31 | 'max_discretization_bins': 5, 32 | 'discretization_method': 'quantile', 33 | 'create_reporting': False, 34 | 'target_proportion_tolerance': 0.1 # Increase tolerance for small dataset 35 | } 36 | acs.fit(**fit_params) 37 | 38 | 39 | assert acs.credit_scoring.scorecard is not None 40 | assert not acs.credit_scoring.scorecard.empty 41 | 42 | # Use the original validation set for prediction 43 | predictions = acs.predict(valid) 44 | assert 'score' in predictions.columns 45 | 46 | # To get probabilities, we must manually apply the pipeline and use the fitted model 47 | # Accessing private method for testing purposes 48 | valid_woe = acs._AutoCreditScoring__apply_pipeline(valid) 49 | valid_proba = acs.model.predict_proba(valid_woe)[:, 1] 50 | 51 | auc = roc_auc_score(y_true=valid['TARGET'], y_score=valid_proba) 52 | 53 | assert auc > 0.65 -------------------------------------------------------------------------------- /tests/unit/test_feature_selectors.py: -------------------------------------------------------------------------------- 1 | 2 | import pandas as pd 3 | import numpy as np 4 | import pytest 5 | from CreditScoringToolkit import WoeDiscreteFeatureSelector, WoeContinuousFeatureSelector 6 | 7 | @pytest.fixture 8 | def feature_selection_data(): 9 | 10 | data = { 11 | 'discrete_feature_good': ['A'] * 5 + ['B'] * 5, 12 | 'discrete_feature_bad': ['C'] * 9 + ['D'] * 1, 13 | 'continuous_feature_good': np.arange(10), 14 | 'continuous_feature_bad': np.concatenate([np.zeros(9), np.ones(1)]), 15 | 'target': [0,0,0,0,1, 0,1,1,1,1] 16 | } 17 | return pd.DataFrame(data) 18 | 19 | def test_woe_discrete_feature_selector(feature_selection_data): 20 | 21 | selector = WoeDiscreteFeatureSelector() 22 | selector.fit( 23 | feature_selection_data[['discrete_feature_good', 'discrete_feature_bad']], 24 | feature_selection_data['target'], 25 | iv_threshold=0.1 26 | ) 27 | 28 | assert 'discrete_feature_good' in selector.selected_features 29 | assert 'discrete_feature_bad' not in selector.selected_features 30 | 31 | transformed = selector.transform(feature_selection_data[['discrete_feature_good', 'discrete_feature_bad']]) 32 | assert list(transformed.columns) == ['discrete_feature_good'] 33 | 34 | def test_woe_continuous_feature_selector(feature_selection_data): 35 | 36 | selector = WoeContinuousFeatureSelector() 37 | selector.fit( 38 | feature_selection_data[['continuous_feature_good', 'continuous_feature_bad']], 39 | feature_selection_data['target'], 40 | iv_threshold=0.1, 41 | method='quantile', 42 | max_bins=2 43 | ) 44 | 45 | assert len(selector.selected_features) == 1 46 | assert selector.selected_features[0]['root_feature'] == 'continuous_feature_good' 47 | 48 | transformed = selector.transform(feature_selection_data[['continuous_feature_good', 'continuous_feature_bad']]) 49 | 50 | 51 | assert 
any(col.startswith('disc_continuous_feature_good') for col in transformed.columns) 52 | assert not any(col.startswith('disc_continuous_feature_bad') for col in transformed.columns) 53 | -------------------------------------------------------------------------------- /tests/integration/test_manual_pipeline.py: -------------------------------------------------------------------------------- 1 | 2 | import pandas as pd 3 | import pytest 4 | from sklearn.linear_model import LogisticRegression 5 | from sklearn.metrics import roc_auc_score 6 | 7 | from CreditScoringToolkit import ( 8 | DiscreteNormalizer, 9 | WoeEncoder, 10 | WoeContinuousFeatureSelector, 11 | WoeDiscreteFeatureSelector, 12 | CreditScoring 13 | ) 14 | 15 | @pytest.fixture(scope='module') 16 | def data(): 17 | 18 | train = pd.read_csv('example_data/train.csv') 19 | valid = pd.read_csv('example_data/valid.csv') 20 | return train, valid 21 | 22 | def test_manual_pipeline_auc(data): 23 | 24 | train, valid = data 25 | 26 | vard = [v for v in train.columns if v.startswith('D_')] 27 | varc = [v for v in train.columns if v.startswith('C_')] 28 | 29 | # 1. Normalization 30 | dn = DiscreteNormalizer(normalization_threshold=0.05, default_category='SMALL CATEGORIES') 31 | dn.fit(train[vard]) 32 | train_norm = dn.transform(train[vard]) 33 | valid_norm = dn.transform(valid[vard]) 34 | 35 | # 2. Feature Selection 36 | wcf = WoeContinuousFeatureSelector() 37 | wdf = WoeDiscreteFeatureSelector() 38 | 39 | wcf.fit(train[varc], train['TARGET'], method='quantile', iv_threshold=0.1, max_bins=5) 40 | wdf.fit(train_norm, train['TARGET'], iv_threshold=0.1) 41 | 42 | train_selected = pd.concat([wdf.transform(train_norm), wcf.transform(train[varc])], axis=1) 43 | valid_selected = pd.concat([wdf.transform(valid_norm), wcf.transform(valid[varc])], axis=1) 44 | 45 | features = list(train_selected.columns) 46 | assert len(features) > 0 47 | 48 | # 3. WoE Encoding 49 | we = WoeEncoder() 50 | we.fit(train_selected, train['TARGET']) 51 | 52 | train_woe = we.transform(train_selected) 53 | valid_woe = we.transform(valid_selected) 54 | 55 | # 4. Model Training 56 | lr = LogisticRegression() 57 | lr.fit(train_woe, train['TARGET']) 58 | 59 | # 5. Scoring and Validation 60 | cs = CreditScoring() 61 | cs.fit(train_woe, we, lr) 62 | 63 | valid_pred_proba = lr.predict_proba(valid_woe)[:, 1] 64 | auc = roc_auc_score(y_true=valid['TARGET'], y_score=valid_pred_proba) 65 | 66 | assert auc > 0.65 67 | assert cs.scorecard is not None 68 | assert not cs.scorecard.empty 69 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | **/__pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # Distribution / packaging 7 | .Python 8 | build/ 9 | develop-eggs/ 10 | dist/ 11 | downloads/ 12 | eggs/ 13 | .eggs/ 14 | lib/ 15 | lib64/ 16 | parts/ 17 | sdist/ 18 | var/ 19 | wheels/ 20 | pip-wheel-metadata/ 21 | share/python-wheels/ 22 | *.egg-info/ 23 | .installed.cfg 24 | *.egg 25 | MANIFEST 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .nox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | *.py,cover 48 | .hypothesis/ 49 | .pytest_cache/ 50 | 51 | # Translations 52 | *.mo 53 | *.pot 54 | 55 | # Django stuff: 56 | *.log 57 | local_settings.py 58 | db.sqlite3 59 | db.sqlite3-journal 60 | 61 | # Flask stuff: 62 | instance/ 63 | .webassets-cache 64 | 65 | # Scrapy stuff: 66 | .scrapy 67 | 68 | # PyBuilder 69 | target/ 70 | 71 | # Jupyter Notebook 72 | .ipynb_checkpoints 73 | 74 | # IPython 75 | profile_default/ 76 | ipython_config.py 77 | 78 | # pyenv 79 | .python-version 80 | 81 | # pipenv 82 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 83 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 84 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 85 | # install all needed dependencies. 86 | #Pipfile.lock 87 | 88 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 89 | __pypackages__/ 90 | 91 | # Celery stuff 92 | celerybeat-schedule 93 | celerybeat.pid 94 | 95 | # SageMath parsed files 96 | *.sage.py 97 | 98 | # Environments 99 | .env 100 | .venv 101 | env/ 102 | venv/ 103 | ENV/ 104 | env.bak/ 105 | venv.bak/ 106 | 107 | # Spyder project settings 108 | .spyderproject 109 | .spyproject 110 | 111 | # Rope project settings 112 | .ropeproject 113 | 114 | # mkdocs documentation 115 | /site 116 | 117 | # mypy 118 | .mypy_cache/ 119 | .dmypy.json 120 | dmypy.json 121 | 122 | # Pyre type checker 123 | .pyre/ 124 | 125 | #pip 126 | 127 | **/.venv/ -------------------------------------------------------------------------------- /woe_credit_scoring/base.py: -------------------------------------------------------------------------------- 1 | from typing import Union 2 | import numpy as np 3 | import pandas as pd 4 | 5 | class WoeBaseFeatureSelector: 6 | """ 7 | Base class for selecting features based on their Weight of Evidence (WoE) 8 | transformation and Information Value (IV) statistic. 9 | 10 | This class provides foundational methods for evaluating and selecting 11 | features by transforming them using WoE and calculating their IV. 12 | The IV statistic is used to measure the predictive power of each feature 13 | with respect to a binary target variable. Features with higher IV values 14 | are considered more predictive. 15 | 16 | The class includes methods to compute the IV statistic, check for 17 | monotonic risk behavior, and other utility functions that can be extended 18 | by subclasses to implement specific feature selection strategies. 19 | 20 | Attributes: 21 | None 22 | 23 | Methods: 24 | _information_value(X, y): Computes the IV statistic for a given feature. 25 | _check_monotonic(X, y): Checks if a feature exhibits monotonic risk behavior. 26 | """ 27 | 28 | def __init__(self): 29 | pass 30 | 31 | @staticmethod 32 | def _information_value(X: pd.Series, y: pd.Series) -> Union[float, None]: 33 | """ 34 | Computes information value (IV) statistic. 35 | 36 | Args: 37 | X (pd.Series): Discretized predictors data. 38 | y (pd.Series): Dichotomic response feature. 39 | 40 | Returns: 41 | Union[float, None]: IV statistic or None if IV is infinite. 
42 | 43 | Reference: 44 | For more details on the Information Value statistic, see 45 | http://arxiv.org/pdf/2309.13183 46 | """ 47 | aux = pd.concat([X, y], axis=1) 48 | aux.columns = ['x', 'y'] 49 | aux = aux.assign(nrow=1) 50 | aux = aux.pivot_table(index='x', columns='y', 51 | values='nrow', aggfunc='sum', fill_value=0) 52 | aux /= aux.sum() 53 | aux['woe'] = np.log(aux[0] / aux[1]) 54 | aux['iv'] = (aux[0] - aux[1]) * aux['woe'] 55 | iv = aux['iv'].sum() 56 | return None if np.isinf(iv) else iv 57 | 58 | @staticmethod 59 | def _check_monotonic(X: pd.Series, y: pd.Series) -> bool: 60 | """ 61 | Validates if a given discretized feature has monotonic risk behavior. 62 | 63 | Args: 64 | X (pd.Series): Discretized predictors data. 65 | y (pd.Series): Dichotomic response feature. 66 | 67 | Returns: 68 | bool: Whether or not the feature has monotonic risk. 69 | """ 70 | aux = pd.concat([X, y], axis=1) 71 | aux.columns = ['x', 'y'] 72 | aux = aux.loc[aux['x'] != 'MISSING'].reset_index(drop=True) 73 | aux = aux.groupby('x').mean() 74 | y_values = list(aux['y']) 75 | return (len(y_values) >= 2) and (sorted(y_values) == y_values or sorted(y_values, reverse=True) == y_values) 76 | -------------------------------------------------------------------------------- /tests/unit/test_encoder.py: -------------------------------------------------------------------------------- 1 | 2 | import pandas as pd 3 | import numpy as np 4 | import pytest 5 | from CreditScoringToolkit import WoeEncoder 6 | 7 | @pytest.fixture 8 | def woe_data(): 9 | 10 | data = { 11 | 'feature': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'], 12 | 'target': [0, 0, 1, 0, 1, 0, 1, 1, 1] 13 | } 14 | return pd.DataFrame(data) 15 | 16 | def test_woe_calculation(woe_data): 17 | 18 | encoder = WoeEncoder() 19 | encoder.fit(woe_data[['feature']], woe_data['target']) 20 | woe_table = pd.DataFrame.from_dict(encoder._woe_encoding_map['feature'], orient='index', columns=['woe']) 21 | 22 | 23 | 24 | # Correct WoE is log( P(0) / P(1) ) 25 | p0 = woe_data['target'].value_counts(normalize=True)[0] 26 | p1 = 1 - p0 27 | 28 | # For category A 29 | p0_A = woe_data[woe_data['feature'] == 'A']['target'].value_counts(normalize=True)[0] 30 | p1_A = 1 - p0_A 31 | expected_A = np.log((p0_A / p0) / (p1_A / p1)) if p1_A > 0 and p0_A > 0 else 0 32 | 33 | # For category B 34 | p0_B = woe_data[woe_data['feature'] == 'B']['target'].value_counts(normalize=True)[0] 35 | p1_B = 1 - p0_B 36 | expected_B = np.log((p0_B / p0) / (p1_B / p1)) if p1_B > 0 and p0_B > 0 else 0 37 | 38 | # For category C 39 | p0_C = woe_data[woe_data['feature'] == 'C']['target'].value_counts(normalize=True)[0] 40 | p1_C = 1 - p0_C 41 | expected_C = np.log((p0_C / p0) / (p1_C / p1)) if p1_C > 0 and p0_C > 0 else 0 42 | 43 | assert np.isclose(woe_table.loc['A', 'woe'], np.log( (2/4) / (1/5))) 44 | assert np.isclose(woe_table.loc['B', 'woe'], np.log( (1/4) / (1/5))) 45 | assert np.isclose(woe_table.loc['C', 'woe'], np.log( (1/4) / (3/5))) 46 | 47 | def test_woe_transform(woe_data): 48 | 49 | encoder = WoeEncoder() 50 | encoder.fit(woe_data[['feature']], woe_data['target']) 51 | transformed = encoder.transform(woe_data[['feature']]) 52 | 53 | # Correct WoE is log(% of 0s / % of 1s) 54 | # P(0|A) = 2/4 = 0.5, P(1|A) = 1/5 = 0.2 -> log(0.5/0.2) is not right 55 | # It should be log ( (count_0 / total_0) / (count_1 / total_1) ) 56 | total_0 = 4 57 | total_1 = 5 58 | 59 | p0_A = (2/total_0) 60 | p1_A = (1/total_1) 61 | expected_A = np.log(p0_A / p1_A) 62 | 63 | p0_B = (1/total_0) 64 | p1_B = 
(1/total_1) 65 | expected_B = np.log(p0_B / p1_B) 66 | 67 | p0_C = (1/total_0) 68 | p1_C = (3/total_1) 69 | expected_C = np.log(p0_C / p1_C) 70 | 71 | assert np.isclose(transformed['feature'].iloc[0], expected_A) 72 | assert np.isclose(transformed['feature'].iloc[3], expected_B) 73 | assert np.isclose(transformed['feature'].iloc[5], expected_C) 74 | 75 | def test_woe_inverse_transform(woe_data): 76 | 77 | encoder = WoeEncoder() 78 | encoder.fit(woe_data[['feature']], woe_data['target']) 79 | transformed = encoder.transform(woe_data[['feature']]) 80 | inversed = encoder.inverse_transform(transformed) 81 | 82 | pd.testing.assert_frame_equal(inversed, woe_data[['feature']]) 83 | 84 | def test_woe_with_missing_values(): 85 | 86 | data = { 87 | 'feature': ['A', 'A', 'B', 'MISSING'], 88 | 'target': [0, 1, 0, 1] 89 | } 90 | df = pd.DataFrame(data) 91 | encoder = WoeEncoder() 92 | encoder.fit(df[['feature']], df['target']) 93 | transformed = encoder.transform(df[['feature']]) 94 | 95 | assert 'MISSING' in encoder._woe_encoding_map['feature'] 96 | assert not transformed.isnull().values.any() 97 | -------------------------------------------------------------------------------- /woe_credit_scoring/encoder.py: -------------------------------------------------------------------------------- 1 | from typing import Dict 2 | from collections import ChainMap 3 | import numpy as np 4 | import pandas as pd 5 | 6 | class WoeEncoder: 7 | """ 8 | WoeEncoder is a class for encoding discrete features into Weight of Evidence (WoE) values. 9 | 10 | WoE is a commonly used technique in credit scoring and other binary classification problems. 11 | It transforms categorical features into continuous values based on the log odds of the target variable. 12 | 13 | This class provides methods to fit the WoE transformation based on input data, transform new data using the learned WoE encoding, 14 | and inverse transform WoE encoded data back to the original categorical values. 15 | 16 | Attributes: 17 | features (list): List of feature names to be encoded. 18 | _woe_encoding_map (dict): Dictionary mapping features to their WoE encoding. 19 | __is_fitted (bool): Flag indicating whether the encoder has been fitted. 20 | _woe_reverse_map (dict): Dictionary mapping WoE values back to original feature values. 21 | 22 | Reference: 23 | For more details on the Weight of Evidence (WoE) encoding, see 24 | http://listendata.com/2015/03/weight-of-evidence-woe-and-information.html 25 | """ 26 | 27 | def __init__(self) -> None: 28 | self.features = None 29 | self._woe_encoding_map = None 30 | self.__is_fitted = False 31 | self._woe_reverse_map = None 32 | 33 | def fit(self, X: pd.DataFrame, y: pd.Series, target_col: str = 'binary_target') -> None: 34 | """Learns WoE encoding. 35 | 36 | Args: 37 | X (pd.DataFrame): Data with discrete features. 38 | y (pd.Series): Dichotomic response. 39 | target_col (str): Name of the target column to be created in the dataframe. 40 | """ 41 | aux = X.copy() 42 | self.features = list(aux.columns) 43 | aux[target_col] = y 44 | self._woe_encoding_map = dict(ChainMap( 45 | *map(lambda feature: self._woe_transformation(aux, feature, target_col), self.features))) 46 | self.__is_fitted = True 47 | 48 | @staticmethod 49 | def _woe_transformation(X: pd.DataFrame, feature: str, bin_target: str) -> Dict[str, Dict]: 50 | """Calculates WoE Map between discrete space and log odds space. 51 | 52 | Args: 53 | X (pd.DataFrame): Discrete data including dichotomic response feature. 
54 | feature (str): Name of the feature for getting the map. 55 | bin_target (str): Name of the dichotomic response feature. 56 | 57 | Returns: 58 | dict: Key is the name of the feature, value is the WoE Map. 59 | 60 | Raises: 61 | ValueError: If bin_target column has more than 2 categories. 62 | """ 63 | if X[bin_target].nunique() != 2: 64 | raise ValueError( 65 | f"The target column '{bin_target}' must have exactly 2 unique values.") 66 | 67 | aux = X[[feature, bin_target]].copy().assign(n_row=1) 68 | aux = aux.pivot_table(index=feature, columns=bin_target, 69 | values='n_row', aggfunc='sum', fill_value=0) 70 | aux /= aux.sum() 71 | aux['woe'] = np.log(aux[0] / aux[1]) 72 | aux = aux.drop(columns=[0, 1]) 73 | return {feature: aux['woe'].to_dict()} 74 | 75 | def transform(self, X: pd.DataFrame) -> pd.DataFrame: 76 | """Performs WoE transformation. 77 | 78 | Args: 79 | X (pd.DataFrame): Discrete data to be transformed. 80 | 81 | Raises: 82 | Exception: If fit method not called previously. 83 | 84 | Returns: 85 | pd.DataFrame: WoE encoded data. 86 | """ 87 | if not self.__is_fitted: 88 | raise Exception( 89 | 'Please call fit method first with the required parameters') 90 | 91 | aux = X.copy() 92 | for feature, woe_map in self._woe_encoding_map.items(): 93 | aux[feature] = aux[feature].replace(woe_map) 94 | return aux 95 | 96 | def inverse_transform(self, X: pd.DataFrame) -> pd.DataFrame: 97 | """Performs Inverse WoE transformation. 98 | 99 | Args: 100 | X (pd.DataFrame): WoE data to be transformed. 101 | 102 | Raises: 103 | Exception: If fit method not called previously. 104 | 105 | Returns: 106 | pd.DataFrame: WoE encoded data. 107 | """ 108 | if not self.__is_fitted: 109 | raise Exception( 110 | 'Please call fit method first with the required parameters') 111 | 112 | aux = X.copy() 113 | self._woe_reverse_map = {feature: {v: k for k, v in woe_map.items( 114 | )} for feature, woe_map in self._woe_encoding_map.items()} 115 | for feature, woe_map in self._woe_reverse_map.items(): 116 | aux[feature] = aux[feature].replace(woe_map) 117 | return aux 118 | -------------------------------------------------------------------------------- /woe_credit_scoring/scoring.py: -------------------------------------------------------------------------------- 1 | from typing import Optional, Dict 2 | from collections import ChainMap 3 | import numpy as np 4 | import pandas as pd 5 | from sklearn.linear_model import LogisticRegression 6 | import logging 7 | from .encoder import WoeEncoder 8 | 9 | logger = logging.getLogger("CreditScoringToolkit") 10 | 11 | class CreditScoring: 12 | """ 13 | Implements credit risk scorecards following the methodology proposed in 14 | Siddiqi, N. (2012). Credit risk scorecards: developing and implementing intelligent credit scoring (Vol. 3). John Wiley & Sons. 15 | 16 | This class provides methods to fit a logistic regression model to the provided data, 17 | transform the data using Weight of Evidence (WoE) encoding, and generate a scorecard 18 | that maps the model's coefficients to a scoring system. The scorecard can then be used 19 | to convert new data into credit scores. 20 | 21 | Attributes: 22 | logistic_regression (Optional[LogisticRegression]): Fitted logistic regression model. 23 | pdo (Optional[int]): Points to Double the Odds. 24 | base_odds (Optional[int]): Base odds at the base score. 25 | base_score (Optional[int]): Base score for calibration. 26 | betas (Optional[list]): Coefficients of the logistic regression model. 
27 | alpha (Optional[float]): Intercept of the logistic regression model. 28 | factor (Optional[float]): Factor used in score calculation. 29 | offset (Optional[float]): Offset used in score calculation. 30 | features (Optional[Dict[str, float]]): Mapping of feature names to their coefficients. 31 | n (Optional[int]): Number of features. 32 | scorecard (Optional[pd.DataFrame]): DataFrame containing the scorecard. 33 | scoring_map (Optional[Dict[str, Dict[str, int]]]): Mapping of features to their score mappings. 34 | __is_fitted (bool): Indicates whether the model has been fitted. 35 | """ 36 | 37 | logistic_regression: Optional[LogisticRegression] = None 38 | pdo: Optional[int] = None 39 | base_odds: Optional[int] = None 40 | base_score: Optional[int] = None 41 | betas: Optional[list] = None 42 | alpha: Optional[float] = None 43 | factor: Optional[float] = None 44 | offset: Optional[float] = None 45 | features: Optional[Dict[str, float]] = None 46 | n: Optional[int] = None 47 | scorecard: Optional[pd.DataFrame] = None 48 | scoring_map: Optional[Dict[str, Dict[str, int]]] = None 49 | __is_fitted: bool = False 50 | 51 | def __init__(self, pdo: int = 20, base_score: int = 400, base_odds: int = 1) -> None: 52 | """Initializes Credit Scoring object. 53 | 54 | Args: 55 | pdo (int, optional): Points to Double the Odd's _. Defaults to 20. 56 | base_score (int, optional): Default score for calibration. Defaults to 400. 57 | base_odds (int, optional): Odd's base at base_score . Defaults to 1. 58 | """ 59 | self.pdo = pdo 60 | self.base_score = base_score 61 | self.base_odds = base_odds 62 | self.factor = self.pdo / np.log(2) 63 | self.offset = self.base_score - self.factor * np.log(self.base_odds) 64 | 65 | @staticmethod 66 | def _get_scorecard(X: pd.DataFrame, feature: str) -> pd.DataFrame: 67 | """Generates scorecard points for a given feature 68 | 69 | Args: 70 | X (pd.DataFrame): Feature Data 71 | feature (str): Predictor 72 | 73 | Returns: 74 | pd.DataFrame: Feature, Attribute and respective points 75 | """ 76 | sc = X[[feature, f'P_{feature}']].copy( 77 | ).drop_duplicates().reset_index(drop=True) 78 | sc = sc.rename(columns={feature: 'attribute', 79 | f'P_{feature}': 'points'}) 80 | sc.insert(0, 'feature', feature) 81 | return sc 82 | 83 | def fit(self, Xw: pd.DataFrame, woe_encoder: WoeEncoder, logistic_regression: LogisticRegression) -> None: 84 | """Learns scoring map 85 | 86 | Args: 87 | Xw (pd.DataFrame): WoE transformed data 88 | woe_encoder (WoeEncoder): WoE encoder fitted object 89 | logistic_regression (LogisticRegression): Fitted logistic regression model 90 | """ 91 | X = Xw.copy() 92 | self.betas = list(logistic_regression.coef_[0]) 93 | self.alpha = logistic_regression.intercept_[0] 94 | self.features = dict(zip(Xw.columns, self.betas)) 95 | self.n = len(self.betas) 96 | for feature, beta in self.features.items(): 97 | X[f'P_{feature}'] = np.floor( 98 | (-X[feature] * beta + self.alpha / self.n) * self.factor + self.offset / self.n).astype(int) 99 | features = list(self.features.keys()) 100 | X[features] = woe_encoder.inverse_transform(X[features]) 101 | self.scorecard = pd.concat( 102 | map(lambda f: self._get_scorecard(X, f), features)) 103 | self.scorecard = self.scorecard.groupby(['feature', 'attribute']).max() 104 | self.scoring_map = dict(ChainMap(*[{f: d[['attribute', 'points']].set_index('attribute')[ 105 | 'points'].to_dict()} for f, d in self.scorecard.reset_index().groupby('feature')])) 106 | self.__is_fitted = True 107 | 108 | def transform(self, X: pd.DataFrame) -> 
pd.DataFrame: 109 | """Converts discrete data to scores 110 | 111 | Args: 112 | X (pd.DataFrame): Discrete predictor data 113 | 114 | Raises: 115 | Exception: If fit method is not called first. 116 | Exception: If a fitted feature is not present in data. 117 | 118 | Returns: 119 | pd.DataFrame: Total score and scores for each feature 120 | """ 121 | if not self.__is_fitted: 122 | raise Exception( 123 | 'Please call fit method first with the required parameters') 124 | else: 125 | aux = X.copy() 126 | features = list(self.scoring_map.keys()) 127 | non_present_features = [ 128 | f for f in features if f not in aux.columns] 129 | if len(non_present_features) > 0: 130 | logger.exception( 131 | f'{",".join(non_present_features)} feature{"s" if len(non_present_features) > 1 else ""} not present in data') 132 | raise Exception("Missing features") 133 | else: 134 | for feature, points_map in self.scoring_map.items(): 135 | aux[feature] = aux[feature].replace(points_map) 136 | aux['score'] = aux[features].sum(axis=1) 137 | return aux 138 | -------------------------------------------------------------------------------- /woe_credit_scoring/normalizer.py: -------------------------------------------------------------------------------- 1 | from typing import Dict 2 | from collections import ChainMap 3 | from itertools import repeat 4 | import numpy as np 5 | import pandas as pd 6 | 7 | class DiscreteNormalizer: 8 | """ 9 | DiscreteNormalizer is a class for normalizing discrete data based on a specified relative frequency threshold. 10 | 11 | This class provides methods to fit a normalization model to discrete data and transform the data according to the learned normalization mapping. 12 | It handles missing values by assigning them to a specific category and groups infrequent categories into a default category. 13 | If the default category does not meet the relative frequency threshold, it is mapped to the most frequent category. 14 | 15 | Attributes: 16 | MISSING_VALUE (str): Placeholder for missing values. 17 | DEFAULT_THRESHOLD (float): Default threshold for considering a category as relevant. 18 | normalization_threshold (float): Threshold for considering a category as relevant. 19 | default_category (str): Name for the default grouping/new categories. 20 | normalization_map (dict): Mapping of original categories to normalized categories. 21 | features (list): List of feature names in the input data. 22 | new_categories (dict): Dictionary of new categories identified during transformation. 23 | X (pd.DataFrame): The input data used for fitting the model. 24 | __is_fitted (bool): Flag indicating whether the model has been fitted. 25 | 26 | Methods: 27 | fit(X): Learns discrete normalization mapping from the input data. 28 | transform(X): Transforms discrete data into its normalized form. 29 | _prepare_feature(feature): Prepares a feature by filling missing values and converting to string. 30 | _get_normalization_map(X, feature, threshold, default_category): Creates the normalization map for a given feature. 31 | """ 32 | MISSING_VALUE = 'MISSING' 33 | DEFAULT_THRESHOLD = 0.05 34 | 35 | def __init__(self, normalization_threshold: float = DEFAULT_THRESHOLD, default_category: str = 'OTHER') -> None: 36 | """ 37 | Args: 38 | normalization_threshold (float, optional): Threshold for considering a category as relevant. Defaults to 0.05. 39 | default_category (str, optional): Given name for the default grouping/new categories. Defaults to 'OTHER'. 
40 | """ 41 | self.__is_fitted = False 42 | self.normalization_threshold = normalization_threshold 43 | self.default_category = default_category 44 | self.normalization_map = None 45 | self.features = None 46 | self.new_categories = {} 47 | self.X = None 48 | 49 | def fit(self, X: pd.DataFrame) -> None: 50 | """Learns discrete normalization mapping taking into account the following rules: 51 | 1. All missing values will be filled with the category 'MISSING' 52 | 2. Categories which relative frequency is less than normalization threshold will be mapped to default_category 53 | 3. If default_category as a group doesn't reach the relative frequency threshold, then it will be mapped to the most frequent category 54 | 55 | Args: 56 | X (pd.DataFrame): Data to be normalized 57 | 58 | Raises: 59 | TypeError: If provided data is not a pandas DataFrame object 60 | """ 61 | if not isinstance(X, pd.DataFrame): 62 | raise TypeError('Please use a Pandas DataFrame object') 63 | 64 | self.X = X.copy() 65 | self.features = list(self.X.columns) 66 | self.normalization_map = {} 67 | 68 | for feat in self.features: 69 | self.X[feat] = self._prepare_feature(self.X[feat]) 70 | 71 | self.normalization_map = dict(ChainMap(*map( 72 | lambda feat: self._get_normalization_map( 73 | self.X, feat, self.normalization_threshold, self.default_category), 74 | self.features 75 | ))) 76 | self.__is_fitted = True 77 | 78 | @staticmethod 79 | def _prepare_feature(feature: pd.Series) -> pd.Series: 80 | """Prepares a feature by filling missing values and converting to string.""" 81 | return feature.fillna(DiscreteNormalizer.MISSING_VALUE).astype(str) 82 | 83 | @staticmethod 84 | def _get_normalization_map(X: pd.DataFrame, feature: str, threshold: float, default_category: str) -> Dict: 85 | """Creates the normalization map and the list of existing categories for a given feature. 86 | 87 | Args: 88 | X (pd.DataFrame): Data with discrete features 89 | feature (str): Feature to be analyzed 90 | threshold (float): Threshold for considering a category as relevant. Defaults to 0.05. 91 | default_category (str): Given name for the default grouping/new categories. Defaults to 'OTHER'. 92 | 93 | Returns: 94 | dict: Feature is the key and value is a dictionary which keys are the replacement map and the list of existing categories. 95 | """ 96 | aux = X[feature].value_counts(normalize=True).to_frame() 97 | aux.columns = [feature] 98 | aux['mapping'] = np.where( 99 | aux[feature] < threshold, default_category, aux.index) 100 | mode = aux.head(1)['mapping'].values[0] 101 | 102 | if aux.loc[aux['mapping'] == default_category][feature].sum() < threshold: 103 | aux['mapping'] = aux['mapping'].replace({default_category: mode}) 104 | 105 | aux = aux.drop(feature, axis=1) 106 | return { 107 | feature: { 108 | 'replacement_map': aux.loc[aux.index != aux['mapping']]['mapping'].to_dict(), 109 | 'existing_categories': list(aux.index), 110 | 'mode': mode 111 | } 112 | } 113 | 114 | def transform(self, X: pd.DataFrame) -> pd.DataFrame: 115 | """Transforms discrete data into its normalized form. 
116 | 117 | Args: 118 | X (pd.DataFrame): Data to be transformed 119 | 120 | Raises: 121 | Exception: If fit method not called previously 122 | Exception: If features analyzed during fit are not present in X 123 | 124 | Returns: 125 | pd.DataFrame: Normalized discrete data 126 | """ 127 | if not self.__is_fitted: 128 | raise Exception( 129 | 'Please call fit method first with the required parameters') 130 | 131 | aux = X.copy() 132 | features = list(self.normalization_map.keys()) 133 | non_present_features = [f for f in features if f not in X.columns] 134 | 135 | if non_present_features: 136 | raise Exception( 137 | f"Missing features: {', '.join(non_present_features)}") 138 | 139 | for feat in features: 140 | aux[feat] = self._prepare_feature(aux[feat]) 141 | mapping = self.normalization_map[feat]['replacement_map'] 142 | existing_categories = self.normalization_map[feat]['existing_categories'] 143 | new_categories = [ 144 | cat for cat in aux[feat].unique() if cat not in existing_categories] 145 | 146 | if new_categories: 147 | self.new_categories.update({feat: new_categories}) 148 | replacement = self.default_category if self.default_category in existing_categories else self.normalization_map[ 149 | feat]['mode'] 150 | aux[feat] = aux[feat].replace( 151 | dict(zip(new_categories, repeat(replacement)))) 152 | 153 | aux[feat] = aux[feat].replace(mapping) 154 | 155 | return aux 156 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 |
3 | 4 | 5 | 6 | [![Contributors][contributors-shield]][contributors-url] 7 | 8 | [![Forks][forks-shield]][forks-url] 9 | 10 | [![Stargazers][stars-shield]][stars-url] 11 | 12 | [![Issues][issues-shield]][issues-url] 13 | 14 | [![GPLv3 License][license-shield]][license-url] 15 | 16 | [![LinkedIn][linkedin-shield]][linkedin-url] 17 | 18 | 19 | 20 |

Credit Scoring Toolkit

21 | 22 | 23 | 24 |

25 | 
 26 | In finance, it is common practice to create risk scorecards to assess the creditworthiness of a given customer. Unfortunately, out-of-the-box credit scoring tools are expensive and scattered, which is why we created this toolkit: to empower credit scoring practitioners and spread the use of weight-of-evidence-based scoring techniques to alternative use cases (virtually any binary classification problem). 
 27 | 
 28 | 
29 | Explore the documentation» 30 |
31 | Report Bug 32 | 33 | Request Feature 34 |

35 | 36 | 37 | 38 | 39 |
40 | Table of Contents 41 |
    42 |
  1. About The Project
     1. Discrete Normalizer
     2. Discretizer
     3. WoeEncoder
     4. WoeBaseFeatureSelector
     5. WoeContinuousFeatureSelector
     6. WoeDiscreteFeatureSelector
     7. CreditScoring
     8. IVCalculator
     9. Built With
  2. Installation
  3. Usage
  4. Contributing
  5. License
  6. Contact
  7. Citing
  8. Acknowledgments
62 |
63 | 
 64 | 
 65 | ## About The Project 
 66 | 
 67 | The general process for creating Weight of Evidence based scorecards is illustrated in the figure below: 
 68 | 
 69 | ![Scoring method](images/scoring_method.png) 
 70 | 
 71 | To that end, we implemented the following classes to address the necessary steps to perform 
 72 | credit scoring transformation: 
 73 | 
 74 | ### DiscreteNormalizer 
 75 | Class for normalizing discrete data for a given relative frequency threshold 
 76 | ### Discretizer 
 77 | Class for discretizing continuous data into bins using several methods 
 78 | ### WoeEncoder 
 79 | Class for encoding discrete features into the Weight of Evidence (WoE) transformation 
 80 | ### WoeBaseFeatureSelector 
 81 | Base class for selecting features based on their WoE transformation and 
 82 | Information Value statistic. 
 83 | ### WoeContinuousFeatureSelector 
 84 | Class for selecting continuous features based on their WoE transformation and 
 85 | Information Value statistic. 
 86 | ### WoeDiscreteFeatureSelector 
 87 | Class for selecting discrete features based on their WoE transformation and 
 88 | Information Value statistic. 
 89 | ### CreditScoring 
 90 | Implements credit risk scorecards following the methodology proposed in 
 91 | Siddiqi, N. (2012). Credit risk scorecards: developing and implementing intelligent credit scoring (Vol. 3). John Wiley & Sons. 
 92 | ### IVCalculator 
 93 | A utility class to quickly calculate Information Value (IV) for both continuous and discrete features. This class provides a simple interface that abstracts away the manual steps of discretization and normalization, making it easy to assess feature predictive power. 
 94 | 
 95 | ### Built With 
 96 | 
 97 | * [Python](https://www.python.org/) 
 98 | * [Numpy](https://numpy.org/) 
 99 | * [Pandas](https://pandas.pydata.org/) 
 100 | * [Jupyter](https://jupyter.org/) 
 101 | * [Scikit-Learn](https://scikit-learn.org/stable/) 
 102 | * [Matplotlib](https://matplotlib.org/) 
 103 | * [Seaborn](https://seaborn.pydata.org/) 
 104 | * [SciPy](https://scipy.org/) 
 105 | 
 106 | 
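The snippet below is a minimal sketch of how the classes above chain together into a manual scorecard pipeline; it mirrors the flow exercised in `tests/integration/test_manual_pipeline.py`. The CSV paths, the `TARGET` column and the `D_`/`C_` prefixes come from the bundled example data and are purely illustrative.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from CreditScoringToolkit import (
    DiscreteNormalizer,
    WoeDiscreteFeatureSelector,
    WoeContinuousFeatureSelector,
    WoeEncoder,
    CreditScoring,
)

train = pd.read_csv('example_data/train.csv')
vard = [v for v in train.columns if v.startswith('D_')]  # discrete features
varc = [v for v in train.columns if v.startswith('C_')]  # continuous features

# 1. Normalize rare and missing categories in the discrete features
dn = DiscreteNormalizer(normalization_threshold=0.05)
dn.fit(train[vard])
train_norm = dn.transform(train[vard])

# 2. Keep only features whose Information Value clears a threshold
wdf = WoeDiscreteFeatureSelector()
wdf.fit(train_norm, train['TARGET'], iv_threshold=0.1)
wcf = WoeContinuousFeatureSelector()
wcf.fit(train[varc], train['TARGET'], method='quantile', iv_threshold=0.1, max_bins=5)
selected = pd.concat([wdf.transform(train_norm), wcf.transform(train[varc])], axis=1)

# 3. WoE-encode the selected features, fit a logistic regression, build the scorecard
we = WoeEncoder()
we.fit(selected, train['TARGET'])
train_woe = we.transform(selected)

lr = LogisticRegression()
lr.fit(train_woe, train['TARGET'])

cs = CreditScoring(pdo=20, base_score=400, base_odds=1)
cs.fit(train_woe, we, lr)

scores = cs.transform(selected)  # per-feature points plus a total 'score' column
print(cs.scorecard.head())
```

If you prefer not to wire these steps together yourself, the `AutoCreditScoring` class shown in the Usage section runs the same sequence in a single call.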

(back to top)

107 | 108 | 109 | 110 | ## Installation 111 | 112 | You can simply install the module using pip 113 | 114 | * pip 115 | 116 | ```sh 117 | 118 | pip install woe-credit-scoring 119 | 120 | ``` 121 | 122 | 123 |
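One note on imports, based on this repository's layout: the top-level `CreditScoringToolkit` module is a thin shim that simply does `from woe_credit_scoring import *`, so the same classes are available under either name.

```python
# Both imports resolve to the same class; use whichever style you prefer.
from woe_credit_scoring import AutoCreditScoring
from CreditScoringToolkit import AutoCreditScoring  # legacy style, used in the Usage examples below
```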

(back to top)

124 | 125 | ## Usage 126 | 127 | The new `AutoCreditScoring` class provides a streamlined way to train a credit scoring model, generate reports, and make predictions. Here's a quick example of how to use it: 128 | 129 | ### Dependencies 130 | 131 | ```python 132 | import pandas as pd 133 | from CreditScoringToolkit import AutoCreditScoring 134 | import warnings 135 | warnings.filterwarnings("ignore", category=UserWarning, module="sklearn.preprocessing._discretization") 136 | ``` 137 | 138 | ### Reading example data 139 | 140 | ```python 141 | # Read example data for train and validation (loan applications) 142 | train = pd.read_csv('example_data/train.csv') 143 | valid = pd.read_csv('example_data/valid.csv') 144 | ``` 145 | 146 | ### Defining feature type 147 | 148 | ```python 149 | # Assign features lists by type 150 | vard = [v for v in train.columns if v.startswith('D_')] 151 | varc = [v for v in train.columns if v.startswith('C_')] 152 | ``` 153 | 154 | ### Automated Credit Scoring 155 | 156 | The `AutoCreditScoring` class handles the entire workflow, from feature selection and WoE transformation to model training and scoring. 157 | 158 | ```python 159 | # If you prefer, use AutoCreditScoring class to perform all the steps in a single call with additional features 160 | # like outlier detection and treatment, feature selection, reporting and more. 161 | from CreditScoringToolkit import AutoCreditScoring 162 | 163 | kwargs = {'iv_feature_threshold':0.05, 164 | 'max_discretization_bins':6, 165 | 'strictly_monotonic':True, 166 | 'create_reporting':True, 167 | 'discretization_method':'dcc'} 168 | acs = AutoCreditScoring(train,'TARGET',varc,vard) 169 | acs.fit(**kwargs) 170 | 171 | # You can also save the reports to a folder in PNG format 172 | acs.save_reports('reports') 173 | ``` 174 | 175 | This will generate several reports, including: 176 | 177 | - Score distribution histograms and KDE plots 178 | - Event rate by score range plots 179 | - Feature importance based on Information Value 180 | - ROC curve for the model 181 | 182 | ![png](images/score_histogram.png) 183 | ![png](images/score_kde.png) 184 | ![png](images/event_range_10.png) 185 | ![png](images/feature_importance.png) 186 | ![png](images/roc_auc_curve.png) 187 | 188 | ### Making Predictions 189 | 190 | Once the model is trained, you can use the `predict` method to score new data. 191 | 192 | ```python 193 | predictions = acs.predict(valid) 194 | predictions.head() 195 | ``` 196 | 197 | This will return a DataFrame with the individual point contributions for each feature (`pts_*` columns) and the final score. 198 | 199 | ### IV Calculator 200 | 201 | The `IVCalculator` class provides a quick and easy way to calculate Information Value (IV) for your features without going through the entire credit scoring workflow. This is useful for initial feature analysis and selection. 
202 | 203 | ```python 204 | from CreditScoringToolkit import IVCalculator 205 | 206 | # Initialize IVCalculator with your data 207 | iv_calculator = IVCalculator( 208 | data=train, 209 | target='TARGET', 210 | continuous_features=varc, 211 | discrete_features=vard 212 | ) 213 | 214 | # Calculate IV for all features 215 | iv_report = iv_calculator.calculate_iv( 216 | max_discretization_bins=5, 217 | strictly_monotonic=False, 218 | discretization_method='quantile', 219 | discrete_normalization_threshold=0.05 220 | ) 221 | 222 | # Display the report 223 | print(iv_report) 224 | ``` 225 | 226 | The output will be a DataFrame with columns: 227 | - `feature`: Feature name 228 | - `iv`: Information Value 229 | - `feature_type`: 'continuous' or 'discrete' 230 | 231 | This allows you to quickly identify which features have the most predictive power before building your full credit scoring model. 232 | 233 |
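As a quick follow-up, the report is a plain DataFrame, so it can be sorted and filtered directly; the `0.1` cut-off below is only an illustrative threshold, echoing the `iv_feature_threshold` values used elsewhere in this README.

```python
# Rank features by Information Value and keep those above an illustrative cut-off
shortlist = iv_report.sort_values('iv', ascending=False)
shortlist = shortlist[shortlist['iv'] >= 0.1]
print(shortlist[['feature', 'iv', 'feature_type']])
```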

(back to top)

234 | 235 | 236 | 237 | 238 | 239 | ## Contributing 240 | 241 | If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". 242 | 243 | Don't forget to give the project a star! Thanks again! 244 | 245 | 1. Fork the Project 246 | 247 | 2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`) 248 | 249 | 3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`) 250 | 251 | 4. Push to the Branch (`git push origin feature/AmazingFeature`) 252 | 253 | 5. Open a Pull Request 254 | 255 |

(back to top)

256 | 
 257 | ## License 
 258 | 
 259 | Distributed under the GNU General Public License v3.0. See `LICENSE` for more information. 
 260 | 

(back to top)

261 | 262 | ## Contact 263 | 264 | José G Fuentes - [@jgusteacher](https://twitter.com/jgusteacher) - jose.gustavo.fuentes@comunidad.unam.mx 265 | 266 | 267 | Project Link: [https://github.com/JGFuentesC/woe_credit_scoring](https://github.com/JGFuentesC/woe_credit_scoring) 268 | 269 |

(back to top)

270 | 
 271 | ## Citing 
 272 | If you use this software in scientific publications, we would appreciate citations to the following paper: 
 273 | 
 274 | [Combination of Unsupervised Discretization Methods for Credit Risk](https://journals.plos.org/plosone/article/authors?id=10.1371/journal.pone.0289130) José G. Fuentes Cabrera, Hugo A. Pérez Vicente, Sebastián Maldonado, Jonás Velasco 
 275 | 
 276 | 

(back to top)

277 | 278 | ## Acknowledgments 279 | 280 | 281 | * [Siddiqi, N. (2012). Credit risk scorecards: developing and implementing intelligent credit scoring (Vol. 3). John Wiley & Sons.](https://books.google.com.mx/books?hl=es&lr=&id=SEbCeN3-kEUC&oi=fnd&pg=PT7&dq=siddiqi&ots=RvTR0RbOlQ&sig=_V4Iz1q_Hi_GwLAxrp-7tuHrOWY&redir_esc=y#v=onepage&q=siddiqi&f=false). For his amazing textbook. 282 | 283 | * [@othneildrew](https://github.com/othneildrew/Best-README-Template). For his amazing README template 284 | 285 | * [Demo data](https://www.kaggle.com/code/gauravduttakiit/risk-analytics-in-banking-financial-services-1/data). For providing example data. 286 | 287 | 288 |

(back to top)

289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | [contributors-shield]: https://img.shields.io/github/contributors/JGFuentesC/woe_credit_scoring.svg?style=for-the-badge 299 | 300 | [contributors-url]: https://github.com/JGFuentesC/woe_credit_scoring/graphs/contributors 301 | 302 | [forks-shield]: https://img.shields.io/github/forks/JGFuentesC/woe_credit_scoring.svg?style=for-the-badge 303 | 304 | [forks-url]: https://github.com/JGFuentesC/woe_credit_scoring/network/members 305 | 306 | [stars-shield]: https://img.shields.io/github/stars/JGFuentesC/woe_credit_scoring.svg?style=for-the-badge 307 | 308 | [stars-url]: https://github.com/JGFuentesC/woe_credit_scoring/stargazers 309 | 310 | [issues-shield]: https://img.shields.io/github/issues/JGFuentesC/woe_credit_scoring.svg?style=for-the-badge 311 | 312 | [issues-url]: https://github.com/JGFuentesC/woe_credit_scoring/issues 313 | 314 | [license-shield]: https://img.shields.io/github/license/JGFuentesC/woe_credit_scoring.svg?style=for-the-badge 315 | 316 | [license-url]: https://github.com/JGFuentesC/woe_credit_scoring/blob/master/LICENSE.txt 317 | 318 | [linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=for-the-badge&logo=linkedin&colorB=555 319 | 320 | [linkedin-url]: https://linkedin.com/in/josegustavofuentescabrera 321 | 322 | 323 | 324 | 325 | -------------------------------------------------------------------------------- /woe_credit_scoring/binning.py: -------------------------------------------------------------------------------- 1 | from typing import Dict, List, Union, Tuple, Optional 2 | from multiprocessing import Pool 3 | from functools import reduce 4 | import numpy as np 5 | import pandas as pd 6 | from sklearn.preprocessing import KBinsDiscretizer 7 | from sklearn.mixture import GaussianMixture 8 | import logging 9 | from .base import WoeBaseFeatureSelector 10 | from .normalizer import DiscreteNormalizer 11 | 12 | logger = logging.getLogger("CreditScoringToolkit") 13 | 14 | class Discretizer: 15 | """ 16 | Discretizer class for transforming continuous data into discrete bins. 17 | 18 | This class provides methods to fit a discretization model to continuous data and transform the data into discrete bins. 19 | It supports multiple discretization strategies including 'uniform', 'quantile', 'kmeans', and 'gaussian'. 20 | The class uses parallel processing to speed up the computation when dealing with large datasets. 21 | 22 | Attributes: 23 | min_segments (int): Minimum number of bins to create. 24 | max_segments (int): Maximum number of bins to create. 25 | strategy (str): Discretization strategy to use. 26 | X (pd.DataFrame): The input data used for fitting the model. 27 | features (List[str]): List of feature names in the input data. 28 | edges_map (Dict): Dictionary mapping features to their respective bin edges. 29 | __is_fitted (bool): Flag indicating whether the model has been fitted. 30 | 31 | Methods: 32 | _make_pool(func, params, threads): Executes a function with a set of parameters using pooling threads. 33 | fit(X, n_threads): Learns discretization edges from the input data. 34 | transform(X, n_threads): Transforms continuous data into its discrete form. 35 | _discretize(X, feature, nbins, strategy): Discretizes a series into a specified number of bins using the given strategy. 36 | _encode(X, feature, nbins, edges, strategy): Encodes a continuous feature into a discrete bin. 
37 | """ 38 | 39 | def __init__(self, min_segments: int = 2, max_segments: int = 5, strategy: str = 'quantile') -> None: 40 | self.__is_fitted = False 41 | self.X = None 42 | self.min_segments = min_segments 43 | self.max_segments = max_segments 44 | self.strategy = strategy 45 | self.features = None 46 | self.edges_map = {} 47 | 48 | @staticmethod 49 | def _make_pool(func, params: List[Tuple], threads: int) -> List: 50 | """ 51 | Executes a function with a set of parameters using pooling threads. 52 | 53 | Args: 54 | func (function): Function to be executed. 55 | params (list): List of tuples, each tuple is a parameter combination. 56 | threads (int): Number of pooling threads to use. 57 | 58 | Returns: 59 | list: All execution results in a list. 60 | """ 61 | with Pool(threads) as pool: 62 | data = pool.starmap(func, params) 63 | return data 64 | 65 | def fit(self, X: pd.DataFrame, n_threads: int = 1) -> None: 66 | """ 67 | Learns discretization edges. 68 | 69 | Args: 70 | X (pd.DataFrame): Data to be discretized. 71 | n_threads (int, optional): Number of pooling threads. Defaults to 1. 72 | """ 73 | self.X = X.copy() 74 | self.features = list(self.X.columns) 75 | self.edges_map = self._make_pool( 76 | self._discretize, 77 | [(self.X, feat, nbins, self.strategy) for feat in self.features for nbins in range( 78 | self.min_segments, self.max_segments + 1)], 79 | threads=n_threads 80 | ) 81 | self.__is_fitted = True 82 | 83 | @staticmethod 84 | def _discretize(X: pd.DataFrame, feature: str, nbins: int, strategy: str) -> Dict: 85 | """ 86 | Discretizes a series in a particular number of bins using the given strategy. 87 | 88 | Args: 89 | X (pd.DataFrame): Data to be discretized. 90 | feature (str): Feature name. 91 | nbins (int): Number of expected bins. 92 | strategy (str): {'uniform', 'quantile', 'kmeans', 'gaussian'}, discretization method to be used. 93 | 94 | Returns: 95 | dict: Discretized data. 96 | 97 | Reference: 98 | For more details on the discretization strategies, see 99 | https://journals.plos.org/plosone/article/authors?id=10.1371/journal.pone.0289130 100 | """ 101 | aux = X[[feature]].copy() 102 | has_missing = aux[feature].isnull().any() 103 | if has_missing: 104 | nonmiss = aux.dropna().reset_index(drop=True) 105 | else: 106 | nonmiss = aux.copy() 107 | 108 | if strategy != 'gaussian': 109 | if nonmiss[feature].nunique() > 1: 110 | n_bins = min(nbins, nonmiss[feature].nunique()) 111 | kb = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy=strategy) 112 | kb.fit(nonmiss[[feature]]) 113 | edges = list(kb.bin_edges_[0]) 114 | return {'feature': feature, 'nbins': nbins, 'edges': [-np.inf] + edges[1:-1] + [np.inf]} 115 | else: 116 | edges = [-np.inf, np.inf] 117 | return {'feature': feature, 'nbins': nbins, 'edges': edges} 118 | else: 119 | gm = GaussianMixture(n_components=nbins) 120 | gm.fit(nonmiss[[feature]]) 121 | nonmiss['cluster'] = gm.predict(nonmiss[[feature]]) 122 | edges = nonmiss.groupby('cluster')[feature].agg( 123 | ['min', 'max']).sort_values(by='min') 124 | edges = sorted(set(edges['min'].tolist() + edges['max'].tolist())) 125 | return {'feature': feature, 'nbins': nbins, 'edges': [-np.inf] + edges[1:-1] + [np.inf]} 126 | 127 | @staticmethod 128 | def _encode(X: pd.DataFrame, feature: str, nbins: int, edges: List[float], strategy: str) -> pd.DataFrame: 129 | """ 130 | Encodes continuous feature into a discrete bin. 131 | 132 | Args: 133 | X (pd.DataFrame): Continuous data. 134 | feature (str): Feature to be encoded. 
135 | nbins (int): Number of encoding bins. 136 | edges (list): Bin edges list. 137 | strategy (str): {'uniform', 'quantile', 'kmeans', 'gaussian'}, discretization strategy. 138 | 139 | Returns: 140 | pd.DataFrame: Encoded data. 141 | """ 142 | aux = pd.cut(X[feature], bins=edges, include_lowest=True) 143 | aux = pd.Series(np.where(aux.isnull(), 'MISSING', aux) 144 | ).to_frame().astype(str) 145 | discretized_feature_name = f'disc_{feature}_{nbins}_{strategy}' 146 | aux.columns = [discretized_feature_name] 147 | return aux 148 | 149 | def transform(self, X: pd.DataFrame, n_threads: int = 1) -> pd.DataFrame: 150 | """ 151 | Transforms continuous data into its discrete form. 152 | 153 | Args: 154 | X (pd.DataFrame): Data to be discretized. 155 | n_threads (int, optional): Number of pooling threads to speed computation. Defaults to 1. 156 | 157 | Raises: 158 | Exception: If fit method not called previously. 159 | Exception: If features analyzed during fit are not present in X. 160 | 161 | Returns: 162 | pd.DataFrame: Discretized Data. 163 | """ 164 | if not self.__is_fitted: 165 | raise Exception( 166 | 'Please call fit method first with the required parameters') 167 | 168 | aux = X.copy() 169 | features = list(set(edge['feature'] for edge in self.edges_map)) 170 | non_present_features = [f for f in features if f not in X.columns] 171 | if non_present_features: 172 | raise Exception( 173 | f"Missing features: {', '.join(non_present_features)}") 174 | 175 | encoded_data = self._make_pool( 176 | self._encode, 177 | [(X, edge_map['feature'], edge_map['nbins'], edge_map['edges'], 178 | self.strategy) for edge_map in self.edges_map], 179 | threads=n_threads 180 | ) 181 | 182 | result = reduce(lambda x, y: pd.merge( 183 | x, y, left_index=True, right_index=True, how='inner'), encoded_data).copy() 184 | return result 185 | 186 | 187 | class WoeContinuousFeatureSelector(WoeBaseFeatureSelector): 188 | """ 189 | WoeContinuousFeatureSelector is a class for selecting continuous features based on their Weight of Evidence (WoE) transformation and Information Value (IV) statistic. 190 | 191 | This class provides methods to fit a model that evaluates continuous features by discretizing them into bins, transforming them using WoE, and calculating their IV. 192 | It supports multiple discretization strategies including 'quantile', 'uniform', 'kmeans', 'gaussian', 'dcc', and 'dec'. 193 | The class can also enforce monotonic risk behavior for the selected features if required. 194 | 195 | Attributes: 196 | selected_features (Optional[List[Dict[str, Union[str, float]]]]): List of selected features with their respective IV values. 197 | __is_fitted (bool): Flag indicating whether the model has been fitted. 198 | _Xd (Optional[pd.DataFrame]): DataFrame containing the discretized features. 199 | discretizers (Optional[List[Discretizer]]): List of Discretizer objects used for discretizing the features. 200 | iv_report (Optional[pd.DataFrame]): DataFrame containing the IV report for all features. 201 | 202 | Methods: 203 | fit(X, y, method, iv_threshold, min_bins, max_bins, n_threads, strictly_monotonic): Learns the best features given an IV threshold and optional monotonic risk restriction. 204 | transform(X): Converts continuous features to their best discretization. 
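    Example:
        A minimal usage sketch (illustrative only; `df` with numeric columns
        'age' and 'income' and a binary column 'default' is assumed):

        >>> selector = WoeContinuousFeatureSelector()
        >>> selector.fit(df[['age', 'income']], df['default'],
        ...              method='quantile', iv_threshold=0.1,
        ...              min_bins=2, max_bins=5)
        >>> best_bins = selector.transform(df[['age', 'income']])
        >>> selector.iv_report.head()  # IV per candidate discretization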
205 | """ 206 | selected_features: Optional[List[Dict[str, Union[str, float]]]] = None 207 | __is_fitted: bool = False 208 | _Xd: Optional[pd.DataFrame] = None 209 | discretizers: Optional[List[Discretizer]] = None 210 | iv_report: Optional[pd.DataFrame] = None 211 | 212 | def __init__(self) -> None: 213 | super().__init__() 214 | 215 | def fit(self, X: pd.DataFrame, y: pd.Series, method: str = 'quantile', iv_threshold: float = 0.1, 216 | min_bins: int = 2, max_bins: int = 5, n_threads: int = 1, strictly_monotonic: bool = False) -> None: 217 | """ 218 | Learns the best features given an IV threshold. Monotonic risk restriction can be applied. 219 | 220 | Args: 221 | X (pd.DataFrame): Predictors data. 222 | y (pd.Series): Dichotomic response feature. 223 | method (str, optional): Discretization technique. Options are {'quantile', 'uniform', 'kmeans', 'gaussian', 'dcc', 'dec'}. 224 | Defaults to 'quantile'. 225 | iv_threshold (float, optional): IV value for a feature to be included in final selection. Defaults to 0.1. 226 | min_bins (int, optional): Minimum number of discretization bins. Defaults to 2. 227 | max_bins (int, optional): Maximum number of discretization bins. Defaults to 5. 228 | n_threads (int, optional): Number of multiprocessing threads. Defaults to 1. 229 | strictly_monotonic (bool, optional): Indicates if only monotonic risk features should be selected. Defaults to False. 230 | 231 | Raises: 232 | Exception: If strictly_monotonic=True and no monotonic feature is present in the final selection. 233 | Exception: If method is not in {'quantile', 'uniform', 'kmeans', 'gaussian', 'dcc', 'dec'}. 234 | Exception: If X is not a pandas DataFrame. 235 | Exception: If y is not a pandas Series. 236 | 237 | Reference: 238 | For more information about the dcc and dec methods please refer to the following paper: 239 | https://journals.plos.org/plosone/article/authors?id=10.1371/journal.pone.0289130 240 | """ 241 | if not isinstance(X, pd.DataFrame): 242 | raise TypeError('X must be a pandas DataFrame') 243 | if not isinstance(y, pd.Series): 244 | raise TypeError('y must be a pandas Series') 245 | 246 | cont_features = list(X.columns) 247 | methods = ['quantile', 'uniform', 'kmeans', 'gaussian'] 248 | 249 | if method not in methods + ['dcc', 'dec']: 250 | raise Exception('Invalid method, options are quantile, uniform, kmeans, gaussian, dcc and dec') 251 | 252 | if method in methods: 253 | discretizers = [Discretizer(strategy=method, min_segments=min_bins, max_segments=max_bins)] 254 | else: 255 | discretizers = [Discretizer(strategy=m, min_segments=min_bins, max_segments=max_bins) for m in methods] 256 | 257 | for disc in discretizers: 258 | disc.fit(X[cont_features], n_threads=n_threads) 259 | 260 | self.discretizers = discretizers 261 | self._Xd = pd.concat([disc.transform(X[cont_features]) for disc in discretizers], axis=1) 262 | disc_features = list(self._Xd.columns) 263 | self._Xd['binary_target'] = y 264 | 265 | if strictly_monotonic: 266 | mono = {feature: self._check_monotonic(self._Xd[feature], self._Xd['binary_target']) for feature in disc_features} 267 | mono = {x: y for x, y in mono.items() if y} 268 | if not mono: 269 | raise Exception('There is no monotonic feature.\n Please try turning strictly_monotonic parameter to False or increase the number of bins') 270 | disc_features = list(mono.keys()) 271 | 272 | iv = [(feature, self._information_value(self._Xd[feature], self._Xd['binary_target'])) for feature in disc_features] 273 | self.iv_report = pd.DataFrame(iv, 
columns=['feature', 'iv']).dropna().reset_index(drop=True) 274 | self.iv_report['relevant'] = self.iv_report['iv'] >= iv_threshold 275 | 276 | self.iv_report['root_feature'] = self.iv_report['feature'].apply(lambda x: "_".join(x.split('_')[1:-2])) 277 | self.iv_report['nbins'] = self.iv_report['feature'].apply(lambda x: x.split('_')[-2]) 278 | self.iv_report['method'] = self.iv_report['feature'].apply(lambda x: x.split('_')[-1]) 279 | 280 | sort_columns = ['root_feature', 'iv', 'nbins'] if method in methods + ['dcc'] else ['root_feature', 'method', 'iv', 'nbins'] 281 | self.iv_report = self.iv_report.sort_values(by=sort_columns, ascending=[True, False, True] if method in methods + ['dcc'] else [True, True, False, True]).reset_index(drop=True) 282 | self.iv_report['index'] = self.iv_report.groupby('root_feature').cumcount() + 1 if method in methods + ['dcc'] else self.iv_report.groupby(['root_feature', 'method']).cumcount() + 1 283 | 284 | self.iv_report = self.iv_report.loc[self.iv_report['index'] == 1].reset_index(drop=True) 285 | self.iv_report['selected'] = self.iv_report['feature'].isin(self.iv_report['feature']) 286 | self.iv_report = self.iv_report.sort_values(by=['selected', 'relevant'], ascending=[False, False]) 287 | cont_features = list(set(self.iv_report.loc[self.iv_report['relevant']]['root_feature'])) 288 | if len(cont_features) == 0: 289 | raise Exception('No relevant feature found. Please try increasing the number of bins or changing the discretization method') 290 | for disc in self.discretizers: 291 | disc.fit(X[cont_features], n_threads=n_threads) 292 | self.selected_features =self.iv_report[self.iv_report['relevant']].drop('index', axis=1).to_dict(orient='records') 293 | self.__is_fitted = True 294 | 295 | def transform(self, X: pd.DataFrame) -> pd.DataFrame: 296 | """ 297 | Converts continuous features to their best discretization. 298 | 299 | Args: 300 | X (pd.DataFrame): Continuous predictors data. 301 | 302 | Raises: 303 | Exception: If fit method is not called first. 304 | Exception: If a fitted feature is not present in data. 305 | Exception: If X is not a pandas DataFrame. 306 | 307 | Returns: 308 | pd.DataFrame: Best discretization transformed data. 309 | """ 310 | if not self.__is_fitted: 311 | raise Exception( 312 | 'Please call fit method first with the required parameters') 313 | 314 | if not isinstance(X, pd.DataFrame): 315 | raise TypeError('X must be a pandas DataFrame') 316 | 317 | aux = X.copy() 318 | features = list(set([feature['root_feature'] 319 | for feature in self.selected_features])) 320 | non_present_features = [f for f in features if f not in X.columns] 321 | 322 | if non_present_features: 323 | logger.exception(f'{", ".join(non_present_features)} feature{"s" if len(non_present_features) > 1 else ""} not present in data') 324 | raise Exception("Missing features") 325 | 326 | aux = pd.concat([disc.transform(X[features]) 327 | for disc in self.discretizers], axis=1) 328 | aux = aux[[feature['feature'] for feature in self.selected_features]] 329 | return aux 330 | 331 | 332 | class WoeDiscreteFeatureSelector(WoeBaseFeatureSelector): 333 | """ 334 | WoeDiscreteFeatureSelector is a class for selecting discrete features based on their Weight of Evidence (WoE) 335 | transformation and Information Value (IV) statistic. This class inherits from WoeBaseFeatureSelector and provides 336 | methods to fit the model to the data and transform the data by keeping only the selected features. 
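    A minimal usage sketch (illustrative only; `df` with categorical columns
    'JOB' and 'REASON' and a binary column 'default' is assumed):

        >>> selector = WoeDiscreteFeatureSelector()
        >>> selector.fit(df[['JOB', 'REASON']], df['default'], iv_threshold=0.1)
        >>> selected = selector.transform(df[['JOB', 'REASON']])
        >>> selector.selected_features  # mapping of selected feature -> IV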
337 | 338 | The fit method evaluates each feature's predictive power by calculating its IV and selects features that meet 339 | a specified IV threshold. The transform method then filters the dataset to include only these selected features. 340 | 341 | Attributes: 342 | iv_report (pd.DataFrame): A DataFrame containing the IV values and selection status of each feature. 343 | selected_features (dict[str, float]): A dictionary of selected features and their corresponding IV values. 344 | __is_fitted (bool): A flag indicating whether the fit method has been called. 345 | """ 346 | iv_report: pd.DataFrame = None 347 | 348 | def __init__(self) -> None: 349 | super().__init__() 350 | 351 | def fit(self, X: pd.DataFrame, y: pd.Series, iv_threshold: float = 0.1) -> None: 352 | """Learns best features given an IV threshold. 353 | 354 | Args: 355 | X (pd.DataFrame): Discrete predictors data 356 | y (pd.Series): Dichotomic response feature 357 | iv_threshold (float, optional): IV value for a feature to be included in final selection. Defaults to 0.1. 358 | """ 359 | disc_features: list[str] = list(X.columns) 360 | aux: pd.DataFrame = X.copy() 361 | aux['binary_target'] = y 362 | iv: list[tuple[str, float]] = [(feature, self._information_value( 363 | aux[feature], aux['binary_target'])) for feature in disc_features] 364 | self.iv_report = pd.DataFrame(iv, columns=['feature', 'iv']).dropna().reset_index(drop=True) 365 | self.iv_report['selected'] = self.iv_report['iv'] >= iv_threshold 366 | self.iv_report = self.iv_report.sort_values('selected', ascending=False) 367 | disc_features = list(self.iv_report.loc[self.iv_report['selected']]['feature']) 368 | if len(disc_features) == 0: 369 | raise Exception( 370 | 'No relevant feature found. Please try increasing the IV threshold') 371 | self.selected_features: dict[str, float] =self.iv_report.loc[self.iv_report['selected']].set_index('feature')[ 372 | 'iv'].to_dict() 373 | self.__is_fitted: bool = True 374 | 375 | def transform(self, X: pd.DataFrame) -> pd.DataFrame: 376 | """Transforms data keeping only the selected features 377 | 378 | Args: 379 | X (pd.DataFrame): Discrete predictors data 380 | 381 | Raises: 382 | Exception: If fit method is not called first. 383 | Exception: If a fitted feature is not present in data. 384 | 385 | Returns: 386 | pd.DataFrame: Data containing best discrete features 387 | """ 388 | if not self.__is_fitted: 389 | raise Exception( 390 | 'Please call fit method first with the required parameters') 391 | else: 392 | aux: pd.DataFrame = X.copy() 393 | features: list[str] = [ 394 | feature for feature in self.selected_features.keys()] 395 | non_present_features: list[str] = [ 396 | f for f in features if f not in X.columns] 397 | if len(non_present_features) > 0: 398 | logger.exception( 399 | f'{",".join(non_present_features)} feature{"s" if len(non_present_features) > 1 else ""} not present in data') 400 | raise Exception("Missing features") 401 | else: 402 | aux = aux[features] 403 | return aux 404 | 405 | 406 | class IVCalculator: 407 | """ 408 | A class to calculate the Information Value (IV) for both discrete and continuous features. 409 | It provides a simple interface that abstracts away the manual steps of discretization and normalization. 410 | 411 | Example: 412 | >>> from woe_credit_scoring import IVCalculator 413 | >>> import pandas as pd 414 | >>> data = pd.read_csv('example_data/hmeq.csv') 415 | >>> iv_calculator = IVCalculator( 416 | ... data=data, 417 | ... target='BAD', 418 | ... 
continuous_features=['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC'], 419 | ... discrete_features=['REASON', 'JOB'] 420 | ... ) 421 | >>> iv_report = iv_calculator.calculate_iv() 422 | >>> print(iv_report) 423 | 424 | """ 425 | def __init__(self, data: pd.DataFrame, target: str, continuous_features: List[str] = None, discrete_features: List[str] = None): 426 | """ 427 | Initializes the IVCalculator object. 428 | 429 | Args: 430 | data (pd.DataFrame): The input data containing features and target. 431 | target (str): The target variable name. 432 | continuous_features (List[str], optional): List of continuous feature names. Defaults to None. 433 | discrete_features (List[str], optional): List of discrete feature names. Defaults to None. 434 | """ 435 | if not isinstance(data, pd.DataFrame): 436 | raise TypeError("data must be a pandas DataFrame.") 437 | if target not in data.columns: 438 | raise ValueError(f"Target column '{target}' not found in the DataFrame.") 439 | 440 | self.data = data 441 | self.target = target 442 | self.continuous_features = continuous_features if continuous_features is not None else [] 443 | self.discrete_features = discrete_features if discrete_features is not None else [] 444 | 445 | if not self.continuous_features and not self.discrete_features: 446 | logger.warning("No continuous or discrete features provided.") 447 | 448 | def calculate_iv(self, 449 | max_discretization_bins: int = 5, 450 | strictly_monotonic: bool = False, 451 | discretization_method: str = 'quantile', 452 | n_threads: int = 1, 453 | discrete_normalization_threshold: float = 0.05, 454 | discrete_normalization_default_category: str = 'OTHER' 455 | ) -> pd.DataFrame: 456 | """ 457 | Calculates the Information Value (IV) for the provided features. 458 | 459 | Args: 460 | max_discretization_bins (int, optional): The maximum number of bins for discretization. Defaults to 5. 461 | strictly_monotonic (bool, optional): Whether to enforce strictly monotonic WoE transformation for continuous features. Defaults to False. 462 | discretization_method (str, optional): The method for discretization ('quantile', 'uniform', 'kmeans', 'gaussian', 'dcc', 'dec'). Defaults to 'quantile'. 463 | n_threads (int, optional): The number of threads to use for parallel processing. Defaults to 1. 464 | discrete_normalization_threshold (float, optional): The threshold for discrete feature normalization. Defaults to 0.05. 465 | discrete_normalization_default_category (str, optional): The default category for discrete feature normalization. Defaults to 'OTHER'. 466 | 467 | Returns: 468 | pd.DataFrame: A DataFrame containing the IV report for all features, sorted by IV in descending order. 
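        Example:
            A minimal usage sketch (illustrative only; assumes `iv_calculator`
            was constructed as in the class-level example above):

            >>> iv_report = iv_calculator.calculate_iv(
            ...     max_discretization_bins=5,
            ...     discretization_method='quantile',
            ...     n_threads=2
            ... )
            >>> iv_report.head()  # columns: 'feature', 'iv', 'feature_type'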
469 | """ 470 | iv_reports = [] 471 | 472 | if self.continuous_features: 473 | logger.info("Calculating IV for continuous features...") 474 | woe_continuous_selector = WoeContinuousFeatureSelector() 475 | try: 476 | woe_continuous_selector.fit( 477 | self.data[self.continuous_features], 478 | self.data[self.target], 479 | max_bins=max_discretization_bins, 480 | strictly_monotonic=strictly_monotonic, 481 | iv_threshold=-np.inf, # Using a very low threshold to get IV for all features 482 | method=discretization_method, 483 | n_threads=n_threads 484 | ) 485 | iv_report_continuous = woe_continuous_selector.iv_report 486 | iv_report_continuous = iv_report_continuous[['root_feature', 'iv']].rename(columns={'root_feature': 'feature'}) 487 | iv_report_continuous['feature_type'] = 'continuous' 488 | iv_reports.append(iv_report_continuous) 489 | logger.info("IV for continuous features calculated successfully.") 490 | except Exception as e: 491 | logger.error(f"Could not calculate IV for continuous features. Error: {e}") 492 | 493 | if self.discrete_features: 494 | logger.info("Calculating IV for discrete features...") 495 | try: 496 | dn = DiscreteNormalizer( 497 | normalization_threshold=discrete_normalization_threshold, 498 | default_category=discrete_normalization_default_category 499 | ) 500 | dn.fit(self.data[self.discrete_features]) 501 | normalized_discrete_data = dn.transform(self.data[self.discrete_features]) 502 | 503 | woe_discrete_selector = WoeDiscreteFeatureSelector() 504 | woe_discrete_selector.fit( 505 | normalized_discrete_data, 506 | self.data[self.target], 507 | iv_threshold=-np.inf # Using a very low threshold to get IV for all features 508 | ) 509 | iv_report_discrete = woe_discrete_selector.iv_report[['feature', 'iv']] 510 | iv_report_discrete['feature_type'] = 'discrete' 511 | iv_reports.append(iv_report_discrete) 512 | logger.info("IV for discrete features calculated successfully.") 513 | except Exception as e: 514 | logger.error(f"Could not calculate IV for discrete features. Error: {e}") 515 | 516 | if not iv_reports: 517 | logger.warning("IV calculation did not produce any results.") 518 | return pd.DataFrame(columns=['feature', 'iv', 'feature_type']) 519 | 520 | final_iv_report = pd.concat(iv_reports, axis=0).sort_values('iv', ascending=False).reset_index(drop=True) 521 | return final_iv_report 522 | -------------------------------------------------------------------------------- /woe_credit_scoring/autocreditscoring.py: -------------------------------------------------------------------------------- 1 | from typing import List, Optional 2 | import numpy as np 3 | import pandas as pd 4 | from scipy.stats.mstats import winsorize 5 | from sklearn.linear_model import LogisticRegression 6 | from sklearn.model_selection import train_test_split 7 | from sklearn.metrics import roc_auc_score, roc_curve, auc 8 | import matplotlib.pyplot as plt 9 | import seaborn as sns 10 | from collections import ChainMap 11 | import os 12 | import logging 13 | from .normalizer import DiscreteNormalizer 14 | from .binning import WoeContinuousFeatureSelector, WoeDiscreteFeatureSelector 15 | from .encoder import WoeEncoder 16 | from .scoring import CreditScoring 17 | 18 | logger = logging.getLogger("CreditScoringToolkit") 19 | 20 | class AutoCreditScoring: 21 | """ 22 | A class used to perform automated credit scoring using logistic regression and Weight of Evidence (WoE) transformation. 23 | Attributes 24 | ---------- 25 | continuous_features : List[str] 26 | List of continuous feature names. 
27 | discrete_features : List[str] 28 | List of discrete feature names. 29 | target : str 30 | The target variable name. 31 | data : pd.DataFrame 32 | The input data containing features and target. 33 | train : pd.DataFrame 34 | The training dataset. 35 | valid : pd.DataFrame 36 | The validation dataset. 37 | apply_multicolinearity : bool, optional 38 | Whether to apply multicollinearity treatment (default is False). 39 | iv_feature_threshold : float, optional 40 | The Information Value (IV) threshold for feature selection (default is 0.05). 41 | treat_outliers : bool, optional 42 | Whether to treat outliers in continuous features (default is False). 43 | outlier_threshold : float, optional 44 | The threshold for outlier treatment (default is 0.01). 45 | min_score : int, optional 46 | The minimum score for the credit scoring model (default is 400). 47 | max_score : int, optional 48 | The maximum score for the credit scoring model (default is 900). 49 | max_discretization_bins : int, optional 50 | The maximum number of bins for discretization (default is 5). 51 | discrete_normalization_threshold : float, optional 52 | The threshold for discrete feature normalization (default is 0.05). 53 | discrete_normalization_default_category : str, optional 54 | The default category for discrete feature normalization (default is 'OTHER'). 55 | transformation : Optional[str], optional 56 | The transformation method to be applied (default is None). 57 | model : Optional[LogisticRegression], optional 58 | The logistic regression model (default is None). 59 | max_iter : int, optional 60 | The maximum number of iterations for partitioning data (default is 5). 61 | train_size : float, optional 62 | The proportion of data to be used for training (default is 0.7). 63 | target_proportion_tolerance : float, optional 64 | The tolerance for target proportion difference between train and valid datasets (default is 0.01). 65 | strictly_monotonic : bool, optional 66 | Whether to enforce strictly monotonic WoE transformation (default is True). 67 | discretization_method : str, optional 68 | The method for discretization (default is 'quantile'). 69 | n_threads : int, optional 70 | The number of threads to use for parallel processing (default is 1). 71 | overfitting_tolerance : float, optional 72 | The tolerance for overfitting detection (default is 0.01). 73 | create_reporting : bool, optional 74 | Whether to create reporting after model fitting (default is False). 75 | is_fitted : bool, optional 76 | Whether the model has been fitted (default is False). 77 | Methods 78 | ------- 79 | __init__(self, data: pd.DataFrame, target: str, continuous_features: List[str]=None, discrete_features: List[str]=None) 80 | Initializes the AutoCreditScoring object with data, target, and feature lists. 81 | fit(self, target_proportion_tolerance: float = None, treat_outliers: bool = None, discrete_normalization_threshold: float = None, discrete_normalization_default_category: str = None, max_discretization_bins: int = None, strictly_monotonic: bool = None, iv_feature_threshold: float = None, discretization_method: str = None, n_threads: int = None, overfitting_tolerance: float = None, min_score: int = None, max_score: int = None, create_reporting: bool = None, verbose: bool = False) 82 | Fits the credit scoring model to the data with optional parameters for customization. 83 | __partition_data(self) 84 | Partitions the data into training and validation sets while ensuring target proportion compatibility. 
85 | __outlier_treatment(self) 86 | Applies outlier treatment to continuous features in the training dataset. 87 | __normalize_discrete(self) 88 | Normalizes discrete features in the training dataset. 89 | __feature_selection(self) 90 | Performs feature selection based on Information Value (IV) for continuous and discrete features. 91 | __woe_transformation(self) 92 | Applies Weight of Evidence (WoE) transformation to the selected features. 93 | __apply_pipeline(self, data: pd.DataFrame) -> pd.DataFrame 94 | Applies the entire preprocessing and transformation pipeline to new data. 95 | __train_model(self) 96 | Trains the logistic regression model on the transformed training data. 97 | __scoring(self) 98 | Generates credit scores for the training and validation datasets. 99 | __reporting(self) 100 | Creates various reports and visualizations for model evaluation and interpretation. 101 | save_reports(self, folder: str = '.') 102 | Saves the generated reports and visualizations to the specified folder. 103 | predict(self, X: pd.DataFrame) -> pd.DataFrame 104 | Predicts scores for a given raw dataset. 105 | fit_predict(self, **kwargs) -> pd.DataFrame 106 | Fits the model and returns the scores for the entire dataset. 107 | """ 108 | continuous_features: List[str] 109 | discrete_features: List[str] 110 | target: str 111 | data: pd.DataFrame 112 | train: pd.DataFrame 113 | valid: pd.DataFrame 114 | iv_feature_threshold: float = 0.05 115 | treat_outliers: bool = False 116 | outlier_threshold: float = 0.01 117 | min_score = 400 118 | max_score = 900 119 | max_discretization_bins = 5 120 | discrete_normalization_threshold = 0.05 121 | discrete_normalization_default_category = 'OTHER' 122 | transformation: Optional[str] = None 123 | model: Optional[LogisticRegression] = None 124 | max_iter: int = 5 125 | train_size: float = 0.7 126 | target_proportion_tolerance: float = 0.01 127 | max_discretization_bins:int=6 128 | strictly_monotonic:bool=True 129 | discretization_method:str = 'quantile' 130 | n_threads:int = 1 131 | overfitting_tolerance:float = 0.01 132 | create_reporting:bool = False 133 | is_fitted:bool = False 134 | 135 | def __init__(self, data: pd.DataFrame, target: str, continuous_features: List[str]=None, discrete_features: List[str]=None): 136 | self.data = data 137 | self.continuous_features = continuous_features 138 | self.discrete_features = discrete_features 139 | self.target = target 140 | 141 | def fit(self, 142 | target_proportion_tolerance:float = None, 143 | train_proportion:float = None, 144 | treat_outliers:bool = None, 145 | discrete_normalization_threshold:float = None, 146 | discrete_normalization_default_category:str = None, 147 | max_discretization_bins:int = None, 148 | strictly_monotonic:bool = None, 149 | iv_feature_threshold:float = None, 150 | discretization_method:str = None, 151 | n_threads:int = None, 152 | overfitting_tolerance:float = None, 153 | min_score:int = None, 154 | max_score:int = None, 155 | create_reporting:bool = None, 156 | verbose:bool=False): 157 | 158 | #Train proportion control 159 | if train_proportion is not None: 160 | self.train_size = train_proportion 161 | 162 | # Verbosity control 163 | if verbose: 164 | logger.setLevel(logging.INFO) 165 | else: 166 | logger.setLevel(logging.WARNING) 167 | # Check if continuous_features is provided 168 | if self.continuous_features is None: 169 | self.continuous_features = [] 170 | logger.warning("No continuous features provided") 171 | # Check if discrete_features is provided 172 | if 
self.discrete_features is None: 173 | self.discrete_features = [] 174 | logger.warning("No discrete features provided") 175 | if len(self.continuous_features)==0 and len(self.discrete_features)==0: 176 | logger.error("No features provided") 177 | raise RuntimeError("No features provided") 178 | 179 | # Check if target_proportion_tolerance is provided 180 | if target_proportion_tolerance is not None: 181 | self.target_proportion_tolerance = target_proportion_tolerance 182 | # Partition data 183 | self.__partition_data() 184 | 185 | #Check if treat_outliers is provided 186 | if len(self.continuous_features)>0 and treat_outliers is not None: 187 | self.treat_outliers = treat_outliers 188 | self.__outlier_treatment() 189 | 190 | # Check if discrete_normalization_threshold is provided 191 | if discrete_normalization_threshold is not None: 192 | self.discrete_normalization_threshold = discrete_normalization_threshold 193 | # Check if discrete_normalization_default_category is provided 194 | if discrete_normalization_default_category is not None: 195 | self.discrete_normalization_default_category = discrete_normalization_default_category 196 | if len(self.discrete_features)==0: 197 | logger.warning("No discrete features provided") 198 | else: 199 | if len(self.discrete_features)>0: 200 | # Normalize discrete features 201 | self.__normalize_discrete() 202 | 203 | #Check feature selection parameters 204 | if max_discretization_bins is not None: 205 | self.max_discretization_bins = max_discretization_bins 206 | if strictly_monotonic is not None: 207 | self.strictly_monotonic = strictly_monotonic 208 | if iv_feature_threshold is not None: 209 | self.iv_feature_threshold = iv_feature_threshold 210 | if discretization_method is not None: 211 | self.discretization_method = discretization_method 212 | if n_threads is not None: 213 | self.n_threads = n_threads 214 | 215 | # Feature selection 216 | self.__feature_selection() 217 | 218 | # Woe transformation 219 | self.__woe_transformation() 220 | 221 | # Check if overfitting_tolerance is provided 222 | if overfitting_tolerance is not None: 223 | self.overfitting_tolerance = overfitting_tolerance 224 | # Train model 225 | self.__train_model() 226 | 227 | # Check if min_score is provided 228 | if min_score is not None: 229 | self.min_score = min_score 230 | # Check if max_score is provided 231 | if max_score is not None : 232 | self.max_score = max_score 233 | # Check if min_score is less than max_score 234 | if self.min_score>=self.max_score: 235 | logger.error("min_score should be less than max_score") 236 | raise RuntimeError("min_score should be less than max_score") 237 | # Scoring 238 | self.__scoring() 239 | 240 | # Check if create_reporting is provided 241 | if create_reporting is not None: 242 | self.create_reporting = create_reporting 243 | # Reporting 244 | if self.create_reporting: 245 | self.__reporting() 246 | self.is_fitted = True 247 | 248 | def predict(self, X: pd.DataFrame) -> pd.DataFrame: 249 | """ 250 | Predicts scores for a given raw dataset. 251 | 252 | The input data should have the same features as the training data. 253 | The method applies the same pipeline of transformations as used during training. 254 | 255 | Args: 256 | X (pd.DataFrame): Raw data to be scored. 257 | 258 | Returns: 259 | pd.DataFrame: A DataFrame with scores and feature contributions. 260 | 261 | Raises: 262 | Exception: If the model is not fitted yet. 263 | ValueError: If the input data is missing required features. 
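        Example:
            A minimal usage sketch (illustrative only; `train_df` and `new_df`
            are assumed pandas DataFrames sharing the same feature columns):

            >>> acs = AutoCreditScoring(
            ...     data=train_df,
            ...     target='BAD',
            ...     continuous_features=['LOAN', 'VALUE'],
            ...     discrete_features=['JOB']
            ... )
            >>> acs.fit(min_score=300, max_score=850)
            >>> scores = acs.predict(new_df)
            >>> scores[['score', 'range_score_10']].head()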
264 | """ 265 | if not self.is_fitted: 266 | raise Exception("This AutoCreditScoring instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.") 267 | 268 | required_features = self.continuous_features + self.discrete_features 269 | missing_features = [f for f in required_features if f not in X.columns] 270 | if missing_features: 271 | raise ValueError(f"The following required columns are missing from the input data: {', '.join(missing_features)}") 272 | 273 | aux = X.copy() 274 | 275 | # Apply the full pipeline 276 | data_woe = self.__apply_pipeline(aux) 277 | 278 | # Inverse transform to get discrete bins 279 | data_discrete_binned = self.woe_encoder.inverse_transform(data_woe) 280 | 281 | # Get scores 282 | scored_data = self.credit_scoring.transform(data_discrete_binned) 283 | 284 | rename_dict = { 285 | binned_name: f"pts_{original_name}" 286 | for binned_name, original_name in self.feature_name_mapping.items() 287 | if binned_name in scored_data.columns 288 | } 289 | scored_data.rename(columns=rename_dict, inplace=True) 290 | 291 | scored_data['score'] = scored_data['score'].astype(float) 292 | 293 | # Clip scores to the defined range 294 | scored_data['score'] = scored_data['score'].clip(self.min_score, self.max_score) 295 | 296 | # Add score ranges 297 | for k in [5, 10]: 298 | step = (self.max_score - self.min_score) / k 299 | bins = np.arange(self.min_score, self.max_score + step, step) 300 | scored_data[f'range_score_{k}'] = pd.cut(scored_data['score'], bins=bins, include_lowest=True) 301 | 302 | return scored_data 303 | 304 | def __partition_data(self): 305 | logger.info("Partitioning data...") 306 | self.train, self.valid = train_test_split(self.data, train_size=self.train_size) 307 | self.train.reset_index(drop=True, inplace=True) 308 | self.valid.reset_index(drop=True, inplace=True) 309 | # Check if target proportions are compatible between train and valid 310 | logger.info("Checking partition proportions...") 311 | iter = 1 312 | while(np.abs(self.train[self.target].mean()-self.valid[self.target].mean())>self.target_proportion_tolerance): 313 | logger.info(f"Partitioning data...Iteration {iter}") 314 | logger.info(f"Train target proportion: {self.train[self.target].mean()}") 315 | logger.info(f"Valid target proportion: {self.valid[self.target].mean()}") 316 | self.train, self.valid = train_test_split(self.data, train_size=self.train_size) 317 | self.train.reset_index(drop=True, inplace=True) 318 | self.valid.reset_index(drop=True, inplace=True) 319 | iter+=1 320 | if iter>self.max_iter: 321 | logger.error("Could not find a compatible partition") 322 | raise RuntimeError("Could not find a compatible partition") 323 | 324 | if iter>1: 325 | logger.info(f"Partitioning data...Done after {iter} iterations") 326 | logger.info(f"Train shape: {self.train.shape}", ) 327 | logger.info(f"Test shape: {self.valid.shape}") 328 | logger.info(f"Train target proportion: {self.train[self.target].mean()}") 329 | logger.info(f"Valid target proportion: {self.valid[self.target].mean()}") 330 | 331 | def __outlier_treatment(self): 332 | logger.info("Outlier treatment...") 333 | before = self.train[self.continuous_features].mean() 334 | for f in self.continuous_features: 335 | self.train[f] = winsorize(self.train[f], limits=[self.outlier_threshold, self.outlier_threshold]) 336 | after = self.train[self.continuous_features].mean() 337 | report = pd.DataFrame({'Before':before,'After':after}) 338 | logger.info("Mean statistics before and after outlier treatment") 339 
| logger.info(f'\n\n{report}\n') 340 | logger.info("Outlier treatment...Done") 341 | 342 | def __normalize_discrete(self): 343 | logger.info("Discrete normalization...") 344 | logger.info(f"Discrete features: {self.discrete_features}") 345 | dn = DiscreteNormalizer(normalization_threshold=self.discrete_normalization_threshold, 346 | default_category=self.discrete_normalization_default_category) 347 | dn.fit(self.train[self.discrete_features]) 348 | self.train_discrete_normalized = dn.transform(self.train[self.discrete_features]) 349 | logger.info("Checking if normalization produced unary columns") 350 | self.unary_columns = [c for c in self.train_discrete_normalized.columns if self.train_discrete_normalized[c].nunique()==1] 351 | if len(self.unary_columns)>0: 352 | logger.warning(f"Normalization produced unary columns: {self.unary_columns}") 353 | logger.warning(f"Removing unary columns from discrete features") 354 | self.discrete_features = [f for f in self.discrete_features if f not in self.unary_columns] 355 | logger.warning(f"Discrete features after unary columns removal: {self.discrete_features}") 356 | else: 357 | logger.info("No unary columns produced by normalization") 358 | if len(self.discrete_features)==0: 359 | logger.warning("No discrete features left after normalization") 360 | else: 361 | dn.fit(self.train[self.discrete_features]) 362 | self.train_discrete_normalized = dn.transform(self.train[self.discrete_features]) 363 | self.discrete_normalizer = dn 364 | logger.info("Discrete normalization...Done") 365 | 366 | def __feature_selection(self): 367 | try: 368 | logger.info("Feature selection...") 369 | if len(self.continuous_features)>0: 370 | logger.info("Continuous features selection...") 371 | woe_continuous_selector = WoeContinuousFeatureSelector() 372 | woe_continuous_selector.fit(self.train[self.continuous_features], self.train[self.target], 373 | max_bins=self.max_discretization_bins, 374 | strictly_monotonic=self.strictly_monotonic, 375 | iv_threshold=self.iv_feature_threshold, 376 | method=self.discretization_method, 377 | n_threads=self.n_threads) 378 | self.iv_report_continuous = pd.DataFrame(woe_continuous_selector.selected_features) 379 | self.full_iv_report_continuous = woe_continuous_selector.iv_report.copy() 380 | self.continuous_candidate = woe_continuous_selector.transform(self.train[self.continuous_features]) 381 | logger.info(f'\n\n{self.iv_report_continuous}\n\n') 382 | self.woe_continuous_selector = woe_continuous_selector 383 | logger.info(f"Continuous features selection...Done") 384 | if len(self.discrete_features)>0: 385 | logger.info("Discrete features selection...") 386 | woe_discrete_selector = WoeDiscreteFeatureSelector() 387 | woe_discrete_selector.fit(self.train_discrete_normalized, self.train[self.target],self.iv_feature_threshold) 388 | self.iv_report_discrete = pd.Series(woe_discrete_selector.selected_features).to_frame('iv').reset_index().rename(columns={'index':'feature'}).sort_values('iv',ascending=False) 389 | self.full_iv_report_discrete = woe_discrete_selector.iv_report.copy() 390 | self.discrete_candidate = woe_discrete_selector.transform(self.train_discrete_normalized) 391 | logger.info(f'\n\n{self.iv_report_discrete}\n\n') 392 | self.woe_discrete_selector = woe_discrete_selector 393 | logger.info("Discrete features selection...Done") 394 | 395 | if len(self.continuous_features)>0 and len(self.discrete_features)>0: 396 | logger.info("Merging continuous and discrete features...") 397 | self.train_candidate = 
pd.concat([self.continuous_candidate, self.discrete_candidate], axis=1) 398 | logger.info("Merging continuous and discrete features...Done") 399 | elif len(self.continuous_features)>0: 400 | self.train_candidate = self.continuous_candidate 401 | elif len(self.discrete_features)>0: 402 | self.train_candidate = self.discrete_candidate 403 | self.candidate_features = list(self.train_candidate.columns) 404 | if len(self.candidate_features)==0: 405 | logger.error("No features selected") 406 | raise RuntimeError("No features selected") 407 | logger.info(f"Selected features ({len(self.candidate_features)}): {self.candidate_features}") 408 | 409 | self.feature_name_mapping = {} 410 | if len(self.continuous_features)>0 and hasattr(self, 'iv_report_continuous'): 411 | self.feature_name_mapping.update(self.iv_report_continuous.set_index('feature')['root_feature'].to_dict()) 412 | 413 | if len(self.discrete_features)>0 and hasattr(self, 'woe_discrete_selector') and self.woe_discrete_selector.selected_features: 414 | self.feature_name_mapping.update({f: f for f in self.woe_discrete_selector.selected_features.keys()}) 415 | 416 | logger.info("Feature selection...Done") 417 | except Exception as err: 418 | logger.error(f"Error in feature selection: {err}") 419 | raise err 420 | 421 | def __woe_transformation(self): 422 | self.woe_encoder = WoeEncoder() 423 | self.woe_encoder.fit(self.train_candidate, self.train[self.target]) 424 | self.train_woe = self.woe_encoder.transform(self.train_candidate) 425 | if self.train_woe.isna().max().max(): 426 | logger.error("NAs found in transformed data") 427 | raise RuntimeError("NAs found in transformed data, Maybe tiny missing in continuous?") 428 | 429 | def __apply_pipeline(self,data:pd.DataFrame)->pd.DataFrame: 430 | try: 431 | if len(self.continuous_features)>0: 432 | if self.treat_outliers: 433 | for f in self.continuous_features: 434 | data[f] = winsorize(data[f], limits=[self.outlier_threshold, self.outlier_threshold]) 435 | data_continuous_candidate = self.woe_continuous_selector.transform(data[self.continuous_features]) 436 | if len(self.discrete_features)>0: 437 | data_discrete_normalized = self.discrete_normalizer.transform(data[self.discrete_features]) 438 | data_discrete_candidate = self.woe_discrete_selector.transform(data_discrete_normalized) 439 | if len(self.continuous_features)>0 and len(self.discrete_features)==0: 440 | data_candidate = data_continuous_candidate.copy() 441 | if len(self.continuous_features)==0 and len(self.discrete_features)>0: 442 | data_candidate = data_discrete_candidate.copy() 443 | if len(self.continuous_features)>0 and len(self.discrete_features)>0: 444 | data_candidate = pd.concat([data_continuous_candidate, data_discrete_candidate], axis=1) 445 | data_woe = self.woe_encoder.transform(data_candidate) 446 | if data_woe.isna().max().max(): 447 | logger.error("NAs found in transformed data") 448 | raise RuntimeError("NAs found in transformed data, Maybe tiny missing in continuous?") 449 | return data_woe 450 | except Exception as err: 451 | logger.error(f"Error applying pipeline: {err}") 452 | raise err 453 | 454 | def __train_model(self): 455 | logger.info("Training model...") 456 | lr = LogisticRegression() 457 | lr.fit(self.train_woe,self.train[self.target]) 458 | self.model = lr 459 | self.valid_woe = self.__apply_pipeline(self.valid) 460 | self.auc_train = roc_auc_score(y_score=lr.predict_proba(self.train_woe)[:,1],y_true=self.train[self.target]) 461 | self.auc_valid = 
roc_auc_score(y_score=lr.predict_proba(self.valid_woe)[:,1],y_true=self.valid[self.target]) 462 | logger.info(f"AUC for training: {self.auc_train}") 463 | logger.info(f"AUC for validation:{self.auc_valid}") 464 | self.betas = lr.coef_[0] 465 | self.alpha = lr.intercept_[0] 466 | if any([np.abs(b)<0.0001 for b in self.betas]): 467 | logger.warning("Some betas are close to zero, consider removing features") 468 | logger.warning(f"Betas: {dict(zip(self.candidate_features,self.betas))}") 469 | logger.warning(f"Suspicious features: {[f for f,b in zip(self.candidate_features,self.betas) if np.abs(b)<0.0001]}") 470 | if abs(self.auc_train-self.auc_valid)>self.overfitting_tolerance: 471 | logger.warning(f"Overfitting detected, review your hyperparameters. train_auc: {self.auc_train}, valid_auc: {self.auc_valid}") 472 | self.logistic_model = lr 473 | logger.info("Training model...Done") 474 | 475 | def __scoring(self): 476 | logger.info("Scoring...") 477 | cs = CreditScoring() 478 | cs.fit(self.train_woe, self.woe_encoder, self.logistic_model) 479 | self.credit_scoring = cs 480 | 481 | # Get original scores to find min/max for scaling 482 | scored_train_orig = self.credit_scoring.transform(self.woe_encoder.inverse_transform(self.train_woe)) 483 | scored_valid_orig = self.credit_scoring.transform(self.woe_encoder.inverse_transform(self.valid_woe)) 484 | 485 | self.min_output_score = min(scored_train_orig['score'].min(), scored_valid_orig['score'].min()) 486 | self.max_output_score = max(scored_train_orig['score'].max(), scored_valid_orig['score'].max()) 487 | 488 | logger.info(f"Min output score: {self.min_output_score}") 489 | logger.info(f"Max output score: {self.max_output_score}") 490 | logger.info(f"Linear transformation to a {self.min_score}-{self.max_score} scale") 491 | 492 | n = self.credit_scoring.n 493 | 494 | if self.max_output_score == self.min_output_score: 495 | logger.warning("All scores are the same, cannot apply linear transformation. 
Setting all scores to the average of min_score and max_score.") 496 | avg_score = (self.min_score + self.max_score) / 2 497 | self.credit_scoring.scorecard['points'] = np.floor(avg_score / n).astype(int) 498 | else: 499 | # Scaling parameters 500 | a = (self.max_score - self.min_score) / (self.max_output_score - self.min_output_score) 501 | b = self.min_score - a * self.min_output_score 502 | # Update scorecard points 503 | self.credit_scoring.scorecard['points'] = np.floor(a * self.credit_scoring.scorecard['points'] + b / n).astype(int) 504 | 505 | # Update scoring_map from the updated scorecard 506 | self.credit_scoring.scoring_map = dict(ChainMap(*[{f: d[['attribute', 'points']].set_index('attribute')['points'].to_dict()} for f, d in self.credit_scoring.scorecard.reset_index().groupby('feature')])) 507 | 508 | # Recalculate scores with the updated scorecard 509 | self.scored_train = self.credit_scoring.transform(self.woe_encoder.inverse_transform(self.train_woe)) 510 | self.scored_valid = self.credit_scoring.transform(self.woe_encoder.inverse_transform(self.valid_woe)) 511 | 512 | self.scored_train['score'] = self.scored_train['score'].astype(float) 513 | self.scored_valid['score'] = self.scored_valid['score'].astype(float) 514 | 515 | self.scored_train['score'] = self.scored_train['score'].clip(self.min_score, self.max_score) 516 | self.scored_valid['score'] = self.scored_valid['score'].clip(self.min_score, self.max_score) 517 | 518 | logger.info(f'Transformed min score: {self.scored_train["score"].min()}') 519 | logger.info(f'Transformed max score: {self.scored_train["score"].max()}') 520 | 521 | for k in [5,10]: 522 | step = (self.max_score-self.min_score)/k 523 | bins = np.arange(self.min_score, self.max_score+step, step) 524 | self.scored_train[f'range_score_{k}'] = pd.cut(self.scored_train['score'],bins=bins,include_lowest=True) 525 | self.scored_valid[f'range_score_{k}'] = pd.cut(self.scored_valid['score'],bins=bins,include_lowest=True) 526 | logger.info("Scoring...Done") 527 | 528 | def __reporting(self): 529 | logger.info("Reporting...") 530 | # Distribution images 531 | logger.info("Score Distribution images...") 532 | fig, ax = plt.subplots() 533 | sns.histplot(self.scored_train['score'], kde=False, stat='density', ax=ax, label='Train') 534 | sns.histplot(self.scored_valid['score'], kde=False, stat='density', ax=ax, label='Valid') 535 | ax.set_title("Score histogram") 536 | ax.legend() 537 | self.score_histogram_fig = fig 538 | 539 | fig, ax = plt.subplots() 540 | sns.kdeplot(self.scored_train['score'], ax=ax, label='Train') 541 | sns.kdeplot(self.scored_valid['score'], ax=ax, label='Valid') 542 | ax.set_title("Score KDE") 543 | ax.legend() 544 | self.score_kde_fig = fig 545 | # Event rate images 546 | logger.info("Event rate images...") 547 | self.event_rate_figs = [] 548 | for k in [5,10]: 549 | fig, ax = plt.subplots() 550 | ax = pd.crosstab(self.scored_train[f'range_score_{k}'], self.train[self.target], normalize='index').plot(kind='bar', stacked=True, ax=ax) 551 | ax.set_title(f"Event rate by score range ({k} bins)") 552 | setattr(self,f'event_rate_fig_{k}',fig) 553 | self.event_rate_figs.append(fig) 554 | # IV report 555 | logger.info("IV report...") 556 | iv_reports = [] 557 | if hasattr(self, 'iv_report_continuous'): 558 | iv_reports.append(self.iv_report_continuous[['root_feature','iv']].rename(columns={'root_feature':'feature'})) 559 | if hasattr(self, 'iv_report_discrete'): 560 | iv_reports.append(self.iv_report_discrete[['feature','iv']]) 561 | 562 | if 
iv_reports: 563 | self.iv_report = pd.concat(iv_reports, axis=0).sort_values('iv', ascending=False) 564 | fig, ax = plt.subplots() 565 | sns.barplot(data=self.iv_report, x='iv', y='feature', ax=ax) 566 | ax.set_title("IV report") 567 | else: 568 | self.iv_report = pd.DataFrame(columns=['feature', 'iv']) 569 | fig, ax = plt.subplots() 570 | ax.text(0.5, 0.5, "No IV report generated.", horizontalalignment='center', verticalalignment='center') 571 | ax.set_title("IV report") 572 | self.iv_report_fig = fig 573 | # ROC Curve 574 | logger.info("ROC Curve...") 575 | fpr_train, tpr_train, _ = roc_curve(self.train[self.target], self.model.predict_proba(self.train_woe)[:, 1]) 576 | fpr_valid, tpr_valid, _ = roc_curve(self.valid[self.target], self.model.predict_proba(self.valid_woe)[:, 1]) 577 | roc_auc_train = auc(fpr_train, tpr_train) 578 | roc_auc_valid = auc(fpr_valid, tpr_valid) 579 | fig, ax = plt.subplots() 580 | ax.plot(fpr_train, tpr_train, color='blue', lw=2, label=f'Train ROC curve (area = {roc_auc_train:.2f})') 581 | ax.plot(fpr_valid, tpr_valid, color='red', lw=2, label=f'Valid ROC curve (area = {roc_auc_valid:.2f})') 582 | ax.plot([0, 1], [0, 1], color='grey', lw=2, linestyle='--') 583 | ax.set_xlim([0.0, 1.0]) 584 | ax.set_ylim([0.0, 1.05]) 585 | ax.set_xlabel('False Positive Rate') 586 | ax.set_ylabel('True Positive Rate') 587 | ax.set_title('Receiver Operating Characteristic') 588 | ax.legend(loc="lower right") 589 | self.roc_curve_fig = fig 590 | logger.info("ROC Curve...Done") 591 | 592 | def save_reports(self,folder='.'): 593 | if not self.create_reporting: 594 | raise RuntimeError("Reports were not generated. Please run fit() with create_reporting=True before saving reports.") 595 | 596 | if not os.path.exists(folder): 597 | os.makedirs(folder) 598 | self.score_histogram_fig.savefig(f'{folder}/score_histogram.png') 599 | self.score_kde_fig.savefig(f'{folder}/score_kde.png') 600 | self.iv_report_fig.savefig(f'{folder}/iv_report.png') 601 | for k in [5,10]: 602 | getattr(self,f'event_rate_fig_{k}').savefig(f'{folder}/event_rate_{k}.png') 603 | self.roc_curve_fig.savefig(f'{folder}/roc_curve.png') 604 | logger.info(f"Reports saved in {folder}") 605 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | GNU GENERAL PUBLIC LICENSE 2 | Version 3, 29 June 2007 3 | 4 | Copyright (C) 2007 Free Software Foundation, Inc. 5 | Everyone is permitted to copy and distribute verbatim copies 6 | of this license document, but changing it is not allowed. 7 | 8 | Preamble 9 | 10 | The GNU General Public License is a free, copyleft license for 11 | software and other kinds of works. 12 | 13 | The licenses for most software and other practical works are designed 14 | to take away your freedom to share and change the works. By contrast, 15 | the GNU General Public License is intended to guarantee your freedom to 16 | share and change all versions of a program--to make sure it remains free 17 | software for all its users. We, the Free Software Foundation, use the 18 | GNU General Public License for most of our software; it applies also to 19 | any other work released this way by its authors. You can apply it to 20 | your programs, too. 21 | 22 | When we speak of free software, we are referring to freedom, not 23 | price. 
Our General Public Licenses are designed to make sure that you 24 | have the freedom to distribute copies of free software (and charge for 25 | them if you wish), that you receive source code or can get it if you 26 | want it, that you can change the software or use pieces of it in new 27 | free programs, and that you know you can do these things. 28 | 29 | To protect your rights, we need to prevent others from denying you 30 | these rights or asking you to surrender the rights. Therefore, you have 31 | certain responsibilities if you distribute copies of the software, or if 32 | you modify it: responsibilities to respect the freedom of others. 33 | 34 | For example, if you distribute copies of such a program, whether 35 | gratis or for a fee, you must pass on to the recipients the same 36 | freedoms that you received. You must make sure that they, too, receive 37 | or can get the source code. And you must show them these terms so they 38 | know their rights. 39 | 40 | Developers that use the GNU GPL protect your rights with two steps: 41 | (1) assert copyright on the software, and (2) offer you this License 42 | giving you legal permission to copy, distribute and/or modify it. 43 | 44 | For the developers' and authors' protection, the GPL clearly explains 45 | that there is no warranty for this free software. For both users' and 46 | authors' sake, the GPL requires that modified versions be marked as 47 | changed, so that their problems will not be attributed erroneously to 48 | authors of previous versions. 49 | 50 | Some devices are designed to deny users access to install or run 51 | modified versions of the software inside them, although the manufacturer 52 | can do so. This is fundamentally incompatible with the aim of 53 | protecting users' freedom to change the software. The systematic 54 | pattern of such abuse occurs in the area of products for individuals to 55 | use, which is precisely where it is most unacceptable. Therefore, we 56 | have designed this version of the GPL to prohibit the practice for those 57 | products. If such problems arise substantially in other domains, we 58 | stand ready to extend this provision to those domains in future versions 59 | of the GPL, as needed to protect the freedom of users. 60 | 61 | Finally, every program is threatened constantly by software patents. 62 | States should not allow patents to restrict development and use of 63 | software on general-purpose computers, but in those that do, we wish to 64 | avoid the special danger that patents applied to a free program could 65 | make it effectively proprietary. To prevent this, the GPL assures that 66 | patents cannot be used to render the program non-free. 67 | 68 | The precise terms and conditions for copying, distribution and 69 | modification follow. 70 | 71 | TERMS AND CONDITIONS 72 | 73 | 0. Definitions. 74 | 75 | "This License" refers to version 3 of the GNU General Public License. 76 | 77 | "Copyright" also means copyright-like laws that apply to other kinds of 78 | works, such as semiconductor masks. 79 | 80 | "The Program" refers to any copyrightable work licensed under this 81 | License. Each licensee is addressed as "you". "Licensees" and 82 | "recipients" may be individuals or organizations. 83 | 84 | To "modify" a work means to copy from or adapt all or part of the work 85 | in a fashion requiring copyright permission, other than the making of an 86 | exact copy. The resulting work is called a "modified version" of the 87 | earlier work or a work "based on" the earlier work. 
88 | 89 | A "covered work" means either the unmodified Program or a work based 90 | on the Program. 91 | 92 | To "propagate" a work means to do anything with it that, without 93 | permission, would make you directly or secondarily liable for 94 | infringement under applicable copyright law, except executing it on a 95 | computer or modifying a private copy. Propagation includes copying, 96 | distribution (with or without modification), making available to the 97 | public, and in some countries other activities as well. 98 | 99 | To "convey" a work means any kind of propagation that enables other 100 | parties to make or receive copies. Mere interaction with a user through 101 | a computer network, with no transfer of a copy, is not conveying. 102 | 103 | An interactive user interface displays "Appropriate Legal Notices" 104 | to the extent that it includes a convenient and prominently visible 105 | feature that (1) displays an appropriate copyright notice, and (2) 106 | tells the user that there is no warranty for the work (except to the 107 | extent that warranties are provided), that licensees may convey the 108 | work under this License, and how to view a copy of this License. If 109 | the interface presents a list of user commands or options, such as a 110 | menu, a prominent item in the list meets this criterion. 111 | 112 | 1. Source Code. 113 | 114 | The "source code" for a work means the preferred form of the work 115 | for making modifications to it. "Object code" means any non-source 116 | form of a work. 117 | 118 | A "Standard Interface" means an interface that either is an official 119 | standard defined by a recognized standards body, or, in the case of 120 | interfaces specified for a particular programming language, one that 121 | is widely used among developers working in that language. 122 | 123 | The "System Libraries" of an executable work include anything, other 124 | than the work as a whole, that (a) is included in the normal form of 125 | packaging a Major Component, but which is not part of that Major 126 | Component, and (b) serves only to enable use of the work with that 127 | Major Component, or to implement a Standard Interface for which an 128 | implementation is available to the public in source code form. A 129 | "Major Component", in this context, means a major essential component 130 | (kernel, window system, and so on) of the specific operating system 131 | (if any) on which the executable work runs, or a compiler used to 132 | produce the work, or an object code interpreter used to run it. 133 | 134 | The "Corresponding Source" for a work in object code form means all 135 | the source code needed to generate, install, and (for an executable 136 | work) run the object code and to modify the work, including scripts to 137 | control those activities. However, it does not include the work's 138 | System Libraries, or general-purpose tools or generally available free 139 | programs which are used unmodified in performing those activities but 140 | which are not part of the work. For example, Corresponding Source 141 | includes interface definition files associated with source files for 142 | the work, and the source code for shared libraries and dynamically 143 | linked subprograms that the work is specifically designed to require, 144 | such as by intimate data communication or control flow between those 145 | subprograms and other parts of the work. 
146 | 147 | The Corresponding Source need not include anything that users 148 | can regenerate automatically from other parts of the Corresponding 149 | Source. 150 | 151 | The Corresponding Source for a work in source code form is that 152 | same work. 153 | 154 | 2. Basic Permissions. 155 | 156 | All rights granted under this License are granted for the term of 157 | copyright on the Program, and are irrevocable provided the stated 158 | conditions are met. This License explicitly affirms your unlimited 159 | permission to run the unmodified Program. The output from running a 160 | covered work is covered by this License only if the output, given its 161 | content, constitutes a covered work. This License acknowledges your 162 | rights of fair use or other equivalent, as provided by copyright law. 163 | 164 | You may make, run and propagate covered works that you do not 165 | convey, without conditions so long as your license otherwise remains 166 | in force. You may convey covered works to others for the sole purpose 167 | of having them make modifications exclusively for you, or provide you 168 | with facilities for running those works, provided that you comply with 169 | the terms of this License in conveying all material for which you do 170 | not control copyright. Those thus making or running the covered works 171 | for you must do so exclusively on your behalf, under your direction 172 | and control, on terms that prohibit them from making any copies of 173 | your copyrighted material outside their relationship with you. 174 | 175 | Conveying under any other circumstances is permitted solely under 176 | the conditions stated below. Sublicensing is not allowed; section 10 177 | makes it unnecessary. 178 | 179 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law. 180 | 181 | No covered work shall be deemed part of an effective technological 182 | measure under any applicable law fulfilling obligations under article 183 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or 184 | similar laws prohibiting or restricting circumvention of such 185 | measures. 186 | 187 | When you convey a covered work, you waive any legal power to forbid 188 | circumvention of technological measures to the extent such circumvention 189 | is effected by exercising rights under this License with respect to 190 | the covered work, and you disclaim any intention to limit operation or 191 | modification of the work as a means of enforcing, against the work's 192 | users, your or third parties' legal rights to forbid circumvention of 193 | technological measures. 194 | 195 | 4. Conveying Verbatim Copies. 196 | 197 | You may convey verbatim copies of the Program's source code as you 198 | receive it, in any medium, provided that you conspicuously and 199 | appropriately publish on each copy an appropriate copyright notice; 200 | keep intact all notices stating that this License and any 201 | non-permissive terms added in accord with section 7 apply to the code; 202 | keep intact all notices of the absence of any warranty; and give all 203 | recipients a copy of this License along with the Program. 204 | 205 | You may charge any price or no price for each copy that you convey, 206 | and you may offer support or warranty protection for a fee. 207 | 208 | 5. Conveying Modified Source Versions. 
209 | 210 | You may convey a work based on the Program, or the modifications to 211 | produce it from the Program, in the form of source code under the 212 | terms of section 4, provided that you also meet all of these conditions: 213 | 214 | a) The work must carry prominent notices stating that you modified 215 | it, and giving a relevant date. 216 | 217 | b) The work must carry prominent notices stating that it is 218 | released under this License and any conditions added under section 219 | 7. This requirement modifies the requirement in section 4 to 220 | "keep intact all notices". 221 | 222 | c) You must license the entire work, as a whole, under this 223 | License to anyone who comes into possession of a copy. This 224 | License will therefore apply, along with any applicable section 7 225 | additional terms, to the whole of the work, and all its parts, 226 | regardless of how they are packaged. This License gives no 227 | permission to license the work in any other way, but it does not 228 | invalidate such permission if you have separately received it. 229 | 230 | d) If the work has interactive user interfaces, each must display 231 | Appropriate Legal Notices; however, if the Program has interactive 232 | interfaces that do not display Appropriate Legal Notices, your 233 | work need not make them do so. 234 | 235 | A compilation of a covered work with other separate and independent 236 | works, which are not by their nature extensions of the covered work, 237 | and which are not combined with it such as to form a larger program, 238 | in or on a volume of a storage or distribution medium, is called an 239 | "aggregate" if the compilation and its resulting copyright are not 240 | used to limit the access or legal rights of the compilation's users 241 | beyond what the individual works permit. Inclusion of a covered work 242 | in an aggregate does not cause this License to apply to the other 243 | parts of the aggregate. 244 | 245 | 6. Conveying Non-Source Forms. 246 | 247 | You may convey a covered work in object code form under the terms 248 | of sections 4 and 5, provided that you also convey the 249 | machine-readable Corresponding Source under the terms of this License, 250 | in one of these ways: 251 | 252 | a) Convey the object code in, or embodied in, a physical product 253 | (including a physical distribution medium), accompanied by the 254 | Corresponding Source fixed on a durable physical medium 255 | customarily used for software interchange. 256 | 257 | b) Convey the object code in, or embodied in, a physical product 258 | (including a physical distribution medium), accompanied by a 259 | written offer, valid for at least three years and valid for as 260 | long as you offer spare parts or customer support for that product 261 | model, to give anyone who possesses the object code either (1) a 262 | copy of the Corresponding Source for all the software in the 263 | product that is covered by this License, on a durable physical 264 | medium customarily used for software interchange, for a price no 265 | more than your reasonable cost of physically performing this 266 | conveying of source, or (2) access to copy the 267 | Corresponding Source from a network server at no charge. 268 | 269 | c) Convey individual copies of the object code with a copy of the 270 | written offer to provide the Corresponding Source. 
This 271 | alternative is allowed only occasionally and noncommercially, and 272 | only if you received the object code with such an offer, in accord 273 | with subsection 6b. 274 | 275 | d) Convey the object code by offering access from a designated 276 | place (gratis or for a charge), and offer equivalent access to the 277 | Corresponding Source in the same way through the same place at no 278 | further charge. You need not require recipients to copy the 279 | Corresponding Source along with the object code. If the place to 280 | copy the object code is a network server, the Corresponding Source 281 | may be on a different server (operated by you or a third party) 282 | that supports equivalent copying facilities, provided you maintain 283 | clear directions next to the object code saying where to find the 284 | Corresponding Source. Regardless of what server hosts the 285 | Corresponding Source, you remain obligated to ensure that it is 286 | available for as long as needed to satisfy these requirements. 287 | 288 | e) Convey the object code using peer-to-peer transmission, provided 289 | you inform other peers where the object code and Corresponding 290 | Source of the work are being offered to the general public at no 291 | charge under subsection 6d. 292 | 293 | A separable portion of the object code, whose source code is excluded 294 | from the Corresponding Source as a System Library, need not be 295 | included in conveying the object code work. 296 | 297 | A "User Product" is either (1) a "consumer product", which means any 298 | tangible personal property which is normally used for personal, family, 299 | or household purposes, or (2) anything designed or sold for incorporation 300 | into a dwelling. In determining whether a product is a consumer product, 301 | doubtful cases shall be resolved in favor of coverage. For a particular 302 | product received by a particular user, "normally used" refers to a 303 | typical or common use of that class of product, regardless of the status 304 | of the particular user or of the way in which the particular user 305 | actually uses, or expects or is expected to use, the product. A product 306 | is a consumer product regardless of whether the product has substantial 307 | commercial, industrial or non-consumer uses, unless such uses represent 308 | the only significant mode of use of the product. 309 | 310 | "Installation Information" for a User Product means any methods, 311 | procedures, authorization keys, or other information required to install 312 | and execute modified versions of a covered work in that User Product from 313 | a modified version of its Corresponding Source. The information must 314 | suffice to ensure that the continued functioning of the modified object 315 | code is in no case prevented or interfered with solely because 316 | modification has been made. 317 | 318 | If you convey an object code work under this section in, or with, or 319 | specifically for use in, a User Product, and the conveying occurs as 320 | part of a transaction in which the right of possession and use of the 321 | User Product is transferred to the recipient in perpetuity or for a 322 | fixed term (regardless of how the transaction is characterized), the 323 | Corresponding Source conveyed under this section must be accompanied 324 | by the Installation Information. 
But this requirement does not apply 325 | if neither you nor any third party retains the ability to install 326 | modified object code on the User Product (for example, the work has 327 | been installed in ROM). 328 | 329 | The requirement to provide Installation Information does not include a 330 | requirement to continue to provide support service, warranty, or updates 331 | for a work that has been modified or installed by the recipient, or for 332 | the User Product in which it has been modified or installed. Access to a 333 | network may be denied when the modification itself materially and 334 | adversely affects the operation of the network or violates the rules and 335 | protocols for communication across the network. 336 | 337 | Corresponding Source conveyed, and Installation Information provided, 338 | in accord with this section must be in a format that is publicly 339 | documented (and with an implementation available to the public in 340 | source code form), and must require no special password or key for 341 | unpacking, reading or copying. 342 | 343 | 7. Additional Terms. 344 | 345 | "Additional permissions" are terms that supplement the terms of this 346 | License by making exceptions from one or more of its conditions. 347 | Additional permissions that are applicable to the entire Program shall 348 | be treated as though they were included in this License, to the extent 349 | that they are valid under applicable law. If additional permissions 350 | apply only to part of the Program, that part may be used separately 351 | under those permissions, but the entire Program remains governed by 352 | this License without regard to the additional permissions. 353 | 354 | When you convey a copy of a covered work, you may at your option 355 | remove any additional permissions from that copy, or from any part of 356 | it. (Additional permissions may be written to require their own 357 | removal in certain cases when you modify the work.) You may place 358 | additional permissions on material, added by you to a covered work, 359 | for which you have or can give appropriate copyright permission. 360 | 361 | Notwithstanding any other provision of this License, for material you 362 | add to a covered work, you may (if authorized by the copyright holders of 363 | that material) supplement the terms of this License with terms: 364 | 365 | a) Disclaiming warranty or limiting liability differently from the 366 | terms of sections 15 and 16 of this License; or 367 | 368 | b) Requiring preservation of specified reasonable legal notices or 369 | author attributions in that material or in the Appropriate Legal 370 | Notices displayed by works containing it; or 371 | 372 | c) Prohibiting misrepresentation of the origin of that material, or 373 | requiring that modified versions of such material be marked in 374 | reasonable ways as different from the original version; or 375 | 376 | d) Limiting the use for publicity purposes of names of licensors or 377 | authors of the material; or 378 | 379 | e) Declining to grant rights under trademark law for use of some 380 | trade names, trademarks, or service marks; or 381 | 382 | f) Requiring indemnification of licensors and authors of that 383 | material by anyone who conveys the material (or modified versions of 384 | it) with contractual assumptions of liability to the recipient, for 385 | any liability that these contractual assumptions directly impose on 386 | those licensors and authors. 
387 | 388 | All other non-permissive additional terms are considered "further 389 | restrictions" within the meaning of section 10. If the Program as you 390 | received it, or any part of it, contains a notice stating that it is 391 | governed by this License along with a term that is a further 392 | restriction, you may remove that term. If a license document contains 393 | a further restriction but permits relicensing or conveying under this 394 | License, you may add to a covered work material governed by the terms 395 | of that license document, provided that the further restriction does 396 | not survive such relicensing or conveying. 397 | 398 | If you add terms to a covered work in accord with this section, you 399 | must place, in the relevant source files, a statement of the 400 | additional terms that apply to those files, or a notice indicating 401 | where to find the applicable terms. 402 | 403 | Additional terms, permissive or non-permissive, may be stated in the 404 | form of a separately written license, or stated as exceptions; 405 | the above requirements apply either way. 406 | 407 | 8. Termination. 408 | 409 | You may not propagate or modify a covered work except as expressly 410 | provided under this License. Any attempt otherwise to propagate or 411 | modify it is void, and will automatically terminate your rights under 412 | this License (including any patent licenses granted under the third 413 | paragraph of section 11). 414 | 415 | However, if you cease all violation of this License, then your 416 | license from a particular copyright holder is reinstated (a) 417 | provisionally, unless and until the copyright holder explicitly and 418 | finally terminates your license, and (b) permanently, if the copyright 419 | holder fails to notify you of the violation by some reasonable means 420 | prior to 60 days after the cessation. 421 | 422 | Moreover, your license from a particular copyright holder is 423 | reinstated permanently if the copyright holder notifies you of the 424 | violation by some reasonable means, this is the first time you have 425 | received notice of violation of this License (for any work) from that 426 | copyright holder, and you cure the violation prior to 30 days after 427 | your receipt of the notice. 428 | 429 | Termination of your rights under this section does not terminate the 430 | licenses of parties who have received copies or rights from you under 431 | this License. If your rights have been terminated and not permanently 432 | reinstated, you do not qualify to receive new licenses for the same 433 | material under section 10. 434 | 435 | 9. Acceptance Not Required for Having Copies. 436 | 437 | You are not required to accept this License in order to receive or 438 | run a copy of the Program. Ancillary propagation of a covered work 439 | occurring solely as a consequence of using peer-to-peer transmission 440 | to receive a copy likewise does not require acceptance. However, 441 | nothing other than this License grants you permission to propagate or 442 | modify any covered work. These actions infringe copyright if you do 443 | not accept this License. Therefore, by modifying or propagating a 444 | covered work, you indicate your acceptance of this License to do so. 445 | 446 | 10. Automatic Licensing of Downstream Recipients. 447 | 448 | Each time you convey a covered work, the recipient automatically 449 | receives a license from the original licensors, to run, modify and 450 | propagate that work, subject to this License. 
You are not responsible 451 | for enforcing compliance by third parties with this License. 452 | 453 | An "entity transaction" is a transaction transferring control of an 454 | organization, or substantially all assets of one, or subdividing an 455 | organization, or merging organizations. If propagation of a covered 456 | work results from an entity transaction, each party to that 457 | transaction who receives a copy of the work also receives whatever 458 | licenses to the work the party's predecessor in interest had or could 459 | give under the previous paragraph, plus a right to possession of the 460 | Corresponding Source of the work from the predecessor in interest, if 461 | the predecessor has it or can get it with reasonable efforts. 462 | 463 | You may not impose any further restrictions on the exercise of the 464 | rights granted or affirmed under this License. For example, you may 465 | not impose a license fee, royalty, or other charge for exercise of 466 | rights granted under this License, and you may not initiate litigation 467 | (including a cross-claim or counterclaim in a lawsuit) alleging that 468 | any patent claim is infringed by making, using, selling, offering for 469 | sale, or importing the Program or any portion of it. 470 | 471 | 11. Patents. 472 | 473 | A "contributor" is a copyright holder who authorizes use under this 474 | License of the Program or a work on which the Program is based. The 475 | work thus licensed is called the contributor's "contributor version". 476 | 477 | A contributor's "essential patent claims" are all patent claims 478 | owned or controlled by the contributor, whether already acquired or 479 | hereafter acquired, that would be infringed by some manner, permitted 480 | by this License, of making, using, or selling its contributor version, 481 | but do not include claims that would be infringed only as a 482 | consequence of further modification of the contributor version. For 483 | purposes of this definition, "control" includes the right to grant 484 | patent sublicenses in a manner consistent with the requirements of 485 | this License. 486 | 487 | Each contributor grants you a non-exclusive, worldwide, royalty-free 488 | patent license under the contributor's essential patent claims, to 489 | make, use, sell, offer for sale, import and otherwise run, modify and 490 | propagate the contents of its contributor version. 491 | 492 | In the following three paragraphs, a "patent license" is any express 493 | agreement or commitment, however denominated, not to enforce a patent 494 | (such as an express permission to practice a patent or covenant not to 495 | sue for patent infringement). To "grant" such a patent license to a 496 | party means to make such an agreement or commitment not to enforce a 497 | patent against the party. 498 | 499 | If you convey a covered work, knowingly relying on a patent license, 500 | and the Corresponding Source of the work is not available for anyone 501 | to copy, free of charge and under the terms of this License, through a 502 | publicly available network server or other readily accessible means, 503 | then you must either (1) cause the Corresponding Source to be so 504 | available, or (2) arrange to deprive yourself of the benefit of the 505 | patent license for this particular work, or (3) arrange, in a manner 506 | consistent with the requirements of this License, to extend the patent 507 | license to downstream recipients. 
"Knowingly relying" means you have 508 | actual knowledge that, but for the patent license, your conveying the 509 | covered work in a country, or your recipient's use of the covered work 510 | in a country, would infringe one or more identifiable patents in that 511 | country that you have reason to believe are valid. 512 | 513 | If, pursuant to or in connection with a single transaction or 514 | arrangement, you convey, or propagate by procuring conveyance of, a 515 | covered work, and grant a patent license to some of the parties 516 | receiving the covered work authorizing them to use, propagate, modify 517 | or convey a specific copy of the covered work, then the patent license 518 | you grant is automatically extended to all recipients of the covered 519 | work and works based on it. 520 | 521 | A patent license is "discriminatory" if it does not include within 522 | the scope of its coverage, prohibits the exercise of, or is 523 | conditioned on the non-exercise of one or more of the rights that are 524 | specifically granted under this License. You may not convey a covered 525 | work if you are a party to an arrangement with a third party that is 526 | in the business of distributing software, under which you make payment 527 | to the third party based on the extent of your activity of conveying 528 | the work, and under which the third party grants, to any of the 529 | parties who would receive the covered work from you, a discriminatory 530 | patent license (a) in connection with copies of the covered work 531 | conveyed by you (or copies made from those copies), or (b) primarily 532 | for and in connection with specific products or compilations that 533 | contain the covered work, unless you entered into that arrangement, 534 | or that patent license was granted, prior to 28 March 2007. 535 | 536 | Nothing in this License shall be construed as excluding or limiting 537 | any implied license or other defenses to infringement that may 538 | otherwise be available to you under applicable patent law. 539 | 540 | 12. No Surrender of Others' Freedom. 541 | 542 | If conditions are imposed on you (whether by court order, agreement or 543 | otherwise) that contradict the conditions of this License, they do not 544 | excuse you from the conditions of this License. If you cannot convey a 545 | covered work so as to satisfy simultaneously your obligations under this 546 | License and any other pertinent obligations, then as a consequence you may 547 | not convey it at all. For example, if you agree to terms that obligate you 548 | to collect a royalty for further conveying from those to whom you convey 549 | the Program, the only way you could satisfy both those terms and this 550 | License would be to refrain entirely from conveying the Program. 551 | 552 | 13. Use with the GNU Affero General Public License. 553 | 554 | Notwithstanding any other provision of this License, you have 555 | permission to link or combine any covered work with a work licensed 556 | under version 3 of the GNU Affero General Public License into a single 557 | combined work, and to convey the resulting work. The terms of this 558 | License will continue to apply to the part which is the covered work, 559 | but the special requirements of the GNU Affero General Public License, 560 | section 13, concerning interaction through a network will apply to the 561 | combination as such. 562 | 563 | 14. Revised Versions of this License. 
564 | 565 | The Free Software Foundation may publish revised and/or new versions of 566 | the GNU General Public License from time to time. Such new versions will 567 | be similar in spirit to the present version, but may differ in detail to 568 | address new problems or concerns. 569 | 570 | Each version is given a distinguishing version number. If the 571 | Program specifies that a certain numbered version of the GNU General 572 | Public License "or any later version" applies to it, you have the 573 | option of following the terms and conditions either of that numbered 574 | version or of any later version published by the Free Software 575 | Foundation. If the Program does not specify a version number of the 576 | GNU General Public License, you may choose any version ever published 577 | by the Free Software Foundation. 578 | 579 | If the Program specifies that a proxy can decide which future 580 | versions of the GNU General Public License can be used, that proxy's 581 | public statement of acceptance of a version permanently authorizes you 582 | to choose that version for the Program. 583 | 584 | Later license versions may give you additional or different 585 | permissions. However, no additional obligations are imposed on any 586 | author or copyright holder as a result of your choosing to follow a 587 | later version. 588 | 589 | 15. Disclaimer of Warranty. 590 | 591 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY 592 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT 593 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY 594 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, 595 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 596 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM 597 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF 598 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 599 | 600 | 16. Limitation of Liability. 601 | 602 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 603 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS 604 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY 605 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE 606 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF 607 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD 608 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), 609 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF 610 | SUCH DAMAGES. 611 | 612 | 17. Interpretation of Sections 15 and 16. 613 | 614 | If the disclaimer of warranty and limitation of liability provided 615 | above cannot be given local legal effect according to their terms, 616 | reviewing courts shall apply local law that most closely approximates 617 | an absolute waiver of all civil liability in connection with the 618 | Program, unless a warranty or assumption of liability accompanies a 619 | copy of the Program in return for a fee. 620 | 621 | END OF TERMS AND CONDITIONS 622 | 623 | How to Apply These Terms to Your New Programs 624 | 625 | If you develop a new program, and you want it to be of the greatest 626 | possible use to the public, the best way to achieve this is to make it 627 | free software which everyone can redistribute and change under these terms. 
628 | 629 | To do so, attach the following notices to the program. It is safest 630 | to attach them to the start of each source file to most effectively 631 | state the exclusion of warranty; and each file should have at least 632 | the "copyright" line and a pointer to where the full notice is found. 633 | 634 | <one line to give the program's name and a brief idea of what it does.> 635 | Copyright (C) <year> <name of author> 636 | 637 | This program is free software: you can redistribute it and/or modify 638 | it under the terms of the GNU General Public License as published by 639 | the Free Software Foundation, either version 3 of the License, or 640 | (at your option) any later version. 641 | 642 | This program is distributed in the hope that it will be useful, 643 | but WITHOUT ANY WARRANTY; without even the implied warranty of 644 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 645 | GNU General Public License for more details. 646 | 647 | You should have received a copy of the GNU General Public License 648 | along with this program. If not, see <https://www.gnu.org/licenses/>. 649 | 650 | Also add information on how to contact you by electronic and paper mail. 651 | 652 | If the program does terminal interaction, make it output a short 653 | notice like this when it starts in an interactive mode: 654 | 655 | <program> Copyright (C) <year> <name of author> 656 | This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 657 | This is free software, and you are welcome to redistribute it 658 | under certain conditions; type `show c' for details. 659 | 660 | The hypothetical commands `show w' and `show c' should show the appropriate 661 | parts of the General Public License. Of course, your program's commands 662 | might be different; for a GUI interface, you would use an "about box". 663 | 664 | You should also get your employer (if you work as a programmer) or school, 665 | if any, to sign a "copyright disclaimer" for the program, if necessary. 666 | For more information on this, and how to apply and follow the GNU GPL, see 667 | <https://www.gnu.org/licenses/>. 668 | 669 | The GNU General Public License does not permit incorporating your program 670 | into proprietary programs. If your program is a subroutine library, you 671 | may consider it more useful to permit linking proprietary applications with 672 | the library. If this is what you want to do, use the GNU Lesser General 673 | Public License instead of this License. But first, please read 674 | <https://www.gnu.org/philosophy/why-not-lgpl.html>. 675 | --------------------------------------------------------------------------------