├── education dist.jpg
├── applicant income dist.jpg
├── gender distribution.jpg
├── loan approval status.jpg
├── marital status dist.jpg
├── self employment dist.jpg
├── loan status vs loan amount.jpg
├── loan status vs property area.jpg
├── loan status vs applicant income.jpg
├── loan status vs credit history.jpg
├── loan status vs coapplicant income.jpg
├── CONTRIBUTING.md
├── LICENSE
└── README.md

/education dist.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DavieObi/Loan-Prediction-Project/HEAD/education dist.jpg
--------------------------------------------------------------------------------
/applicant income dist.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DavieObi/Loan-Prediction-Project/HEAD/applicant income dist.jpg
--------------------------------------------------------------------------------
/gender distribution.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DavieObi/Loan-Prediction-Project/HEAD/gender distribution.jpg
--------------------------------------------------------------------------------
/loan approval status.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DavieObi/Loan-Prediction-Project/HEAD/loan approval status.jpg
--------------------------------------------------------------------------------
/marital status dist.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DavieObi/Loan-Prediction-Project/HEAD/marital status dist.jpg
--------------------------------------------------------------------------------
/self employment dist.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DavieObi/Loan-Prediction-Project/HEAD/self employment dist.jpg
--------------------------------------------------------------------------------
/loan status vs loan amount.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DavieObi/Loan-Prediction-Project/HEAD/loan status vs loan amount.jpg
--------------------------------------------------------------------------------
/loan status vs property area.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DavieObi/Loan-Prediction-Project/HEAD/loan status vs property area.jpg
--------------------------------------------------------------------------------
/loan status vs applicant income.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DavieObi/Loan-Prediction-Project/HEAD/loan status vs applicant income.jpg
--------------------------------------------------------------------------------
/loan status vs credit history.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DavieObi/Loan-Prediction-Project/HEAD/loan status vs credit history.jpg
--------------------------------------------------------------------------------
/loan status vs coapplicant income.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DavieObi/Loan-Prediction-Project/HEAD/loan status vs coapplicant income.jpg
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
# Contributing to Loan Prediction Project

Thank you for your interest in contributing to this project!
We welcome contributions that improve the functionality, performance, and usability of the application.

## How to Contribute

1. Fork the repository.
2. Create a new branch for your feature or bug fix.
3. Make your changes and commit them with descriptive messages.
4. Push your changes to your forked repository.
5. Submit a pull request detailing your changes.

Please ensure that your contributions adhere to the project's coding standards and include appropriate tests where applicable.

## Code of Conduct

By participating in this project, you agree to abide by our Code of Conduct. Let's keep the community respectful and welcoming for all.

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2025 DavieObi

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Loan Prediction Project

## Project Overview
This project focuses on building a machine learning model to predict loan approval status based on various applicant and loan-related features. The goal is to assist financial institutions in making informed decisions regarding loan applications, thereby reducing risk and improving efficiency in the loan approval process.

## Problem Statement
In the financial industry, accurately assessing the creditworthiness of loan applicants is crucial to minimize financial losses due to defaults. Manual assessment can be time-consuming and prone to human error. There is a need for an automated system that can reliably predict whether a loan application will be approved or rejected based on historical data.

## Project Objective
The primary objective of this project is to develop and evaluate a machine learning model capable of predicting loan approval status (`Loan_Status`) with high accuracy. This involves:
* Performing Exploratory Data Analysis (EDA) to understand the dataset characteristics and identify key relationships.
* Handling missing values and outliers in the dataset.
* Transforming categorical features into numerical format suitable for machine learning algorithms.
* Training a classification model (Support Vector Classifier) to predict loan status.
* Evaluating the model's performance using appropriate metrics such as accuracy, precision, recall, and F1-score.
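
The metrics named in the last objective can be made concrete with a small, dependency-free sketch. It is not the project's evaluation code: the function name `classification_metrics` and the toy labels are hypothetical, chosen only to show how accuracy, precision, recall, and F1-score are derived from the four confusion-matrix counts when one class (`Y` or `N`) is treated as the positive class.

```python
from collections import Counter


def classification_metrics(y_true, y_pred, positive="Y"):
    """Return accuracy, precision, recall and F1 with `positive` as the positive class."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1

    tp, fp, fn, tn = counts["tp"], counts["fp"], counts["fn"], counts["tn"]
    total = tp + fp + fn + tn

    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (hypothetical, not taken from the dataset).
y_true = ["Y", "Y", "Y", "N", "N", "Y", "N", "Y"]
y_pred = ["Y", "Y", "Y", "Y", "N", "Y", "Y", "N"]

metrics_y = classification_metrics(y_true, y_pred, positive="Y")
metrics_n = classification_metrics(y_true, y_pred, positive="N")
print(metrics_y)
print(metrics_n)
```

Computing the metrics once per class mirrors the per-class rows of a classification report: accuracy is the same for both classes, while precision and recall can differ sharply when the classes are imbalanced.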
## Column Dictionary

| Column Name         | Description                                            | Data Type (Original) |
|:--------------------|:-------------------------------------------------------|:---------------------|
| `Loan_ID`           | Unique identifier for each loan application.           | Object               |
| `Gender`            | Applicant's gender (Male/Female).                      | Object               |
| `Married`           | Applicant's marital status (Yes/No).                   | Object               |
| `Dependents`        | Number of dependents the applicant has (0, 1, 2, 3+).  | Object               |
| `Education`         | Applicant's education level (Graduate/Not Graduate).   | Object               |
| `Self_Employed`     | Whether the applicant is self-employed (Yes/No).       | Object               |
| `ApplicantIncome`   | Applicant's monthly income.                            | Integer              |
| `CoapplicantIncome` | Co-applicant's monthly income.                         | Float                |
| `LoanAmount`        | Loan amount in thousands.                              | Float                |
| `Loan_Amount_Term`  | Term of loan in months.                                | Float                |
| `Credit_History`    | Credit history meets guidelines (1: Yes, 0: No).       | Float                |
| `Property_Area`     | Location of the property (Urban/Semiurban/Rural).      | Object               |
| `Loan_Status`       | Loan approved (Y) or not (N) - **Target Variable**.    | Object               |

## Methodology and Steps Taken

1. **Data Loading and Initial Inspection:**
    * The `loan_prediction.csv` dataset was loaded into a Pandas DataFrame.
    * Initial inspection revealed 614 entries and 13 columns.
    * Identified columns with missing values: `Gender`, `Married`, `Dependents`, `Self_Employed`, `LoanAmount`, `Loan_Amount_Term`, and `Credit_History`.

2. **Data Preprocessing and Cleaning:**
    * **Irrelevant Feature Removal:** The `Loan_ID` column was dropped as it serves merely as an identifier and holds no predictive power.
    * **Handling Missing Values:**
        * Missing values in the categorical columns (`Gender`, `Married`, `Dependents`, `Self_Employed`) and in the discrete-valued numerical columns (`Loan_Amount_Term`, `Credit_History`) were imputed using the **mode** of their respective columns.
        * Missing values in the numerical `LoanAmount` column were imputed using the **median** of the column to mitigate the effect of potential outliers.
    * **Outlier Treatment:**
        * Outliers in `ApplicantIncome` were removed using the Interquartile Range (IQR) method (rows with values below $Q1 - 1.5 \times IQR$ or above $Q3 + 1.5 \times IQR$ were dropped).
        * Outliers in `CoapplicantIncome` were also removed using the IQR method.

3. **Exploratory Data Analysis (EDA) - Visualizations and Insights:**
    * **Loan Status Distribution:** The target variable `Loan_Status` was found to be imbalanced, with a significantly higher number of approved loans (`Y`) compared to rejected loans (`N`). This highlights the need to consider imbalance handling techniques during model development.
    * **Gender Distribution:** The dataset is predominantly composed of male applicants.
    * **Marital Status Distribution:** A majority of the loan applicants are married.
    * **Education Distribution:** Most applicants are Graduates.
    * **Self-Employment Distribution:** The vast majority of applicants are not self-employed.
    * **Applicant Income Distribution:** The original distribution was heavily right-skewed, with many applicants having lower incomes and a few high-income outliers. After outlier removal, the distribution became much less skewed.
    * **Loan Status vs. Applicant Income:** While both approved and rejected loans show wide income ranges, the median applicant income for approved loans appears slightly higher.
    * **Loan Status vs. Coapplicant Income:** A large number of applicants in both approved and rejected categories have zero coapplicant income.
      For those with coapplicant income, the distribution is still right-skewed, and coapplicant income alone doesn't appear to be a strong differentiating factor.
    * **Loan Status vs. Loan Amount:** The median loan amount for approved loans is slightly higher than for rejected loans. Both distributions are positively skewed.
    * **Loan Status vs. Credit History:** This was identified as a highly influential factor. Applicants with a credit history of 1 (good credit) are overwhelmingly likely to get their loan approved, whereas those with a credit history of 0 (bad credit) are largely rejected.
    * **Loan Status vs. Property Area:** Loans in Semiurban areas appear to have the highest approval rate, followed by Urban, then Rural areas.

4. **Feature Engineering and Scaling:**
    * **One-Hot Encoding:** Categorical features (`Gender`, `Married`, `Dependents`, `Education`, `Self_Employed`, `Property_Area`) were converted into numerical format using one-hot encoding.
    * **Feature Scaling:** Numerical features (`ApplicantIncome`, `CoapplicantIncome`, `LoanAmount`, `Loan_Amount_Term`, `Credit_History`) were scaled using `StandardScaler` to ensure that no single feature dominates the model training due to its scale.

5. **Model Training:**
    * The dataset was split into training (80%) and testing (20%) sets.
    * A **Support Vector Classifier (SVC)** model was chosen and trained on the scaled training data.

6. **Model Evaluation:**
    * Predictions (`y_pred`) were made on the test set.
    * The model's performance was evaluated using:
        * **Accuracy:** 0.8273
        * **Classification Report:**
            * Precision (N): 0.94, Recall (N): 0.49, F1-score (N): 0.64
            * Precision (Y): 0.80, Recall (Y): 0.99, F1-score (Y): 0.89
        * **Confusion Matrix:**
            * True Negatives (Correctly Rejected): 17
            * False Positives (Incorrectly Approved): 18
            * False Negatives (Incorrectly Rejected): 1
            * True Positives (Correctly Approved): 74

## Final Insights

* The dataset is imbalanced towards loan approvals, which the model learned to prioritize, resulting in very high recall for approved loans.
* `Credit_History` is by far the most significant predictor of loan status. A positive credit history strongly correlates with loan approval.
* Applicants from Semiurban areas show a higher propensity for loan approval.
* While income features (`ApplicantIncome`, `CoapplicantIncome`) are important, their distributions, even after outlier handling, show significant overlap between approved and rejected categories, suggesting they are not standalone determinants.
* The model excels at identifying loans that *will be approved* (high recall for 'Y'), and when it does predict a rejection, that prediction is usually correct (high precision for 'N').
* However, the model has a notable weakness in identifying *all* loans that should be rejected (low recall for 'N'). This means it incorrectly predicts a significant portion of actual rejections as approvals (false positives).

## Recommendations

1. **Address Class Imbalance:** Given the high recall for 'Y' and low recall for 'N', consider employing techniques like oversampling (e.g., SMOTE) or undersampling on the training data to balance the classes. This could improve the model's ability to correctly identify rejected loans.
2. **Feature Engineering:**
    * Create a `TotalIncome` feature by combining `ApplicantIncome` and `CoapplicantIncome`.
      This might provide a more holistic view of the applicant's financial capacity.
    * Derive `LoanAmount_per_Income` to understand the loan burden relative to income.
3. **Explore Other Models:** While SVC performs reasonably, experiment with other classification algorithms such as Logistic Regression, Decision Trees, Random Forests, or Gradient Boosting (e.g., XGBoost, LightGBM). These models might offer different trade-offs in precision and recall, and some are less sensitive to class imbalance or feature scaling.
4. **Hyperparameter Tuning:** Conduct thorough hyperparameter tuning for the chosen model (e.g., using GridSearchCV or RandomizedSearchCV for SVC) to optimize its performance further.
5. **Cost-Sensitive Learning:** If the cost of a false positive (approving a bad loan) is higher than a false negative (rejecting a good loan), consider using cost-sensitive learning techniques or adjusting the model's decision threshold.

## Conclusion

This project established a baseline machine learning model for loan prediction. The EDA provided valuable insights into the dataset, particularly highlighting the dominance of `Credit_History` and the class imbalance in loan approvals. The trained SVC model achieved an overall test accuracy of 0.8273 and is strong at predicting loan approvals. However, there is room for improvement, especially in correctly identifying all loan rejections. Implementing the recommendations above should make the model more reliable for loan decision-making.

--------------------------------------------------------------------------------
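
The decision-threshold adjustment mentioned in Recommendation 5 can be illustrated with a short, dependency-free sketch. Everything in it is an assumption made for illustration: the scores stand in for approval scores a classifier might produce (for instance from scikit-learn's `SVC.decision_function`, rescaled), the 5:1 cost ratio is arbitrary, and the helper names `expected_cost` and `best_threshold` are hypothetical. It only shows how a higher cost for false positives pushes the approval threshold upward.

```python
def expected_cost(y_true, scores, threshold, fp_cost=5.0, fn_cost=1.0):
    """Total cost when every application with score >= threshold is approved."""
    cost = 0.0
    for truth, score in zip(y_true, scores):
        approved = score >= threshold
        if approved and truth == "N":
            cost += fp_cost  # false positive: approved a loan that should be rejected
        elif not approved and truth == "Y":
            cost += fn_cost  # false negative: rejected a loan that should be approved
    return cost


def best_threshold(y_true, scores, candidates, **costs):
    """Return the candidate with the lowest total cost (ties go to the lowest threshold)."""
    return min(candidates, key=lambda t: (expected_cost(y_true, scores, t, **costs), t))


# Hypothetical labels and approval scores, not taken from the project's data.
y_true = ["Y", "Y", "N", "Y", "N", "Y", "N", "Y"]
scores = [0.95, 0.80, 0.70, 0.65, 0.55, 0.60, 0.30, 0.45]
candidates = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

default_cost = expected_cost(y_true, scores, 0.5)
chosen = best_threshold(y_true, scores, candidates)
chosen_cost = expected_cost(y_true, scores, chosen)
equal_cost_choice = best_threshold(y_true, scores, candidates, fp_cost=1.0, fn_cost=1.0)

print(default_cost, chosen, chosen_cost, equal_cost_choice)
```

With equal costs the search settles on a lower threshold than with the 5:1 ratio, which is the trade-off the recommendation describes: fewer bad loans approved at the price of rejecting more good ones. In practice the threshold would be selected on a validation split rather than on the test set.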