├── README.md └── Employee Performance Analysis(IABAC) └── Employee Performance Analysis(IABAC) └── Summary Of Project..ipynb /README.md: -------------------------------------------------------------------------------- 1 | # EMPLOYEE PERFORMANCE ANALYSIS 2 | 3 | ![image](https://github.com/user-attachments/assets/c7056587-b207-4f85-84b1-9bc6ae8bd168) 4 | 5 | ## **Project Summary:** 6 | ### Goal: Predict employee performance rating based on dataset features. 7 | - **Dataset**: INX Future Inc (IABAC). 1200 rows, 28 features. 8 | - **Key Insights**: Department-wise performance, top factors affecting employee performance, trained model for prediction, and recommendations to improve performance. 9 | 10 | --- 11 | 12 | **1. Requirements:** 13 | - Data provided by IABAC™ based on INX Future Inc. The project was conducted in Jupyter using Python. 14 | 15 | **2. Analysis:** 16 | - **Features:** Numerical, categorical, and ordinal. 17 | - **Important Features:** EmpNumber (not relevant), Gender, Education, MaritalStatus, JobRole, etc. 18 | 19 | **3. Data Analysis:** 20 | - **Univariate Analysis:** Distribution of categories and numerical data. 21 | - **Bivariate Analysis:** Relationship with performance rating. 22 | - **Multivariate Analysis:** Relationships across features. 23 | 24 | **4. Exploratory Data Analysis:** 25 | - Distribution and statistical checks for features. 26 | - Skewness and Kurtosis analysis for normality. 27 | - Visualizations for feature relationships. 28 | 29 | **5. Data Preprocessing:** 30 | - Missing value check, encoding categorical data, handling outliers, feature transformation, scaling. 31 | 32 | **6. Feature Selection:** 33 | - Dropped unique columns and performed PCA to reduce features from 28 to 25. 34 | 35 | **7. Machine Learning Model:** 36 | - Algorithms used: SVM, Random Forest, and Artificial Neural Networks (ANN). 37 | - Best Model: ANN with 95.80% accuracy. 38 | 39 | **8. 
Model Saving:** 40 | - Saved the trained model using Pickle for future predictions. 41 | 42 | --- 43 | 44 | **Goal 1: Department-wise Performances:** 45 | - Violinplot and Countplot used for performance distribution across departments. 46 | - Findings: Sales and Development departments have the highest performers. Female employees in HR perform well. 47 | 48 | **Goal 2: Top 3 Factors Affecting Performance:** 49 | - Important factors: Environment satisfaction, salary hike percentage, and experience in current role. 50 | 51 | **Goal 3: Trained Model for Prediction:** 52 | - Trained models: SVC (98.28%), Random Forest (95.61%), ANN (95.80%). 53 | 54 | **Goal 4: Recommendations to Improve Performance:** 55 | - Focus on environment satisfaction, salary hikes, and work-life balance. 56 | - Promote employees regularly, focus on female candidates in HR, and improve job satisfaction and relationships. 57 | 58 | -------------------------------------------------------------------------------- /Employee Performance Analysis(IABAC)/Employee Performance Analysis(IABAC)/Summary Of Project..ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "460b1b6c", 6 | "metadata": {}, 7 | "source": [ 8 | "# Employee Performance Analysis\n", 9 | "## INX Future Inc.\n", 10 | "\n", 11 | "* Candidate Name : Sakshi Tanwar\n", 12 | "\n", 13 | "* Candidate E-Mail : tanwarsakshi1717157@gmail.com\n", 14 | "\n", 15 | "* Project Code : 10281\n", 16 | "\n", 17 | "* REP Name : DataMites™ Solutions Pvt Ltd\n", 18 | "\n", 19 | "* Assesment ID : E10901-PR2-V18\n", 20 | "\n", 21 | "* Module : Certified Data Scientist - Project\n", 22 | "\n", 23 | "* Exam Format : Open Project- IABAC™ Project Submission\n", 24 | "\n", 25 | "* Project Assessment : IABAC™ \n", 26 | "\n", 27 | "* Registered Trainer : Ashok Kumar A\n", 28 | "\n", 29 | "* Submission Deadline Date: 22-Dec-2022 " 30 | ] 31 | }, 32 | { 33 | 
"cell_type": "markdown", 34 | "id": "cadca078", 35 | "metadata": {}, 36 | "source": [ 37 | "# PROJECT SUMMARY:\n", 38 | "## BUISNESSCASE & GOAL OF PROJECT: BASED ON GIVEN FEATURE OF DATASET WE NEED TO PREDICT THE PERFOMANCE RATING OF EMPLOYEE\n", 39 | "### INX Future Inc Employee Performance - Project\n", 40 | "The Data science project which is given here is an analysis of employee performance.\n", 41 | "\n", 42 | "\n", 43 | "**The Goal and Insights of the project are as follows:**\n", 44 | "\n", 45 | "* Department wise performances\n", 46 | "* Top 3 Important Factors effecting employee performance\n", 47 | "* A trained model which can predict the employee performance based on factors as inputs.\n", 48 | " This will be used to hire employees\n", 49 | "* Recommendations to improve the employee performance based on insights from analysis\n", 50 | "\n", 51 | "\n", 52 | "The given Employee dataset consist of 1200 rows. The features present in the data are 28 columns. The shape of the dataset is 1200x28. The 28 features\n", 53 | "are classified into quantitative and qualitative where 19 features are quantitative (11 columns consists numeric data & 8 columns consists ordinal data) and\n", 54 | "8 features are qualitative. EmpNumber consist alphanumerical data (distinct values) which doesn't play a role as a relevant feature for performance rating.\n", 55 | "\n", 56 | "From Correlation we can get the important aspects of the data, Correlation between features and Performance Rating.Correlation is a statistical measure\n", 57 | "that expresses the extent to which two variables are linearly related.The analysis of the project has gone through the stage of Univariate,Bivariate & Multivariate analysis,\n", 58 | "correlation analysis and analysis by each department to satisfy the project goal." 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "id": "3692d098", 64 | "metadata": {}, 65 | "source": [ 66 | "The dataset consists of Categorical data and Numerical data. 
The target variable consists of ordinal data, so this is a classification problem. The machine learning models used in this project are a support vector classifier, a random forest classifier and an artificial neural network [multilayer perceptron]. Of these models, the artificial neural network [multilayer perceptron] was judged the best, with a test accuracy of 95.80%." 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "id": "be3b3429", 72 | "metadata": {}, 73 | "source": [ 74 | "One of the important goals of this project is to find the features that most affect the performance rating. The important features were identified using the feature importance technique of the machine learning model. The main preprocessing technique is manual and frequency encoding, which converts string categorical data into numerical data, because most machine learning methods are based on numerical methods and do not support strings. Overall, the project achieved its goals by using machine learning models and visualization techniques.\n" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "id": "e8e1a884", 80 | "metadata": {}, 81 | "source": [ 82 | "# 1. Requirement\n", 83 | "The data for this project was provided by IABAC™. It is based on INX Future Inc (referred to as INX), one of the leading data analytics and automation solutions providers, with over 15 years of global business presence. INX has consistently been rated among the top 20 best employers for the past 5 years. The data is not from a real organization. The whole project was done in a Jupyter notebook using Python." 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "id": "24cdb08e", 89 | "metadata": {}, 90 | "source": [ 91 | "# 2. Analysis\n", 92 | "The data was analyzed by describing the features present in it. The features play the biggest part in the analysis: they describe the relationship between the dependent and independent variables.
Pandas also helps describe the dataset early in the project. The data in the dataset is divided into numerical and categorical data." 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "id": "de4b4406", 98 | "metadata": {}, 99 | "source": [ 100 | "## Categorical Features\n", 101 | "* EmpNumber\n", 102 | "* Gender\n", 103 | "* EducationBackground\n", 104 | "* MaritalStatus\n", 105 | "* EmpDepartment\n", 106 | "* EmpJobRole\n", 107 | "* BusinessTravelFrequency\n", 108 | "* OverTime\n", 109 | "* Attrition\n", 110 | "\n", 111 | "## Numerical Features\n", 112 | "* Age\n", 113 | "* DistanceFromHome\n", 114 | "* EmpHourlyRate\n", 115 | "* NumCompaniesWorked\n", 116 | "* EmpLastSalaryHikePercent\n", 117 | "* TotalWorkExperienceInYears\n", 118 | "* TrainingTimesLastYear\n", 119 | "* ExperienceYearsAtThisCompany\n", 120 | "* ExperienceYearsInCurrentRole\n", 121 | "* YearsSinceLastPromotion\n", 122 | "* YearsWithCurrManager\n", 123 | "\n", 124 | "## Ordinal Features\n", 125 | "* EmpEducationLevel\n", 126 | "* EmpEnvironmentSatisfaction\n", 127 | "* EmpJobInvolvement\n", 128 | "* EmpJobLevel\n", 129 | "* EmpJobSatisfaction\n", 130 | "* EmpRelationshipSatisfaction\n", 131 | "* EmpWorkLifeBalance\n", 132 | "* PerformanceRating" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "id": "f6150269", 138 | "metadata": {}, 139 | "source": [ 140 | "## 3. Univariate, Bivariate & Multivariate Analysis" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "id": "2c1d850b", 146 | "metadata": {}, 147 | "source": [ 148 | "* Libraries Used: Matplotlib & Seaborn\n", 149 | "* Plots Used: Histplot, Lineplot, CountPlot, Barplot\n", 150 | "* Tip: All observations and insights are written below the plots\n", 151 | "\n", 152 | "**Univariate Analysis:** In univariate analysis we get the unique labels of categorical features, as well as the range and density of numerical features.\n", 153 | "\n", 154 | "\n", 155 | "**Bivariate Analysis:** In bivariate
analysis we check each feature's relationship with the target variable.\n", 156 | "\n", 157 | "**Multivariate Analysis:** In multivariate analysis we check the relationship between two variables with respect to the target variable.\n", 158 | "\n", 159 | "\n", 160 | "#### CONCLUSION\n", 161 | "* Some features are positively correlated with the performance rating (target variable):\n", 162 | "[Emp Environment Satisfaction, Emp Last Salary Hike Percent, Emp Work Life Balance]" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "id": "969f5183", 168 | "metadata": {}, 169 | "source": [ 170 | "# 4. Exploratory Data Analysis" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "id": "de383208", 176 | "metadata": {}, 177 | "source": [ 178 | "### Basic Check & Statistical Measures\n", 179 | "* There is no constant column in either the numerical or the categorical data." 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "id": "a977e66c", 185 | "metadata": {}, 186 | "source": [ 187 | "## Distribution of Continuous Features:\n", 188 | "One of the first steps in exploring the data is to get a rough idea of how the features are distributed. To do so, we use the familiar distplot function from the Seaborn plotting library. The distributions were plotted for the numerical features; they give an overall idea of the density and of where the majority of the data lies.\n", 189 | "\n", 190 | "* Ages range from 18 to 60, with most employees between 30 and 40.\n", 191 | "* Employees have worked at up to 8 other companies, and most worked at up to 2 companies before joining this one.\n", 192 | "* The hourly rate is between 65 and 95 for the majority of employees.\n", 193 | "* In general, most employees have worked up to 5 years in this company.
Most employees get a salary hike of 11% to 15% in this company." 194 | ] 195 | }, 196 | { 197 | "cell_type": "markdown", 198 | "id": "b3122f2f", 199 | "metadata": {}, 200 | "source": [ 201 | "#### Check Skewness and Kurtosis of Numerical Features\n", 202 | "Checking whether the data is normally distributed using skewness and kurtosis:\n", 203 | "* The YearsSinceLastPromotion column is skewed\n", 204 | "* skewness for YearsSinceLastPromotion: 1.9724620367914252\n", 205 | "* kurtosis for YearsSinceLastPromotion: 3.5193552691799805" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "id": "7156cc4e", 211 | "metadata": {}, 212 | "source": [ 213 | "#### Distribution of Mean of Data\n", 214 | "* The distribution of feature means is close to a Gaussian distribution with mean value 9.5\n", 215 | "* Around 80% of feature means lie between 8.5 and 10.5\n", 216 | "\n", 217 | "#### Distribution of Standard Deviation of Data\n", 218 | "The distribution of feature standard deviations also looks roughly Gaussian: around 30% of features have a standard deviation in the range 3 to 20, and the remaining 70% have a standard deviation between 0 and 2" 219 | ] 220 | }, 221 | { 222 | "cell_type": "markdown", 223 | "id": "2182aa94", 224 | "metadata": {}, 225 | "source": [ 226 | "# 5. Data Pre-Processing\n", 227 | "**1.Check Missing Values:** There are no missing values in the data\n", 228 | "\n", 229 | "**2.Categorical Data Conversion:** Handle categorical data with frequency and manual encoding, because the features contain many labels\n", 230 | "\n", 231 | "* Manual Encoding: Manual encoding is a good technique for handling a categorical feature with the map function, mapping the labels based on their frequency.\n", 232 | "\n", 233 | "* Frequency Encoding: Frequency encoding transforms a categorical variable into a numerical variable by replacing each label with its frequency, obtained from the value counts.\n", 234 | "\n", 235 |
"**3.Outlier Handling** Some features are contain outliers so we are impute this outlier with the help of IQR because in all features data is not normally distributed\n", 236 | "\n", 237 | "**4.Feature Transformation:** In YearsSinceLastPromotion some skewed & kurtosis is present, so we are use Square Root Transformation techinque\n", 238 | "\n", 239 | "* Square root transformation: Square root transformation is one of the many types of standard transformations.This transformation is used for count data (data that follow a Poisson distribution) or small whole numbers. Each data point is replaced by its square root. Negative data is converted to positive by adding a constant, and then transformed.\n", 240 | "* Q-Q Plot: Q–Q plot is a probability plot, a graphical method for comparing two probability distributions by plotting their quantiles against each other.\n", 241 | "\n", 242 | "**5.Scaling The Data:** scaling the data with the help of Standard scalar\n", 243 | "* Standard Scaling: Standardization is the process of scaling the feature, it assumes the feature follow normal distribution and scale the feature between mean and standard deviation, here mean is 0 and standard deviation is always 1.\n", 244 | "\n" 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "id": "4b2be3a7", 250 | "metadata": {}, 251 | "source": [ 252 | "# 6.Future Selection\n", 253 | "**1.Drop unique and constant feature:** Dropping employee number because this is a constant column as well as drop Years Since Last Promotion because we create a new feaure using square root transformation\n", 254 | "\n", 255 | "**2.Checking Correlation:** Checking correlation with the help of heat map, and get the their is no highly correlated feature is present.\n", 256 | "* Heatmap: A heatmap is a graphical representation of data that uses a system of color-coding to represent different values.\n", 257 | "\n", 258 | "**3.Check Duplicates:** In this data is no dupicates is present.\n", 259 | "\n", 260 
| "**4.PCA:** Use PCA to reduce the dimension of the data. The data contains 27 features in total after dropping the unique and constant columns; PCA shows that keeping 25 components loses little variance, so we select 25 features.\n", 261 | "\n", 262 | "* Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and enabling the visualization of multidimensional data. Formally, PCA is a statistical technique for reducing the dimensionality of a dataset.\n", 263 | "\n", 264 | "**5.Saving Pre-Processed Data:** Save all the preprocessed data in a new file and add the target feature to it." 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "id": "824dbf45", 270 | "metadata": {}, 271 | "source": [ 272 | "# 7. Machine Learning Model Creation & Evaluation\n", 273 | "**1.Define Dependent and Independent Features:**\n", 274 | "\n", 275 | "**2.Balancing the data:** The data is imbalanced, so we need to balance it using SMOTE\n", 276 | "* SMOTE: SMOTE (synthetic minority oversampling technique) is one of the most commonly used oversampling methods to solve the imbalance problem. It balances the class distribution by adding minority-class examples. Rather than replicating existing examples, SMOTE synthesises new minority instances by interpolating between existing minority instances.\n", 277 | "\n", 278 | "**3.Splitting Training And Testing Data:** 80% of the data is used for training and 20% for testing\n", 279 | "\n", 280 | "\n", 281 | "### Algorithm:\n", 282 | "\n", 283 | "**AIM:** Create a sweet spot model (Low bias, Low variance)\n", 284 | "\n", 285 | "**HERE WE WILL BE EXPERIMENTING WITH THREE ALGORITHMS**\n", 286 | "1. Support Vector Machine\n", 287 | "2. Random Forest\n", 288 | "3.
Artificial Neural Network [MLP Classifier]\n", 289 | "\n", 290 | "* The support vector machine performs well on training data with 96.61% accuracy, but its test score is 94.66%; after hyperparameter tuning the score is 98.28%, which suggests the model is overfitting.\n", 291 | "* The random forest performs very well on training data with 100% accuracy but scores 95.61% in testing; after hyperparameter tuning the testing score decreases.\n", 292 | "* The artificial neural network [multilayer perceptron] performs very well on training data with 98.95% accuracy, and its testing score is 95.80%.\n", 293 | "* So we select the artificial neural network [multilayer perceptron] model." 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "id": "8177ae30", 299 | "metadata": {}, 300 | "source": [ 301 | "# 8. Saving Model\n", 302 | "* Save the model using a pickle file" 303 | ] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "id": "2fef7306", 308 | "metadata": {}, 309 | "source": [ 310 | "# Tools and Libraries Used:\n", 311 | "\n", 312 | "### Tools: \n", 313 | "* Jupyter\n", 314 | "\n", 315 | "### Libraries Used: \n", 316 | "* Pandas\n", 317 | "* Numpy\n", 318 | "* Matplotlib\n", 319 | "* Seaborn\n", 320 | "* pylab\n", 321 | "* Scipy\n", 322 | "* Sklearn\n", 323 | "* Pickle" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "id": "868abf38", 329 | "metadata": {}, 330 | "source": [ 331 | "# Goal 1: Department Wise Performances\n", 332 | "**PLOTS USED**\n", 333 | "* Violinplot: It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared.\n", 334 | "* CountPlot: A countplot shows the counts of observations in each categorical bin using bars.\n", 335 | "\n", 336 | "\n", 337 | "\n", 338 | "**Sales:** Performance rating level 3 is the most common in the sales department.
Male performance ratings are a little higher than female ratings.\n", 339 | "\n", 340 | "**Human Resources:** The majority of employees are at performance level 3. Older employees perform lower in this department. Female employees in the HR department are performing really well.\n", 341 | "\n", 342 | "**Development:** Most employees are level 3 performers. Employees of all ages perform at level 3 only. Performance is nearly the same for both genders.\n", 343 | "\n", 344 | "**Data Science:** The highest average of level 3 performance is in the data science department. Data science is the only department with a low number of level 2 performers. Its overall performance is higher than that of all other departments. Male employees are doing well in this department.\n", 345 | "\n", 346 | "**Research & Development:** Age does not affect the level of performance here: employees of different ages are found at every performance level. Female employees in R&D perform well.\n", 347 | "\n", 348 | "**Finance:** Performance in the finance department decreases sharply as age increases. Male employees are doing well. Experience is inversely related to the performance level." 349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "id": "e7d7e250", 354 | "metadata": {}, 355 | "source": [ 356 | "# Goal 2: Top 3 Important Factors Affecting Employee Performance\n", 357 | "The top three features affecting the performance rating, in order of importance, are:\n", 358 | "1. Employee Environment Satisfaction\n", 359 | "2. Employee Salary Hike Percentage\n", 360 | "3.
Experience Years In Current Role\n", 361 | " \n", 362 | "**Employee environment satisfaction:** The largest numbers of employee performance ratings fall in EmpEnvironmentSatisfaction level 3 and level 4, with 367 and 361 employees respectively.\n", 363 | "\n", 364 | "**Employee last salary hike percent:** Employees whose salary hike percentage is 11-19% most often receive a performance rating of 2 or 3, while employees whose salary hike percentage is 20-22% have a performance rating of 4.\n", 365 | "\n", 366 | "**Employee work life balance:** In EmpWorkLifeBalance, level 3 shows the highest employee performance ratings\n" 367 | ] 368 | }, 369 | { 370 | "cell_type": "markdown", 371 | "id": "81c1f16c", 372 | "metadata": {}, 373 | "source": [ 374 | "# Goal 3: A Trained model which can predict the employee performance\n", 375 | "The trained models were created using the following machine learning algorithms, listed with their accuracy scores:\n", 376 | "\n", 377 | "1. Support Vector Classifier: 96.76% accuracy\n", 378 | "2. Artificial Neural Network [Multilayer perceptron]: 95.80% accuracy\n", 379 | "3. Random Forest classifier: 95.61% accuracy" 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "id": "54b1f286", 385 | "metadata": {}, 386 | "source": [ 387 | "The support vector machine performs well on training data with 96.80% accuracy, but its test score is 94.09%; after hyperparameter tuning the score is 96.73%, which suggests the model is overfitting.\n", 388 | "\n", 389 | "The artificial neural network [multilayer perceptron] performs very well on training data with 99.28% accuracy, and its testing score is 95.61%.\n", 390 | "\n", 391 | "The random forest performs very well on training data with 100% accuracy but scores 93.90% in testing; after hyperparameter tuning the testing score decreases.\n", 392 | "\n", 393 | "So we select the artificial neural network [multilayer perceptron] model."
394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "id": "57c36b01", 399 | "metadata": {}, 400 | "source": [ 401 | "# Goal 4: Recommendations to improve the employee performance\n", 402 | "1. Overall employee performance can be improved through employee environment satisfaction. The company needs to focus more on it.\n", 403 | "2. A salary hike will give employees a boost to perform well.\n", 404 | "3. Promote employees every 6 months\n", 405 | "4. Improve employees' work-life balance, as this affects the performance rating.\n", 406 | "5. When recruiting for HR, consider female candidates, as they perform well compared to male employees.\n", 407 | "6. The development and sales departments have higher overall performance than the rest of the departments.\n", 408 | "Some employees who report Low or Medium job satisfaction and relationship satisfaction nevertheless deliver Excellent performance in larger numbers, so the company should focus on them." 409 | ] 410 | } 411 | ], 412 | "metadata": { 413 | "kernelspec": { 414 | "display_name": "Python 3 (ipykernel)", 415 | "language": "python", 416 | "name": "python3" 417 | }, 418 | "language_info": { 419 | "codemirror_mode": { 420 | "name": "ipython", 421 | "version": 3 422 | }, 423 | "file_extension": ".py", 424 | "mimetype": "text/x-python", 425 | "name": "python", 426 | "nbconvert_exporter": "python", 427 | "pygments_lexer": "ipython3", 428 | "version": "3.9.12" 429 | } 430 | }, 431 | "nbformat": 4, 432 | "nbformat_minor": 5 433 | } 434 | --------------------------------------------------------------------------------
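The preprocessing steps summarised in the notebook (frequency encoding, IQR-based outlier handling, the square root transformation of a skewed column, and standard scaling) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the notebook's actual code, which uses Pandas and Sklearn. The function names, the sample values, and the choice to cap outliers at the IQR fences (the notebook only says outliers are imputed using the IQR) are all assumptions made for the example.

```python
import math
import statistics
from collections import Counter


def frequency_encode(labels):
    """Replace each label with its relative frequency in the column."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[label] / total for label in labels]


def iqr_cap(values, k=1.5):
    """Clip values to the fences [Q1 - k*IQR, Q3 + k*IQR].

    Capping is an assumption; other imputation choices (e.g. the median)
    would follow the same pattern.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


def skewness(values):
    """Population skewness: mean of cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    # Hypothetical right-skewed column, standing in for YearsSinceLastPromotion.
    years = [0, 0, 1, 1, 1, 2, 2, 3, 4, 7, 15]
    transformed = [math.sqrt(v) for v in years]

    print("skewness before sqrt:", skewness(years))
    print("skewness after sqrt: ", skewness(transformed))
    print("capped:", iqr_cap(years))
    print("scaled:", standardize(transformed))
    print("encoded:", frequency_encode(["Sales", "Sales", "HR", "Sales"]))
```

On a right-skewed count column like the sample above, the square root transformation pulls the long tail in, which is the effect the notebook relies on for YearsSinceLastPromotion.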