├── effort_estimation_data.xlsx
├── README.md
└── effort_estimation.py

/effort_estimation_data.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Otutu11/Effort-Estimation-Model/HEAD/effort_estimation_data.xlsx
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# Effort Estimation Model

A machine learning model for predicting project effort based on various project characteristics, using synthetic data.

## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Data Generation](#data-generation)
- [Model Details](#model-details)
- [Evaluation Metrics](#evaluation-metrics)
- [Example](#example)

## Overview

This project implements a Random Forest regression model to estimate project effort (in person-days) from synthetic project data. The model accounts for factors such as project size, team experience, technical complexity, and development methodology.

## Features

- Synthetic data generation with customizable parameters
- Exploratory data analysis visualizations
- Random Forest regression model
- Feature importance analysis
- Model evaluation metrics
- Example prediction capability

## Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/Akajiaku1/effort-estimation-model.git
   cd effort-estimation-model
   ```

2. Create and activate a virtual environment (recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
   ```

3. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

   Or install them manually:

   ```bash
   pip install numpy pandas scikit-learn matplotlib seaborn
   ```

## Usage

Run the main script:

```bash
python effort_estimation.py
```

This will:

- Generate synthetic project data
- Train the effort estimation model
- Evaluate the model's performance
- Show feature importance
- Make an example prediction

## Data Generation

The synthetic data includes these features:

| Feature | Description | Range/Values |
| --- | --- | --- |
| `project_size` | Project size in function/story points | Log-normal distribution |
| `team_experience` | Average team experience in years | 1-10 years |
| `requirements_volatility` | Requirements stability | 1-5 scale |
| `technical_complexity` | Technical difficulty | 1-5 scale |
| `team_size` | Number of team members | 2-10 people |
| `methodology` | Development methodology | 1=Waterfall, 2=Agile, 3=Hybrid |
| `actual_effort` | Actual effort in person-days | 20-200 days |

## Model Details

- **Algorithm:** Random Forest Regressor
- **Hyperparameters:** `n_estimators=100`, `random_state=42`
- **Input features:** all features except `actual_effort`
- **Target variable:** `actual_effort`

## Evaluation Metrics

The model is evaluated using:

- **Mean Absolute Error (MAE):** the average absolute difference between predictions and actual values
- **R-squared (R²):** the proportion of variance in the target variable that the model explains
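
Both metrics are computed with scikit-learn's `mean_absolute_error` and `r2_score`, as in `effort_estimation.py`. A minimal, self-contained illustration (the values below are made up for the example):

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Illustrative values only; the real arrays come from the train/test split.
y_true = [100.0, 150.0, 80.0]
y_pred = [110.0, 140.0, 85.0]

print(mean_absolute_error(y_true, y_pred))  # about 8.33 person-days
print(r2_score(y_true, y_pred))             # about 0.91
```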
Typical performance on synthetic data:

- MAE: ~8-12 person-days
- R²: ~0.85-0.95
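
## Example

The snippet below mirrors the final prediction step of `effort_estimation.py` and assumes the script's trained `model` is in scope. The dummy columns (`methodology_2`, `methodology_3`) must match the columns produced by `pd.get_dummies` during training.

```python
import pandas as pd

sample_project = pd.DataFrame({
    'project_size': [75],
    'team_experience': [4.5],
    'requirements_volatility': [3],
    'technical_complexity': [4],
    'team_size': [5],
    'methodology_2': [1],  # agile
    'methodology_3': [0]   # not hybrid
})

predicted_effort = model.predict(sample_project)
print(f"Estimated effort: {predicted_effort[0]:.1f} person-days")
```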

## Author

Anslem Otutu
GitHub: https://github.com/Otutu11

--------------------------------------------------------------------------------
/effort_estimation.py:
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data
def generate_synthetic_data(n_samples=1000):
    """
    Generate synthetic project data for effort estimation.
    Returns a DataFrame with project features and actual effort.
    """
    # Project size (in function points or story points)
    size = np.random.lognormal(mean=3, sigma=0.5, size=n_samples).round(1)

    # Team experience (years), clipped to 1-10
    team_exp = np.random.normal(loc=5, scale=2, size=n_samples).round(1)
    team_exp = np.clip(team_exp, 1, 10)

    # Requirements volatility (scale 1-5)
    req_volatility = np.random.randint(1, 6, size=n_samples)

    # Technical complexity (scale 1-5)
    tech_complexity = np.random.randint(1, 6, size=n_samples)

    # Number of team members
    team_size = np.random.randint(2, 11, size=n_samples)

    # Development methodology (1=waterfall, 2=agile, 3=hybrid)
    methodology = np.random.choice([1, 2, 3], size=n_samples, p=[0.3, 0.5, 0.2])

    # Generate actual effort as a linear combination of the features
    base_effort = (size * 0.5 +
                   team_size * 2 +
                   tech_complexity * 5 -
                   team_exp * 0.8 +
                   req_volatility * 3)

    # Apply a methodology factor (waterfall costs more, agile less)
    methodology_factor = np.where(methodology == 1, 1.1,
                                  np.where(methodology == 2, 0.9, 1.0))
    base_effort = base_effort * methodology_factor

    # Add Gaussian noise and clip to the 20-200 person-day range
    noise = np.random.normal(0, 10, size=n_samples)
    actual_effort = np.clip(base_effort + noise, 20, 200).round(1)
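
    # Illustrative arithmetic for the formula above, for a hypothetical project
    # with size=50, team_size=5, tech_complexity=3, team_exp=5, req_volatility=3:
    #   base = 50*0.5 + 5*2 + 3*5 - 5*0.8 + 3*3 = 25 + 10 + 15 - 4 + 9 = 55
    # With the agile factor applied: 55 * 0.9 = 49.5 person-days, before noise
    # and clipping.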

    # Create DataFrame
    data = pd.DataFrame({
        'project_size': size,
        'team_experience': team_exp,
        'requirements_volatility': req_volatility,
        'technical_complexity': tech_complexity,
        'team_size': team_size,
        'methodology': methodology,
        'actual_effort': actual_effort
    })

    return data

# Generate the synthetic dataset
project_data = generate_synthetic_data(1500)

# Explore the data
print(project_data.head())
print("\nData Description:")
print(project_data.describe())

# Visualize relationships (sns.pairplot creates its own figure, so no
# plt.figure() call is needed beforehand)
sns.pairplot(project_data,
             vars=['project_size', 'team_experience', 'technical_complexity', 'actual_effort'],
             hue='methodology',
             plot_kws={'alpha': 0.6})
plt.suptitle("Feature Relationships", y=1.02)
plt.show()

# Preprocess data: convert methodology to dummy variables
data_processed = pd.get_dummies(project_data, columns=['methodology'], drop_first=True)

# Split data into features and target
X = data_processed.drop('actual_effort', axis=1)
y = data_processed['actual_effort']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate model
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\nModel Performance:")
print(f"Mean Absolute Error: {mae:.2f} person-days")
print(f"R-squared: {r2:.2f}")

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nFeature Importance:")
print(feature_importance)

# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_importance)
plt.title('Feature Importance for Effort Estimation')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
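
# Note: impurity-based importances from tree ensembles can favor features with
# many distinct values. Permutation importance on the held-out test set is a
# common cross-check (a sketch; uncomment to use):
# from sklearn.inspection import permutation_importance
# perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
# print(pd.Series(perm.importances_mean, index=X.columns).sort_values(ascending=False))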

# Example prediction
# Note: column names and order must match the training features produced by
# pd.get_dummies above.
sample_project = pd.DataFrame({
    'project_size': [75],
    'team_experience': [4.5],
    'requirements_volatility': [3],
    'technical_complexity': [4],
    'team_size': [5],
    'methodology_2': [1],  # agile
    'methodology_3': [0]   # not hybrid
})

predicted_effort = model.predict(sample_project)
print(f"\nExample Project Effort Prediction: {predicted_effort[0]:.1f} person-days")

# Save model (uncomment to use)
# import joblib
# joblib.dump(model, 'effort_estimation_model.pkl')
--------------------------------------------------------------------------------