├── effort_estimation_data.xlsx
├── README.md
└── effort_estimation.py

/effort_estimation_data.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Otutu11/Effort-Estimation-Model/HEAD/effort_estimation_data.xlsx
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# Effort Estimation Model

A machine learning model for predicting project effort based on various project characteristics, using synthetic data.

## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Data Generation](#data-generation)
- [Model Details](#model-details)
- [Evaluation Metrics](#evaluation-metrics)
- [Example](#example)

## Overview

This project implements a Random Forest regression model to estimate project effort (in person-days) from synthetic project data. The model accounts for factors such as project size, team experience, technical complexity, and development methodology.

## Features

- Synthetic data generation with customizable parameters
- Exploratory data analysis visualizations
- Random Forest regression model
- Feature importance analysis
- Model evaluation metrics
- Example prediction capability

## Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/Akajiaku1/effort-estimation-model.git
   cd effort-estimation-model
   ```

2. Create and activate a virtual environment (recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
   ```

3. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

   Or install them manually:

   ```bash
   pip install numpy pandas scikit-learn matplotlib seaborn
   ```

## Usage

Run the main script:

```bash
python effort_estimation.py
```

This will:

- Generate synthetic project data
- Train the effort estimation model
- Evaluate the model's performance
- Show feature importance
- Make an example prediction

## Data Generation

The synthetic data includes these features:

| Feature | Description | Range/Values |
| --- | --- | --- |
| `project_size` | Project size in function/story points | Log-normal distribution |
| `team_experience` | Average team experience in years | 1-10 years |
| `requirements_volatility` | Requirements stability | 1-5 scale |
| `technical_complexity` | Technical difficulty | 1-5 scale |
| `team_size` | Number of team members | 2-10 people |
| `methodology` | Development methodology | 1=Waterfall, 2=Agile, 3=Hybrid |
| `actual_effort` | Actual effort in person-days | 20-200 days |

## Model Details

- **Algorithm:** Random Forest Regressor
- **Hyperparameters:** `n_estimators=100`, `random_state=42`
- **Input features:** all features except `actual_effort`
- **Target variable:** `actual_effort`

## Evaluation Metrics

The model is evaluated using:

- **Mean Absolute Error (MAE):** the average absolute difference between predictions and actual values
- **R-squared (R²):** the proportion of variance in the target variable that the model explains
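
Both metrics are computed with scikit-learn's `mean_absolute_error` and `r2_score`, as in `effort_estimation.py`. A minimal, self-contained illustration (the values below are made up for the example):

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Illustrative values only; the real arrays come from the train/test split.
y_true = [100.0, 150.0, 80.0]
y_pred = [110.0, 140.0, 85.0]

print(mean_absolute_error(y_true, y_pred))  # about 8.33 person-days
print(r2_score(y_true, y_pred))             # about 0.91
```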
Typical performance on synthetic data:

- MAE: ~8-12 person-days
- R²: ~0.85-0.95
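
## Example

The snippet below mirrors the final prediction step of `effort_estimation.py` and assumes the script's trained `model` is in scope. The dummy columns (`methodology_2`, `methodology_3`) must match the columns produced by `pd.get_dummies` during training.

```python
import pandas as pd

sample_project = pd.DataFrame({
    'project_size': [75],
    'team_experience': [4.5],
    'requirements_volatility': [3],
    'technical_complexity': [4],
    'team_size': [5],
    'methodology_2': [1],  # agile
    'methodology_3': [0]   # not hybrid
})

predicted_effort = model.predict(sample_project)
print(f"Estimated effort: {predicted_effort[0]:.1f} person-days")
```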

## Author

Anslem Otutu
GitHub: https://github.com/Otutu11

--------------------------------------------------------------------------------
/effort_estimation.py:
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data
def generate_synthetic_data(n_samples=1000):
    """
    Generate synthetic project data for effort estimation.
    Returns a DataFrame with project features and actual effort.
    """
    # Project size (in function points or story points)
    size = np.random.lognormal(mean=3, sigma=0.5, size=n_samples).round(1)

    # Team experience (years), clipped to 1-10
    team_exp = np.random.normal(loc=5, scale=2, size=n_samples).round(1)
    team_exp = np.clip(team_exp, 1, 10)

    # Requirements volatility (scale 1-5)
    req_volatility = np.random.randint(1, 6, size=n_samples)

    # Technical complexity (scale 1-5)
    tech_complexity = np.random.randint(1, 6, size=n_samples)

    # Number of team members
    team_size = np.random.randint(2, 11, size=n_samples)

    # Development methodology (1=waterfall, 2=agile, 3=hybrid)
    methodology = np.random.choice([1, 2, 3], size=n_samples, p=[0.3, 0.5, 0.2])

    # Generate actual effort as a linear combination of the features
    base_effort = (size * 0.5 +
                   team_size * 2 +
                   tech_complexity * 5 -
                   team_exp * 0.8 +
                   req_volatility * 3)

    # Apply a methodology factor (waterfall costs more, agile less)
    methodology_factor = np.where(methodology == 1, 1.1,
                                  np.where(methodology == 2, 0.9, 1.0))
    base_effort = base_effort * methodology_factor

    # Add Gaussian noise and clip to the 20-200 person-day range
    noise = np.random.normal(0, 10, size=n_samples)
    actual_effort = np.clip(base_effort + noise, 20, 200).round(1)
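
    # Illustrative arithmetic for the formula above, for a hypothetical project
    # with size=50, team_size=5, tech_complexity=3, team_exp=5, req_volatility=3:
    #   base = 50*0.5 + 5*2 + 3*5 - 5*0.8 + 3*3 = 25 + 10 + 15 - 4 + 9 = 55
    # With the agile factor applied: 55 * 0.9 = 49.5 person-days, before noise
    # and clipping.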

    # Create DataFrame
    data = pd.DataFrame({
        'project_size': size,
        'team_experience': team_exp,
        'requirements_volatility': req_volatility,
        'technical_complexity': tech_complexity,
        'team_size': team_size,
        'methodology': methodology,
        'actual_effort': actual_effort
    })

    return data

# Generate the synthetic dataset
project_data = generate_synthetic_data(1500)

# Explore the data
print(project_data.head())
print("\nData Description:")
print(project_data.describe())

# Visualize relationships (sns.pairplot creates its own figure, so no
# plt.figure() call is needed beforehand)
sns.pairplot(project_data,
             vars=['project_size', 'team_experience', 'technical_complexity', 'actual_effort'],
             hue='methodology',
             plot_kws={'alpha': 0.6})
plt.suptitle("Feature Relationships", y=1.02)
plt.show()

# Preprocess data: convert methodology to dummy variables
data_processed = pd.get_dummies(project_data, columns=['methodology'], drop_first=True)

# Split data into features and target
X = data_processed.drop('actual_effort', axis=1)
y = data_processed['actual_effort']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate model
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\nModel Performance:")
print(f"Mean Absolute Error: {mae:.2f} person-days")
print(f"R-squared: {r2:.2f}")

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nFeature Importance:")
print(feature_importance)

# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_importance)
plt.title('Feature Importance for Effort Estimation')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
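
# Note: impurity-based importances from tree ensembles can favor features with
# many distinct values. Permutation importance on the held-out test set is a
# common cross-check (a sketch; uncomment to use):
# from sklearn.inspection import permutation_importance
# perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
# print(pd.Series(perm.importances_mean, index=X.columns).sort_values(ascending=False))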

# Example prediction
# Note: column names and order must match the training features produced by
# pd.get_dummies above.
sample_project = pd.DataFrame({
    'project_size': [75],
    'team_experience': [4.5],
    'requirements_volatility': [3],
    'technical_complexity': [4],
    'team_size': [5],
    'methodology_2': [1],  # agile
    'methodology_3': [0]   # not hybrid
})

predicted_effort = model.predict(sample_project)
print(f"\nExample Project Effort Prediction: {predicted_effort[0]:.1f} person-days")

# Save model (uncomment to use)
# import joblib
# joblib.dump(model, 'effort_estimation_model.pkl')
--------------------------------------------------------------------------------