├── LICENSE ├── Machine Learning Foundations ├── 02 Unsupervised Learning │ ├── 02 Dimensionality Reduction │ │ ├── 02 t-Distributed Stochastic Neighbor Embedding (t-SNE) │ │ │ └── tSNE.py │ │ ├── 03 Linear Discriminant Analysis (LDA) │ │ │ └── LDA.py │ │ └── 01 Principal Component Analysis (PCA) │ │ │ └── PCA.py │ ├── 01 Clustering │ │ ├── 04 Mean Shift │ │ │ └── MeanShift.py │ │ ├── 01 K-Means Clustering │ │ │ └── KMeansClustering.py │ │ ├── 03 DBSCAN │ │ │ └── DBSCAN.py │ │ └── 02 Hierarchical Clustering │ │ │ └── HierarchicalClustering.py │ └── 03 Association Rules │ │ ├── 02 FP-Growth │ │ └── FPGrowth.py │ │ └── 01 Apriori Algorithm │ │ └── AprioriAlgorithm.py ├── 03 ML Pipelines │ ├── 03 Model Selection │ │ ├── 02 K-Fold Cross-Validation │ │ │ └── KFoldCrossValidation.py │ │ ├── 04 Grid Search │ │ │ └── GridSearch.py │ │ ├── 01 Train-Test Split │ │ │ └── TrainTestSplit.py │ │ ├── 05 Random Search │ │ │ └── RandomSearch.py │ │ └── 03 Stratified K-Fold │ │ │ └── StratifiedKFold.py │ ├── 05 Deployment │ │ ├── 01 Model Serialization │ │ │ └── ModelSerialization.py │ │ └── 02 API Integration │ │ │ └── APIIntegration.py │ ├── 01 Data Preprocessing │ │ ├── 05 Outlier Detection │ │ │ └── OutlierDetection.py │ │ ├── 03 Standardization │ │ │ └── Standardization.py │ │ ├── 01 Feature Scaling │ │ │ └── FeatureScaling.py │ │ ├── 06 Encoding Categorical Variables │ │ │ └── EncodingCategoricalVariables.py │ │ ├── 02 Normalization │ │ │ └── Normalization.py │ │ └── 04 Handling Missing Values │ │ │ └── HandlingMissingValues.py │ ├── 04 Model Evaluation │ │ ├── 01 Bias-Variance Tradeoff │ │ │ └── BiasVarianceTradeoff.py │ │ ├── 02 Overfitting │ │ │ └── Overfitting.py │ │ └── 03 Underfitting │ │ │ └── Underfitting.py │ └── 02 Feature Engineering │ │ ├── 01 Feature Selection │ │ └── FeatureSelection.py │ │ ├── 02 Polynomial Features │ │ └── PolynomialFeatures.py │ │ ├── 04 Binning │ │ └── Binning.py │ │ └── 03 Interaction Terms │ │ └── InteractionTerms.py ├── 01 Supervised Learning │ ├── 03 Evaluation Metrics │ │ ├── 01 Regression Metrics │ │ │ └── RegressionMetrics.py │ │ └── 02 Classification Metrics │ │ │ └── ClassificationMetrics.py │ ├── 01 Regression │ │ ├── 04 Lasso Regression │ │ │ └── LassoRegression.py │ │ ├── 03 Ridge Regression │ │ │ └── RidgeRegression.py │ │ ├── 02 Polynomial Regression │ │ │ └── PolynomialRegression.py │ │ └── 01 Linear Regression │ │ │ └── LinearRegression.py │ └── 02 Classification │ │ ├── 04 Naive Bayes │ │ └── NaiveBayes.py │ │ ├── 05 K-Nearest Neighbors (KNN) │ │ └── KNN.py │ │ ├── 02 Decision Trees │ │ └── DecisionTrees.py │ │ ├── 01 Logistic Regression │ │ └── LogisticRegression.py │ │ ├── 06 Support Vector Machines (SVM) │ │ └── SVM.py │ │ └── 03 Random Forest │ │ └── RandomForest.py └── 04 Ensemble Methods │ ├── 01 Bagging │ ├── 02 Random Forest │ │ └── RandomForest.py │ └── 01 Bootstrap Aggregating │ │ └── BootstrapAggregating.py │ └── 02 Boosting │ ├── 01 AdaBoost │ └── AdaBoost.py │ └── 02 Gradient Boosting │ └── GradientBoosting.py ├── README.md └── Machine Learning Interview Questions with Ans └── README.md /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 rohanmistry231 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, 
sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /Machine Learning Foundations/02 Unsupervised Learning/02 Dimensionality Reduction/02 t-Distributed Stochastic Neighbor Embedding (t-SNE)/tSNE.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import load_iris 5 | from sklearn.manifold import TSNE 6 | from sklearn.preprocessing import StandardScaler 7 | import seaborn as sns 8 | 9 | # t-Distributed Stochastic Neighbor Embedding (t-SNE) 10 | # This script demonstrates t-SNE for dimensionality reduction on the Iris dataset. 11 | 12 | # Tasks: 13 | # 1. Load the Iris dataset. 14 | # 2. Standardize the features. 15 | # 3. Apply t-SNE to reduce to 2 dimensions. 16 | # 4. Visualize the reduced data. 17 | 18 | # Step 1: Load data 19 | iris = load_iris() 20 | X = iris.data 21 | y = iris.target 22 | data = pd.DataFrame(X, columns=iris.feature_names) 23 | data['Target'] = y 24 | 25 | # Step 2: Standardize features 26 | scaler = StandardScaler() 27 | X_scaled = scaler.fit_transform(X) 28 | 29 | # Step 3: Apply t-SNE 30 | tsne = TSNE(n_components=2, random_state=42) 31 | X_tsne = tsne.fit_transform(X_scaled) 32 | 33 | # Step 4: Visualize reduced data 34 | plt.figure(figsize=(10, 6)) 35 | sns.scatterplot(x=X_tsne[:, 0], y=X_tsne[:, 1], hue=iris.target_names[y], palette='viridis', s=100) 36 | plt.xlabel('t-SNE Component 1') 37 | plt.ylabel('t-SNE Component 2') 38 | plt.title('t-SNE: Iris Dataset Reduced to 2D') 39 | plt.grid(True) 40 | plt.savefig('tsne.png') 41 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/02 Unsupervised Learning/02 Dimensionality Reduction/03 Linear Discriminant Analysis (LDA)/LDA.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import load_iris 5 | from sklearn.discriminant_analysis import LinearDiscriminantAnalysis 6 | from sklearn.preprocessing import StandardScaler 7 | import seaborn as sns 8 | 9 | # Linear Discriminant Analysis (LDA) 10 | # This script demonstrates LDA for supervised dimensionality reduction on the Iris dataset. 11 | 12 | # Tasks: 13 | # 1. Load the Iris dataset. 14 | # 2. Standardize the features. 15 | # 3. Apply LDA to reduce to 2 dimensions (since we have 3 classes). 16 | # 4. Visualize the reduced data. 
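# Aside: LDA is supervised and yields at most min(n_classes - 1, n_features) components,
# which is why 2 is the ceiling for the 3-class Iris data. A quick illustrative check using
# the imports above (underscore names are throwaway placeholders, not part of this script):
_iris_check = load_iris()
_max_lda_components = min(len(np.unique(_iris_check.target)) - 1, _iris_check.data.shape[1])
print(f'Maximum LDA components for Iris: {_max_lda_components}')  # expected: 2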
17 | 18 | # Step 1: Load data 19 | iris = load_iris() 20 | X = iris.data 21 | y = iris.target 22 | data = pd.DataFrame(X, columns=iris.feature_names) 23 | data['Target'] = y 24 | 25 | # Step 2: Standardize features 26 | scaler = StandardScaler() 27 | X_scaled = scaler.fit_transform(X) 28 | 29 | # Step 3: Apply LDA 30 | lda = LinearDiscriminantAnalysis(n_components=2) 31 | X_lda = lda.fit_transform(X_scaled, y) 32 | 33 | # Step 4: Visualize reduced data 34 | plt.figure(figsize=(10, 6)) 35 | sns.scatterplot(x=X_lda[:, 0], y=X_lda[:, 1], hue=iris.target_names[y], palette='viridis', s=100) 36 | plt.xlabel('LDA Component 1') 37 | plt.ylabel('LDA Component 2') 38 | plt.title('LDA: Iris Dataset Reduced to 2D') 39 | plt.grid(True) 40 | plt.savefig('lda.png') 41 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/02 Unsupervised Learning/01 Clustering/04 Mean Shift/MeanShift.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import make_blobs 5 | from sklearn.cluster import MeanShift 6 | from sklearn.metrics import silhouette_score 7 | import seaborn as sns 8 | 9 | # Mean Shift Clustering 10 | # This script demonstrates Mean Shift clustering. 11 | 12 | # Tasks: 13 | # 1. Generate synthetic clustering data. 14 | # 2. Apply Mean Shift clustering. 15 | # 3. Evaluate clustering performance using silhouette score. 16 | # 4. Visualize clusters. 17 | 18 | # Step 1: Generate synthetic data 19 | np.random.seed(42) 20 | X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42) 21 | 22 | # Convert to DataFrame 23 | data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2']) 24 | 25 | # Step 2: Apply Mean Shift clustering 26 | mean_shift = MeanShift() 27 | labels = mean_shift.fit_predict(X) 28 | cluster_centers = mean_shift.cluster_centers_ 29 | 30 | # Step 3: Evaluate performance 31 | silhouette = silhouette_score(X, labels) 32 | print(f'Silhouette Score: {silhouette:.2f}') 33 | 34 | # Step 4: Visualize clusters 35 | plt.figure(figsize=(10, 6)) 36 | sns.scatterplot(x=data['Feature_1'], y=data['Feature_2'], hue=labels, palette='viridis', s=100) 37 | plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], c='red', marker='x', s=200, linewidths=3, label='Cluster Centers') 38 | plt.xlabel('Feature 1') 39 | plt.ylabel('Feature 2') 40 | plt.title('Mean Shift Clustering') 41 | plt.legend() 42 | plt.grid(True) 43 | plt.savefig('mean_shift_clustering.png') 44 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/03 Model Selection/02 K-Fold Cross-Validation/KFoldCrossValidation.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import load_iris 5 | from sklearn.linear_model import LogisticRegression 6 | from sklearn.model_selection import cross_val_score 7 | import seaborn as sns 8 | 9 | # K-Fold Cross-Validation 10 | # This script demonstrates K-Fold Cross-Validation. 11 | 12 | # Tasks: 13 | # 1. Load the Iris dataset. 14 | # 2. Apply K-Fold Cross-Validation (k=5). 15 | # 3. Train a Logistic Regression model. 16 | # 4. Evaluate performance (mean accuracy and standard deviation). 17 | # 5. Visualize cross-validation scores. 
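# Aside: when cv is an integer and the estimator is a classifier, cross_val_score stratifies
# the folds by default; passing an explicit KFold object gives plain shuffled folds instead.
# Illustrative sketch using the imports above (underscore names are throwaway placeholders):
from sklearn.model_selection import KFold
_X_demo, _y_demo = load_iris(return_X_y=True)
_kf_scores = cross_val_score(LogisticRegression(max_iter=200), _X_demo, _y_demo,
                             cv=KFold(n_splits=5, shuffle=True, random_state=42), scoring='accuracy')
print(f'Explicit KFold accuracies: {_kf_scores}')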
18 | 19 | # Step 1: Load data 20 | iris = load_iris() 21 | X = iris.data 22 | y = iris.target 23 | 24 | # Step 2: Apply K-Fold Cross-Validation 25 | model = LogisticRegression(random_state=42) 26 | k = 5 27 | scores = cross_val_score(model, X, y, cv=k, scoring='accuracy') 28 | 29 | # Step 3: Evaluate performance 30 | mean_accuracy = np.mean(scores) 31 | std_accuracy = np.std(scores) 32 | print(f'K-Fold CV Scores: {scores}') 33 | print(f'Mean Accuracy: {mean_accuracy:.2f}') 34 | print(f'Standard Deviation: {std_accuracy:.2f}') 35 | 36 | # Step 4: Visualize CV scores 37 | plt.figure(figsize=(10, 6)) 38 | sns.barplot(x=np.arange(1, k+1), y=scores, palette='viridis') 39 | plt.axhline(mean_accuracy, color='red', linestyle='--', label=f'Mean Accuracy: {mean_accuracy:.2f}') 40 | plt.xlabel('Fold') 41 | plt.ylabel('Accuracy') 42 | plt.title('K-Fold Cross-Validation Scores') 43 | plt.legend() 44 | plt.grid(True) 45 | plt.savefig('kfold_cross_validation.png') 46 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/02 Unsupervised Learning/01 Clustering/01 K-Means Clustering/KMeansClustering.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import make_blobs 5 | from sklearn.cluster import KMeans 6 | from sklearn.metrics import silhouette_score 7 | import seaborn as sns 8 | 9 | # K-Means Clustering 10 | # This script demonstrates K-Means Clustering on synthetic data. 11 | 12 | # Tasks: 13 | # 1. Generate synthetic clustering data. 14 | # 2. Apply K-Means clustering with a chosen number of clusters. 15 | # 3. Evaluate clustering performance using silhouette score. 16 | # 4. Visualize clusters and centroids. 17 | 18 | # Step 1: Generate synthetic data 19 | np.random.seed(42) 20 | X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42) 21 | 22 | # Convert to DataFrame 23 | data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2']) 24 | 25 | # Step 2: Apply K-Means clustering 26 | n_clusters = 4 27 | kmeans = KMeans(n_clusters=n_clusters, random_state=42) 28 | kmeans.fit(X) 29 | labels = kmeans.labels_ 30 | centroids = kmeans.cluster_centers_ 31 | 32 | # Step 3: Evaluate performance 33 | silhouette = silhouette_score(X, labels) 34 | print(f'Silhouette Score: {silhouette:.2f}') 35 | 36 | # Step 4: Visualize clusters 37 | plt.figure(figsize=(10, 6)) 38 | sns.scatterplot(x=data['Feature_1'], y=data['Feature_2'], hue=labels, palette='viridis', s=100) 39 | plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x', s=200, linewidths=3, label='Centroids') 40 | plt.xlabel('Feature 1') 41 | plt.ylabel('Feature 2') 42 | plt.title('K-Means Clustering') 43 | plt.legend() 44 | plt.grid(True) 45 | plt.savefig('kmeans_clustering.png') 46 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/02 Unsupervised Learning/01 Clustering/03 DBSCAN/DBSCAN.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import make_moons 5 | from sklearn.cluster import DBSCAN 6 | from sklearn.metrics import silhouette_score 7 | import seaborn as sns 8 | 9 | # DBSCAN (Density-Based Spatial Clustering of Applications with Noise) 10 | # This script demonstrates DBSCAN on non-spherical data. 
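# Aside: eps is the most sensitive DBSCAN parameter; a common heuristic is to inspect the sorted
# distance to the k-th nearest neighbor (k = min_samples) and pick eps near the "elbow" of that
# curve. Illustrative sketch on throwaway moon data (underscore names are placeholders):
from sklearn.neighbors import NearestNeighbors
_X_demo, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
_knn_dist, _ = NearestNeighbors(n_neighbors=5).fit(_X_demo).kneighbors(_X_demo)
print(f'Typical 5th-NN distance (rough eps guide): {np.percentile(_knn_dist[:, -1], 90):.3f}')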
11 | 12 | # Tasks: 13 | # 1. Generate synthetic moon-shaped data. 14 | # 2. Apply DBSCAN clustering. 15 | # 3. Evaluate clustering performance using silhouette score (if applicable). 16 | # 4. Visualize clusters and noise points. 17 | 18 | # Step 1: Generate synthetic data 19 | np.random.seed(42) 20 | X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42) 21 | 22 | # Convert to DataFrame 23 | data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2']) 24 | 25 | # Step 2: Apply DBSCAN clustering 26 | dbscan = DBSCAN(eps=0.3, min_samples=5) 27 | labels = dbscan.fit_predict(X) 28 | 29 | # Step 3: Evaluate performance 30 | # Silhouette score only if there are at least 2 clusters and no noise (-1 labels) 31 | if len(set(labels)) > 1 and -1 not in labels: 32 | silhouette = silhouette_score(X, labels) 33 | print(f'Silhouette Score: {silhouette:.2f}') 34 | else: 35 | print('Silhouette Score: Not applicable due to noise or single cluster') 36 | 37 | # Step 4: Visualize clusters 38 | plt.figure(figsize=(10, 6)) 39 | sns.scatterplot(x=data['Feature_1'], y=data['Feature_2'], hue=labels, palette='viridis', s=100) 40 | plt.xlabel('Feature 1') 41 | plt.ylabel('Feature 2') 42 | plt.title('DBSCAN Clustering (Noise points in black)') 43 | plt.grid(True) 44 | plt.savefig('dbscan_clustering.png') 45 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/03 Model Selection/04 Grid Search/GridSearch.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import load_iris 5 | from sklearn.svm import SVC 6 | from sklearn.model_selection import GridSearchCV 7 | import seaborn as sns 8 | 9 | # Grid Search 10 | # This script demonstrates Grid Search for hyperparameter tuning. 11 | 12 | # Tasks: 13 | # 1. Load the Iris dataset. 14 | # 2. Define a parameter grid for SVC. 15 | # 3. Perform Grid Search with cross-validation. 16 | # 4. Evaluate the best model (accuracy). 17 | # 5. Visualize hyperparameter performance. 
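# Aside: when preprocessing (e.g., scaling) is part of the workflow, wrapping it in a Pipeline and
# grid-searching over the pipeline avoids leaking test-fold statistics during cross-validation.
# Illustrative sketch (placeholder names; uses Pipeline/StandardScaler in addition to the imports above):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
_pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC(random_state=42))])
_pipe_grid = {'svc__C': [0.1, 1, 10], 'svc__kernel': ['linear', 'rbf']}
# GridSearchCV(_pipe, _pipe_grid, cv=5, scoring='accuracy') would then be fit on the X, y loaded in Step 1.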
18 | 19 | # Step 1: Load data 20 | iris = load_iris() 21 | X = iris.data 22 | y = iris.target 23 | 24 | # Step 2: Define parameter grid 25 | param_grid = { 26 | 'C': [0.1, 1, 10], 27 | 'kernel': ['linear', 'rbf'], 28 | 'gamma': ['scale', 'auto'] 29 | } 30 | 31 | # Step 3: Perform Grid Search 32 | model = SVC(random_state=42) 33 | grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy') 34 | grid_search.fit(X, y) 35 | 36 | # Step 4: Evaluate best model 37 | best_model = grid_search.best_estimator_ 38 | best_score = grid_search.best_score_ 39 | print(f'Best Parameters: {grid_search.best_params_}') 40 | print(f'Best Cross-Validation Accuracy: {best_score:.2f}') 41 | 42 | # Step 5: Visualize hyperparameter performance 43 | results = pd.DataFrame(grid_search.cv_results_) 44 | pivot_table = results.pivot_table(values='mean_test_score', index='param_C', columns='param_kernel') 45 | 46 | plt.figure(figsize=(10, 6)) 47 | sns.heatmap(pivot_table, annot=True, cmap='viridis', fmt='.2f') 48 | plt.xlabel('Kernel') 49 | plt.ylabel('C') 50 | plt.title('Grid Search: Mean Test Accuracy') 51 | plt.savefig('grid_search.png') 52 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/03 Model Selection/01 Train-Test Split/TrainTestSplit.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import load_iris 5 | from sklearn.linear_model import LogisticRegression 6 | from sklearn.metrics import accuracy_score 7 | from sklearn.model_selection import train_test_split 8 | import seaborn as sns 9 | 10 | # Train-Test Split 11 | # This script demonstrates splitting data into training and testing sets. 12 | 13 | # Tasks: 14 | # 1. Load the Iris dataset. 15 | # 2. Split data into training and testing sets. 16 | # 3. Train a Logistic Regression model. 17 | # 4. Evaluate performance (accuracy). 18 | # 5. Visualize training vs testing data distribution. 
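# Aside: for classification, passing stratify=y keeps class proportions roughly equal in the train
# and test sets, which matters for small or imbalanced datasets. Illustrative sketch
# (underscore names are throwaway placeholders):
_X_demo, _y_demo = load_iris(return_X_y=True)
_Xtr, _Xte, _ytr, _yte = train_test_split(_X_demo, _y_demo, test_size=0.2, random_state=42, stratify=_y_demo)
print(f'Test-set class counts with stratify: {np.bincount(_yte)}')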
19 | 20 | # Step 1: Load data 21 | iris = load_iris() 22 | X = iris.data 23 | y = iris.target 24 | data = pd.DataFrame(X, columns=iris.feature_names) 25 | data['Target'] = y 26 | 27 | # Step 2: Split data 28 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 29 | 30 | # Step 3: Train Logistic Regression 31 | model = LogisticRegression(random_state=42) 32 | model.fit(X_train, y_train) 33 | 34 | # Step 4: Evaluate performance 35 | y_pred = model.predict(X_test) 36 | accuracy = accuracy_score(y_test, y_pred) 37 | print(f'Accuracy: {accuracy:.2f}') 38 | 39 | # Step 5: Visualize data distribution (using first feature as example) 40 | plt.figure(figsize=(10, 6)) 41 | sns.histplot(data=X_train[:, 0], color='blue', label='Training Data', alpha=0.5, bins=20) 42 | sns.histplot(data=X_test[:, 0], color='red', label='Testing Data', alpha=0.5, bins=20) 43 | plt.xlabel(iris.feature_names[0]) 44 | plt.ylabel('Count') 45 | plt.title('Train-Test Split: Distribution of First Feature') 46 | plt.legend() 47 | plt.grid(True) 48 | plt.savefig('train_test_split.png') 49 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/05 Deployment/01 Model Serialization/ModelSerialization.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from sklearn.datasets import load_iris 4 | from sklearn.linear_model import LogisticRegression 5 | from sklearn.model_selection import train_test_split 6 | import joblib 7 | import matplotlib.pyplot as plt 8 | import seaborn as sns 9 | 10 | # Model Serialization 11 | # This script demonstrates saving and loading a trained model. 12 | 13 | # Tasks: 14 | # 1. Load the Iris dataset. 15 | # 2. Train a Logistic Regression model. 16 | # 3. Save the model using joblib. 17 | # 4. Load the model and make predictions. 18 | # 5. Visualize prediction results. 
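# Aside: serializing the whole preprocessing + model pipeline keeps training-time transforms and the
# estimator together, so the loaded artifact can score raw feature rows directly. Illustrative sketch
# (the file name 'iris_pipeline.pkl' and the underscore names are placeholders):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
_X_demo, _y_demo = load_iris(return_X_y=True)
_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200)).fit(_X_demo, _y_demo)
joblib.dump(_pipe, 'iris_pipeline.pkl')  # later restored with joblib.load('iris_pipeline.pkl')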
19 | 20 | # Step 1: Load data 21 | iris = load_iris() 22 | X = iris.data 23 | y = iris.target 24 | data = pd.DataFrame(X, columns=iris.feature_names) 25 | data['Target'] = y 26 | 27 | # Step 2: Train Logistic Regression 28 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 29 | model = LogisticRegression(random_state=42) 30 | model.fit(X_train, y_train) 31 | 32 | # Step 3: Save the model 33 | joblib.dump(model, 'logistic_model.pkl') 34 | 35 | # Step 4: Load the model and predict 36 | loaded_model = joblib.load('logistic_model.pkl') 37 | y_pred = loaded_model.predict(X_test) 38 | 39 | # Evaluate performance 40 | accuracy = loaded_model.score(X_test, y_test) 41 | print(f'Accuracy of Loaded Model: {accuracy:.2f}') 42 | 43 | # Step 5: Visualize predictions 44 | plt.figure(figsize=(10, 6)) 45 | sns.scatterplot(x=X_test[:, 0], y=X_test[:, 1], hue=iris.target_names[y_pred], style=iris.target_names[y_test], s=100) 46 | plt.xlabel(iris.feature_names[0]) 47 | plt.ylabel(iris.feature_names[1]) 48 | plt.title('Model Serialization: Predictions from Loaded Model') 49 | plt.grid(True) 50 | plt.savefig('model_serialization.png') 51 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/03 Model Selection/05 Random Search/RandomSearch.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import load_iris 5 | from sklearn.svm import SVC 6 | from sklearn.model_selection import RandomizedSearchCV 7 | from scipy.stats import uniform, randint 8 | import seaborn as sns 9 | 10 | # Random Search 11 | # This script demonstrates Random Search for hyperparameter tuning. 12 | 13 | # Tasks: 14 | # 1. Load the Iris dataset. 15 | # 2. Define a parameter distribution for SVC. 16 | # 3. Perform Random Search with cross-validation. 17 | # 4. Evaluate the best model (accuracy). 18 | # 5. Visualize hyperparameter performance. 
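# Aside: uniform(0.1, 10) samples C on a linear scale; for parameters that span several orders of
# magnitude, a log-uniform distribution usually explores the range more evenly.
# Illustrative sketch (requires scipy >= 1.4 for loguniform; names are placeholders):
from scipy.stats import loguniform
_log_param_dist = {'C': loguniform(1e-2, 1e2), 'kernel': ['linear', 'rbf'], 'gamma': ['scale', 'auto']}
# RandomizedSearchCV(SVC(random_state=42), _log_param_dist, n_iter=10, cv=5) would then be fit on X, y.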
19 | 20 | # Step 1: Load data 21 | iris = load_iris() 22 | X = iris.data 23 | y = iris.target 24 | 25 | # Step 2: Define parameter distribution 26 | param_dist = { 27 | 'C': uniform(0.1, 10), 28 | 'kernel': ['linear', 'rbf'], 29 | 'gamma': ['scale', 'auto'] 30 | } 31 | 32 | # Step 3: Perform Random Search 33 | model = SVC(random_state=42) 34 | random_search = RandomizedSearchCV(model, param_dist, n_iter=10, cv=5, scoring='accuracy', random_state=42) 35 | random_search.fit(X, y) 36 | 37 | # Step 4: Evaluate best model 38 | best_model = random_search.best_estimator_ 39 | best_score = random_search.best_score_ 40 | print(f'Best Parameters: {random_search.best_params_}') 41 | print(f'Best Cross-Validation Accuracy: {best_score:.2f}') 42 | 43 | # Step 5: Visualize hyperparameter performance 44 | results = pd.DataFrame(random_search.cv_results_) 45 | plt.figure(figsize=(10, 6)) 46 | sns.scatterplot(data=results, x='param_C', y='mean_test_score', hue='param_kernel', style='param_gamma', size='mean_test_score') 47 | plt.xlabel('C') 48 | plt.ylabel('Mean Test Accuracy') 49 | plt.title('Random Search: Hyperparameter Performance') 50 | plt.grid(True) 51 | plt.savefig('random_search.png') 52 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/02 Unsupervised Learning/01 Clustering/02 Hierarchical Clustering/HierarchicalClustering.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import make_blobs 5 | from sklearn.cluster import AgglomerativeClustering 6 | from sklearn.metrics import silhouette_score 7 | from scipy.cluster.hierarchy import dendrogram, linkage 8 | import seaborn as sns 9 | 10 | # Hierarchical Clustering 11 | # This script demonstrates Hierarchical Clustering with a dendrogram. 12 | 13 | # Tasks: 14 | # 1. Generate synthetic clustering data. 15 | # 2. Apply Hierarchical Clustering (Agglomerative). 16 | # 3. Evaluate clustering performance using silhouette score. 17 | # 4. Visualize clusters and dendrogram. 
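# Aside: with hundreds of samples the full dendrogram becomes unreadable; SciPy can truncate it to the
# last merged clusters. Illustrative sketch on throwaway blob data (underscore names are placeholders):
_X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)
_Z_demo = linkage(_X_demo, method='ward')
# dendrogram(_Z_demo, truncate_mode='lastp', p=12)  # draw only the last 12 merged clusters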
18 | 19 | # Step 1: Generate synthetic data 20 | np.random.seed(42) 21 | X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42) 22 | 23 | # Convert to DataFrame 24 | data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2']) 25 | 26 | # Step 2: Apply Hierarchical Clustering 27 | n_clusters = 4 28 | hierarchical = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward') 29 | labels = hierarchical.fit_predict(X) 30 | 31 | # Step 3: Evaluate performance 32 | silhouette = silhouette_score(X, labels) 33 | print(f'Silhouette Score: {silhouette:.2f}') 34 | 35 | # Step 4: Visualize clusters and dendrogram 36 | plt.figure(figsize=(12, 10)) 37 | 38 | # Clusters 39 | plt.subplot(2, 1, 1) 40 | sns.scatterplot(x=data['Feature_1'], y=data['Feature_2'], hue=labels, palette='viridis', s=100) 41 | plt.xlabel('Feature 1') 42 | plt.ylabel('Feature 2') 43 | plt.title('Hierarchical Clustering') 44 | 45 | # Dendrogram 46 | plt.subplot(2, 1, 2) 47 | Z = linkage(X, method='ward') 48 | dendrogram(Z) 49 | plt.title('Dendrogram') 50 | plt.xlabel('Sample Index') 51 | plt.ylabel('Distance') 52 | plt.tight_layout() 53 | plt.savefig('hierarchical_clustering.png') 54 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/03 Model Selection/03 Stratified K-Fold/StratifiedKFold.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import load_iris 5 | from sklearn.linear_model import LogisticRegression 6 | from sklearn.model_selection import StratifiedKFold 7 | import seaborn as sns 8 | 9 | # Stratified K-Fold Cross-Validation 10 | # This script demonstrates Stratified K-Fold Cross-Validation for imbalanced classes. 11 | 12 | # Tasks: 13 | # 1. Load the Iris dataset. 14 | # 2. Apply Stratified K-Fold Cross-Validation (k=5). 15 | # 3. Train a Logistic Regression model. 16 | # 4. Evaluate performance (mean accuracy and standard deviation). 17 | # 5. Visualize cross-validation scores. 
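# Aside: the manual loop below makes the mechanics explicit; the same scores can be obtained in one
# call by passing the StratifiedKFold splitter to cross_val_score. Illustrative sketch
# (placeholder names; uses cross_val_score in addition to the imports above):
from sklearn.model_selection import cross_val_score
_X_demo, _y_demo = load_iris(return_X_y=True)
_skf_scores = cross_val_score(LogisticRegression(max_iter=200), _X_demo, _y_demo,
                              cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42))
print(f'cross_val_score with StratifiedKFold: {_skf_scores}')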
18 | 19 | # Step 1: Load data 20 | iris = load_iris() 21 | X = iris.data 22 | y = iris.target 23 | 24 | # Step 2: Apply Stratified K-Fold Cross-Validation 25 | model = LogisticRegression(random_state=42) 26 | k = 5 27 | skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42) 28 | scores = [] 29 | for train_idx, test_idx in skf.split(X, y): 30 | X_train, X_test = X[train_idx], X[test_idx] 31 | y_train, y_test = y[train_idx], y[test_idx] 32 | model.fit(X_train, y_train) 33 | score = model.score(X_test, y_test) 34 | scores.append(score) 35 | 36 | # Step 3: Evaluate performance 37 | mean_accuracy = np.mean(scores) 38 | std_accuracy = np.std(scores) 39 | print(f'Stratified K-Fold CV Scores: {scores}') 40 | print(f'Mean Accuracy: {mean_accuracy:.2f}') 41 | print(f'Standard Deviation: {std_accuracy:.2f}') 42 | 43 | # Step 4: Visualize CV scores 44 | plt.figure(figsize=(10, 6)) 45 | sns.barplot(x=np.arange(1, k+1), y=scores, palette='viridis') 46 | plt.axhline(mean_accuracy, color='red', linestyle='--', label=f'Mean Accuracy: {mean_accuracy:.2f}') 47 | plt.xlabel('Fold') 48 | plt.ylabel('Accuracy') 49 | plt.title('Stratified K-Fold Cross-Validation Scores') 50 | plt.legend() 51 | plt.grid(True) 52 | plt.savefig('stratified_kfold.png') 53 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/02 Unsupervised Learning/02 Dimensionality Reduction/01 Principal Component Analysis (PCA)/PCA.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import load_iris 5 | from sklearn.decomposition import PCA 6 | from sklearn.preprocessing import StandardScaler 7 | import seaborn as sns 8 | 9 | # Principal Component Analysis (PCA) 10 | # This script demonstrates PCA for dimensionality reduction on the Iris dataset. 11 | 12 | # Tasks: 13 | # 1. Load the Iris dataset. 14 | # 2. Standardize the features. 15 | # 3. Apply PCA to reduce to 2 dimensions. 16 | # 4. Evaluate explained variance. 17 | # 5. Visualize the reduced data. 
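# Aside: instead of fixing n_components=2, PCA accepts a float in (0, 1) and keeps the smallest number
# of components whose cumulative explained variance reaches that fraction. Illustrative sketch
# (underscore names are throwaway placeholders):
_X_demo, _ = load_iris(return_X_y=True)
_pca_95 = PCA(n_components=0.95).fit(StandardScaler().fit_transform(_X_demo))
print(f'Components needed for 95% explained variance: {_pca_95.n_components_}')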
18 | 19 | # Step 1: Load data 20 | iris = load_iris() 21 | X = iris.data 22 | y = iris.target 23 | data = pd.DataFrame(X, columns=iris.feature_names) 24 | data['Target'] = y 25 | 26 | # Step 2: Standardize features 27 | scaler = StandardScaler() 28 | X_scaled = scaler.fit_transform(X) 29 | 30 | # Step 3: Apply PCA 31 | pca = PCA(n_components=2) 32 | X_pca = pca.fit_transform(X_scaled) 33 | 34 | # Step 4: Evaluate explained variance 35 | explained_variance = pca.explained_variance_ratio_ 36 | print(f'Explained Variance Ratio: {explained_variance}') 37 | print(f'Total Explained Variance: {sum(explained_variance):.2f}') 38 | 39 | # Step 5: Visualize reduced data 40 | plt.figure(figsize=(10, 6)) 41 | sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=iris.target_names[y], palette='viridis', s=100) 42 | plt.xlabel('Principal Component 1') 43 | plt.ylabel('Principal Component 2') 44 | plt.title('PCA: Iris Dataset Reduced to 2D') 45 | plt.grid(True) 46 | plt.savefig('pca.png') 47 | plt.close() 48 | 49 | # Scree plot for explained variance 50 | plt.figure(figsize=(8, 5)) 51 | plt.bar(range(1, len(explained_variance) + 1), explained_variance, alpha=0.6, color='blue') 52 | plt.xlabel('Principal Component') 53 | plt.ylabel('Explained Variance Ratio') 54 | plt.title('Scree Plot') 55 | plt.grid(True) 56 | plt.savefig('pca_scree.png') 57 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/01 Supervised Learning/03 Evaluation Metrics/01 Regression Metrics/RegressionMetrics.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.linear_model import LinearRegression 5 | from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score 6 | from sklearn.model_selection import train_test_split 7 | 8 | # Regression Evaluation Metrics 9 | # This script demonstrates MSE, MAE, and R² for regression. 10 | 11 | # Tasks: 12 | # 1. Generate synthetic regression data. 13 | # 2. Split data into training and testing sets. 14 | # 3. Train a Linear Regression model. 15 | # 4. Calculate MSE, MAE, and R². 16 | # 5. Visualize actual vs predicted values. 
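# Aside: RMSE (the square root of MSE) is often reported alongside MSE because it is in the same units
# as the target. Illustrative sketch, assuming the y_test and y_pred computed in Step 4 below:
# rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')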
17 | 18 | # Step 1: Generate synthetic data 19 | np.random.seed(42) 20 | X = np.random.rand(100, 1) * 10 21 | y = 2 * X.flatten() + 1 + np.random.randn(100) * 2 22 | 23 | # Convert to DataFrame 24 | data = pd.DataFrame({'X': X.flatten(), 'y': y}) 25 | 26 | # Step 2: Split data 27 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 28 | 29 | # Step 3: Train Linear Regression model 30 | model = LinearRegression() 31 | model.fit(X_train, y_train) 32 | 33 | # Step 4: Make predictions and calculate metrics 34 | y_pred = model.predict(X_test) 35 | mse = mean_squared_error(y_test, y_pred) 36 | mae = mean_absolute_error(y_test, y_pred) 37 | r2 = r2_score(y_test, y_pred) 38 | 39 | print(f'Mean Squared Error (MSE): {mse:.2f}') 40 | print(f'Mean Absolute Error (MAE): {mae:.2f}') 41 | print(f'R² Score: {r2:.2f}') 42 | 43 | # Step 5: Visualize actual vs predicted 44 | plt.figure(figsize=(10, 6)) 45 | plt.scatter(y_test, y_pred, color='blue', label='Predicted vs Actual') 46 | plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', label='Perfect Prediction') 47 | plt.xlabel('Actual Values') 48 | plt.ylabel('Predicted Values') 49 | plt.title('Regression Metrics: Actual vs Predicted') 50 | plt.legend() 51 | plt.grid(True) 52 | plt.savefig('regression_metrics.png') 53 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/01 Data Preprocessing/05 Outlier Detection/OutlierDetection.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import load_iris 5 | from sklearn.ensemble import IsolationForest 6 | import seaborn as sns 7 | 8 | # Outlier Detection 9 | # This script demonstrates outlier detection using Isolation Forest. 10 | 11 | # Tasks: 12 | # 1. Load the Iris dataset and introduce synthetic outliers. 13 | # 2. Apply Isolation Forest to detect outliers. 14 | # 3. Evaluate the number of detected outliers. 15 | # 4. Visualize outliers in the dataset. 
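# Aside: a simple univariate alternative to Isolation Forest is the IQR rule, flagging points outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for any feature. Illustrative sketch (underscore names are placeholders):
_X_demo = load_iris().data
_q1, _q3 = np.percentile(_X_demo, [25, 75], axis=0)
_iqr = _q3 - _q1
_iqr_mask = ((_X_demo < _q1 - 1.5 * _iqr) | (_X_demo > _q3 + 1.5 * _iqr)).any(axis=1)
print(f'IQR rule flags {_iqr_mask.sum()} of {len(_X_demo)} rows')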
16 | 17 | # Step 1: Load data and introduce outliers 18 | iris = load_iris() 19 | X = iris.data 20 | data = pd.DataFrame(X, columns=iris.feature_names) 21 | 22 | # Introduce 5% outliers 23 | np.random.seed(42) 24 | n_outliers = int(0.05 * X.shape[0]) 25 | outliers = np.random.uniform(low=-10, high=10, size=(n_outliers, X.shape[1])) 26 | X_with_outliers = np.vstack([X, outliers]) 27 | data_with_outliers = pd.DataFrame(X_with_outliers, columns=iris.feature_names) 28 | 29 | # Step 2: Apply Isolation Forest 30 | iso_forest = IsolationForest(contamination=0.05, random_state=42) 31 | outlier_labels = iso_forest.fit_predict(X_with_outliers) 32 | # -1 indicates outlier, 1 indicates inlier 33 | outliers_detected = X_with_outliers[outlier_labels == -1] 34 | inliers = X_with_outliers[outlier_labels == 1] 35 | 36 | # Step 3: Evaluate 37 | n_outliers_detected = len(outliers_detected) 38 | print(f'Number of Outliers Detected: {n_outliers_detected}') 39 | 40 | # Step 4: Visualize outliers (using first two features for 2D plot) 41 | plt.figure(figsize=(10, 6)) 42 | sns.scatterplot(x=data_with_outliers.iloc[:, 0], y=data_with_outliers.iloc[:, 1], hue=np.where(outlier_labels == -1, 'Outlier', 'Inlier'), palette={'Inlier': 'blue', 'Outlier': 'red'}, s=100) 43 | plt.xlabel(iris.feature_names[0]) 44 | plt.ylabel(iris.feature_names[1]) 45 | plt.title('Outlier Detection with Isolation Forest') 46 | plt.legend() 47 | plt.grid(True) 48 | plt.savefig('outlier_detection.png') 49 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/04 Model Evaluation/01 Bias-Variance Tradeoff/BiasVarianceTradeoff.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.preprocessing import PolynomialFeatures 5 | from sklearn.linear_model import LinearRegression 6 | from sklearn.metrics import mean_squared_error 7 | from sklearn.model_selection import train_test_split 8 | 9 | # Bias-Variance Tradeoff 10 | # This script demonstrates the bias-variance tradeoff using polynomial regression. 11 | 12 | # Tasks: 13 | # 1. Generate synthetic non-linear data. 14 | # 2. Train polynomial regression models with varying degrees. 15 | # 3. Evaluate training and testing errors. 16 | # 4. Visualize bias-variance tradeoff. 
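# Aside: the variance side of the tradeoff shows up as instability across resamples; repeating the
# split with different seeds and watching the spread of the high-degree test error makes that
# concrete. Illustrative sketch, assuming the X, y generated in Step 1 below:
# _errs = []
# for _seed in range(5):
#     _Xtr, _Xte, _ytr, _yte = train_test_split(X, y, test_size=0.2, random_state=_seed)
#     _poly = PolynomialFeatures(degree=10)
#     _m = LinearRegression().fit(_poly.fit_transform(_Xtr), _ytr)
#     _errs.append(mean_squared_error(_yte, _m.predict(_poly.transform(_Xte))))
# print(f'Degree-10 test MSE across seeds: {np.round(_errs, 3)}')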
17 | 18 | # Step 1: Generate synthetic data 19 | np.random.seed(42) 20 | X = np.sort(5 * np.random.rand(100, 1), axis=0) 21 | y = np.sin(X).ravel() + np.random.randn(100) * 0.1 22 | 23 | # Step 2: Split data 24 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 25 | 26 | # Step 3: Train models with varying polynomial degrees 27 | degrees = [1, 3, 10] 28 | train_errors = [] 29 | test_errors = [] 30 | 31 | for degree in degrees: 32 | poly = PolynomialFeatures(degree=degree) 33 | X_train_poly = poly.fit_transform(X_train) 34 | X_test_poly = poly.transform(X_test) 35 | 36 | model = LinearRegression() 37 | model.fit(X_train_poly, y_train) 38 | 39 | y_train_pred = model.predict(X_train_poly) 40 | y_test_pred = model.predict(X_test_poly) 41 | 42 | train_errors.append(mean_squared_error(y_train, y_train_pred)) 43 | test_errors.append(mean_squared_error(y_test, y_test_pred)) 44 | 45 | # Step 4: Visualize bias-variance tradeoff 46 | plt.figure(figsize=(10, 6)) 47 | plt.plot(degrees, train_errors, marker='o', label='Training Error', color='blue') 48 | plt.plot(degrees, test_errors, marker='o', label='Testing Error', color='red') 49 | plt.xlabel('Polynomial Degree') 50 | plt.ylabel('Mean Squared Error') 51 | plt.title('Bias-Variance Tradeoff') 52 | plt.legend() 53 | plt.grid(True) 54 | plt.savefig('bias_variance_tradeoff.png') 55 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/01 Supervised Learning/01 Regression/04 Lasso Regression/LassoRegression.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.linear_model import Lasso 5 | from sklearn.metrics import mean_squared_error, r2_score 6 | from sklearn.model_selection import train_test_split 7 | 8 | # Lasso Regression 9 | # This script demonstrates Lasso Regression with synthetic data for feature selection. 10 | 11 | # Tasks: 12 | # 1. Generate synthetic data with some irrelevant features. 13 | # 2. Split data into training and testing sets. 14 | # 3. Train a Lasso Regression model with L1 regularization. 15 | # 4. Make predictions and evaluate performance (MSE, R²). 16 | # 5. Visualize feature coefficients to show sparsity. 
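# Aside: the regularization strength alpha is usually tuned rather than fixed; LassoCV picks it by
# cross-validation over a path of candidate values. Illustrative sketch, assuming the X_train and
# y_train produced in Step 2 below:
# from sklearn.linear_model import LassoCV
# lasso_cv = LassoCV(cv=5, random_state=42).fit(X_train, y_train)
# print(f'Alpha chosen by cross-validation: {lasso_cv.alpha_:.4f}')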
17 | 18 | # Step 1: Generate synthetic data 19 | np.random.seed(42) 20 | n_samples, n_features = 100, 10 21 | X = np.random.randn(n_samples, n_features) 22 | # Only first 3 features are relevant 23 | y = 3 * X[:, 0] + 2 * X[:, 1] - 1.5 * X[:, 2] + np.random.randn(n_samples) * 0.5 24 | 25 | # Convert to DataFrame 26 | data = pd.DataFrame(X, columns=[f'Feature_{i}' for i in range(n_features)]) 27 | data['Target'] = y 28 | 29 | # Step 2: Split data 30 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 31 | 32 | # Step 3: Train Lasso Regression model 33 | lasso = Lasso(alpha=0.1) # Regularization strength 34 | lasso.fit(X_train, y_train) 35 | 36 | # Step 4: Make predictions 37 | y_pred = lasso.predict(X_test) 38 | 39 | # Evaluate performance 40 | mse = mean_squared_error(y_test, y_pred) 41 | r2 = r2_score(y_test, y_pred) 42 | 43 | print(f'Mean Squared Error: {mse:.2f}') 44 | print(f'R² Score: {r2:.2f}') 45 | print('Feature Coefficients:', lasso.coef_) 46 | 47 | # Step 5: Visualize feature coefficients 48 | plt.figure(figsize=(10, 6)) 49 | plt.bar(range(n_features), lasso.coef_, color='blue') 50 | plt.xticks(range(n_features), [f'Feature_{i}' for i in range(n_features)]) 51 | plt.xlabel('Features') 52 | plt.ylabel('Coefficient Value') 53 | plt.title('Lasso Regression: Feature Coefficients') 54 | plt.grid(True) 55 | plt.savefig('lasso_regression.png') 56 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/02 Unsupervised Learning/03 Association Rules/02 FP-Growth/FPGrowth.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import matplotlib.pyplot as plt 3 | from mlxtend.frequent_patterns import fpgrowth 4 | from mlxtend.frequent_patterns import association_rules 5 | import seaborn as sns 6 | 7 | # FP-Growth Algorithm 8 | # This script demonstrates the FP-Growth algorithm for association rule mining. 9 | 10 | # Tasks: 11 | # 1. Create synthetic transactional data. 12 | # 2. Apply the FP-Growth algorithm to find frequent itemsets. 13 | # 3. Generate association rules. 14 | # 4. Evaluate rules using support, confidence, and lift. 15 | # 5. Visualize rule metrics. 
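# Aside: mlxtend ships a TransactionEncoder that builds the one-hot DataFrame with a stable column
# order, which scales better than the manual encoding used below. Illustrative sketch, assuming the
# `transactions` list defined in Step 1 below:
# from mlxtend.preprocessing import TransactionEncoder
# te = TransactionEncoder()
# onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)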
16 | 17 | # Step 1: Create synthetic transactional data 18 | transactions = [ 19 | ['Bread', 'Milk', 'Eggs'], 20 | ['Bread', 'Butter', 'Eggs'], 21 | ['Milk', 'Butter', 'Cheese'], 22 | ['Bread', 'Milk', 'Butter'], 23 | ['Bread', 'Milk', 'Eggs', 'Cheese'], 24 | ['Milk', 'Cheese'], 25 | ['Bread', 'Eggs'], 26 | ['Butter', 'Cheese'], 27 | ['Bread', 'Milk', 'Butter', 'Eggs'], 28 | ['Milk', 'Butter'] 29 | ] 30 | 31 | # Convert to one-hot encoded DataFrame 32 | items = set(item for transaction in transactions for item in transaction) 33 | data = pd.DataFrame([[item in transaction for item in items] for transaction in transactions], columns=items) 34 | 35 | # Step 2: Apply FP-Growth algorithm 36 | frequent_itemsets = fpgrowth(data, min_support=0.3, use_colnames=True) 37 | 38 | # Step 3: Generate association rules 39 | rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.6) 40 | rules = rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']] 41 | 42 | # Step 4: Evaluate rules 43 | print('Frequent Itemsets:') 44 | print(frequent_itemsets) 45 | print('\nAssociation Rules:') 46 | print(rules) 47 | 48 | # Step 5: Visualize rule metrics 49 | plt.figure(figsize=(10, 6)) 50 | sns.scatterplot(data=rules, x='support', y='confidence', size='lift', hue='lift', palette='viridis') 51 | plt.xlabel('Support') 52 | plt.ylabel('Confidence') 53 | plt.title('FP-Growth: Association Rules (Size and Color by Lift)') 54 | plt.grid(True) 55 | plt.savefig('fpgrowth_rules.png') 56 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/02 Unsupervised Learning/03 Association Rules/01 Apriori Algorithm/AprioriAlgorithm.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import matplotlib.pyplot as plt 3 | from mlxtend.frequent_patterns import apriori 4 | from mlxtend.frequent_patterns import association_rules 5 | import seaborn as sns 6 | 7 | # Apriori Algorithm 8 | # This script demonstrates the Apriori algorithm for association rule mining. 9 | 10 | # Tasks: 11 | # 1. Create synthetic transactional data. 12 | # 2. Apply the Apriori algorithm to find frequent itemsets. 13 | # 3. Generate association rules. 14 | # 4. Evaluate rules using support, confidence, and lift. 15 | # 5. Visualize rule metrics. 
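# Aside: a lift above 1 indicates the antecedent and consequent co-occur more often than expected
# under independence, so rules are commonly filtered and ranked on it. Illustrative sketch, assuming
# the `rules` DataFrame produced in Step 3 below:
# strong_rules = rules[rules['lift'] > 1.2].sort_values('lift', ascending=False)
# print(strong_rules.head())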
16 | 17 | # Step 1: Create synthetic transactional data 18 | transactions = [ 19 | ['Bread', 'Milk', 'Eggs'], 20 | ['Bread', 'Butter', 'Eggs'], 21 | ['Milk', 'Butter', 'Cheese'], 22 | ['Bread', 'Milk', 'Butter'], 23 | ['Bread', 'Milk', 'Eggs', 'Cheese'], 24 | ['Milk', 'Cheese'], 25 | ['Bread', 'Eggs'], 26 | ['Butter', 'Cheese'], 27 | ['Bread', 'Milk', 'Butter', 'Eggs'], 28 | ['Milk', 'Butter'] 29 | ] 30 | 31 | # Convert to one-hot encoded DataFrame 32 | items = set(item for transaction in transactions for item in transaction) 33 | data = pd.DataFrame([[item in transaction for item in items] for transaction in transactions], columns=items) 34 | 35 | # Step 2: Apply Apriori algorithm 36 | frequent_itemsets = apriori(data, min_support=0.3, use_colnames=True) 37 | 38 | # Step 3: Generate association rules 39 | rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.6) 40 | rules = rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']] 41 | 42 | # Step 4: Evaluate rules 43 | print('Frequent Itemsets:') 44 | print(frequent_itemsets) 45 | print('\nAssociation Rules:') 46 | print(rules) 47 | 48 | # Step 5: Visualize rule metrics 49 | plt.figure(figsize=(10, 6)) 50 | sns.scatterplot(data=rules, x='support', y='confidence', size='lift', hue='lift', palette='viridis') 51 | plt.xlabel('Support') 52 | plt.ylabel('Confidence') 53 | plt.title('Apriori: Association Rules (Size and Color by Lift)') 54 | plt.grid(True) 55 | plt.savefig('apriori_rules.png') 56 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/01 Supervised Learning/01 Regression/03 Ridge Regression/RidgeRegression.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.linear_model import Ridge 5 | from sklearn.metrics import mean_squared_error, r2_score 6 | from sklearn.model_selection import train_test_split 7 | 8 | # Ridge Regression 9 | # This script demonstrates Ridge Regression with synthetic data to handle multicollinearity. 10 | 11 | # Tasks: 12 | # 1. Generate synthetic data with correlated features. 13 | # 2. Split data into training and testing sets. 14 | # 3. Train a Ridge Regression model with regularization. 15 | # 4. Make predictions and evaluate performance (MSE, R²). 16 | # 5. Visualize feature coefficients. 
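# Aside: as with Lasso, alpha is normally tuned; RidgeCV evaluates a small grid of alphas with
# built-in cross-validation. Illustrative sketch, assuming the X_train and y_train from Step 2 below:
# from sklearn.linear_model import RidgeCV
# ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X_train, y_train)
# print(f'Alpha chosen by cross-validation: {ridge_cv.alpha_}')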
17 | 18 | # Step 1: Generate synthetic data 19 | np.random.seed(42) 20 | n_samples, n_features = 100, 5 21 | X = np.random.randn(n_samples, n_features) 22 | # Introduce multicollinearity 23 | X[:, 1] = X[:, 0] + np.random.randn(n_samples) * 0.1 # Correlated feature 24 | y = 3 * X[:, 0] + 2 * X[:, 1] + np.random.randn(n_samples) * 0.5 25 | 26 | # Convert to DataFrame 27 | data = pd.DataFrame(X, columns=[f'Feature_{i}' for i in range(n_features)]) 28 | data['Target'] = y 29 | 30 | # Step 2: Split data 31 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 32 | 33 | # Step 3: Train Ridge Regression model 34 | ridge = Ridge(alpha=1.0) # Regularization strength 35 | ridge.fit(X_train, y_train) 36 | 37 | # Step 4: Make predictions 38 | y_pred = ridge.predict(X_test) 39 | 40 | # Evaluate performance 41 | mse = mean_squared_error(y_test, y_pred) 42 | r2 = r2_score(y_test, y_pred) 43 | 44 | print(f'Mean Squared Error: {mse:.2f}') 45 | print(f'R² Score: {r2:.2f}') 46 | print('Feature Coefficients:', ridge.coef_) 47 | 48 | # Step 5: Visualize feature coefficients 49 | plt.figure(figsize=(10, 6)) 50 | plt.bar(range(n_features), ridge.coef_, color='blue') 51 | plt.xticks(range(n_features), [f'Feature_{i}' for i in range(n_features)]) 52 | plt.xlabel('Features') 53 | plt.ylabel('Coefficient Value') 54 | plt.title('Ridge Regression: Feature Coefficients') 55 | plt.grid(True) 56 | plt.savefig('ridge_regression.png') 57 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/01 Supervised Learning/02 Classification/04 Naive Bayes/NaiveBayes.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import make_classification 5 | from sklearn.naive_bayes import GaussianNB 6 | from sklearn.metrics import accuracy_score, classification_report 7 | from sklearn.model_selection import train_test_split 8 | 9 | # Naive Bayes 10 | # This script demonstrates Gaussian Naive Bayes classification. 11 | 12 | # Tasks: 13 | # 1. Generate synthetic classification data. 14 | # 2. Split data into training and testing sets. 15 | # 3. Train a Gaussian Naive Bayes model. 16 | # 4. Make predictions and evaluate performance (accuracy, classification report). 17 | # 5. Visualize decision boundary. 
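# Aside: GaussianNB also exposes per-class probabilities via predict_proba, which is often more useful
# than hard labels for thresholding or ranking. Illustrative sketch, assuming the model and X_test
# from Steps 2-3 below:
# proba = model.predict_proba(X_test)
# print(f'First five class-probability rows:\n{np.round(proba[:5], 3)}')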
18 | 19 | # Step 1: Generate synthetic data 20 | X, y = make_classification(n_samples=100, n_features=2, n_classes=2, n_clusters_per_class=1, random_state=42) 21 | 22 | # Convert to DataFrame 23 | data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2']) 24 | data['Target'] = y 25 | 26 | # Step 2: Split data 27 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 28 | 29 | # Step 3: Train Naive Bayes model 30 | model = GaussianNB() 31 | model.fit(X_train, y_train) 32 | 33 | # Step 4: Make predictions 34 | y_pred = model.predict(X_test) 35 | 36 | # Evaluate performance 37 | accuracy = accuracy_score(y_test, y_pred) 38 | print(f'Accuracy: {accuracy:.2f}') 39 | print('Classification Report:') 40 | print(classification_report(y_test, y_pred)) 41 | 42 | # Step 5: Visualize decision boundary 43 | x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1 44 | y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1 45 | xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1)) 46 | Z = model.predict(np.c_[xx.ravel(), yy.ravel()]) 47 | Z = Z.reshape(xx.shape) 48 | 49 | plt.figure(figsize=(10, 6)) 50 | plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm') 51 | plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', label='Training data') 52 | plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, marker='x', label='Testing data') 53 | plt.xlabel('Feature 1') 54 | plt.ylabel('Feature 2') 55 | plt.title('Naive Bayes: Decision Boundary') 56 | plt.legend() 57 | plt.grid(True) 58 | plt.savefig('naive_bayes.png') 59 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/01 Supervised Learning/02 Classification/05 K-Nearest Neighbors (KNN)/KNN.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import make_classification 5 | from sklearn.neighbors import KNeighborsClassifier 6 | from sklearn.metrics import accuracy_score, classification_report 7 | from sklearn.model_selection import train_test_split 8 | 9 | # K-Nearest Neighbors (KNN) 10 | # This script demonstrates KNN classification. 11 | 12 | # Tasks: 13 | # 1. Generate synthetic classification data. 14 | # 2. Split data into training and testing sets. 15 | # 3. Train a KNN Classifier. 16 | # 4. Make predictions and evaluate performance (accuracy, classification report). 17 | # 5. Visualize decision boundary. 
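# Aside: KNN is distance-based, so features on very different scales can dominate the neighbor search;
# scaling inside a pipeline keeps the transform fitted on training data only. Illustrative sketch,
# assuming the X_train/X_test/y_train/y_test from Step 2 below:
# from sklearn.pipeline import make_pipeline
# from sklearn.preprocessing import StandardScaler
# knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_train, y_train)
# print(f'Scaled-KNN test accuracy: {knn_pipe.score(X_test, y_test):.2f}')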
18 | 19 | # Step 1: Generate synthetic data 20 | X, y = make_classification(n_samples=100, n_features=2, n_classes=2, n_clusters_per_class=1, random_state=42) 21 | 22 | # Convert to DataFrame 23 | data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2']) 24 | data['Target'] = y 25 | 26 | # Step 2: Split data 27 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 28 | 29 | # Step 3: Train KNN model 30 | model = KNeighborsClassifier(n_neighbors=5) 31 | model.fit(X_train, y_train) 32 | 33 | # Step 4: Make predictions 34 | y_pred = model.predict(X_test) 35 | 36 | # Evaluate performance 37 | accuracy = accuracy_score(y_test, y_pred) 38 | print(f'Accuracy: {accuracy:.2f}') 39 | print('Classification Report:') 40 | print(classification_report(y_test, y_pred)) 41 | 42 | # Step 5: Visualize decision boundary 43 | x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1 44 | y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1 45 | xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1)) 46 | Z = model.predict(np.c_[xx.ravel(), yy.ravel()]) 47 | Z = Z.reshape(xx.shape) 48 | 49 | plt.figure(figsize=(10, 6)) 50 | plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm') 51 | plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', label='Training data') 52 | plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, marker='x', label='Testing data') 53 | plt.xlabel('Feature 1') 54 | plt.ylabel('Feature 2') 55 | plt.title('K-Nearest Neighbors: Decision Boundary') 56 | plt.legend() 57 | plt.grid(True) 58 | plt.savefig('knn.png') 59 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/01 Supervised Learning/01 Regression/02 Polynomial Regression/PolynomialRegression.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.preprocessing import PolynomialFeatures 5 | from sklearn.linear_model import LinearRegression 6 | from sklearn.pipeline import make_pipeline 7 | from sklearn.metrics import mean_squared_error, r2_score 8 | from sklearn.model_selection import train_test_split 9 | 10 | # Polynomial Regression 11 | # This script demonstrates Polynomial Regression using synthetic non-linear data. 12 | 13 | # Tasks: 14 | # 1. Generate synthetic non-linear data (quadratic relationship). 15 | # 2. Split data into training and testing sets. 16 | # 3. Create and train a Polynomial Regression model (degree 2). 17 | # 4. Make predictions and evaluate performance (MSE, R²). 18 | # 5. Visualize the polynomial fit. 
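# Aside: the fitted coefficients live inside the pipeline; make_pipeline names each step after its
# lower-cased class. Illustrative sketch, assuming the `polyreg` pipeline fitted in Step 3 below:
# lin = polyreg.named_steps['linearregression']
# print(f'Intercept: {lin.intercept_:.2f}, coefficients: {np.round(lin.coef_, 2)}')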
19 | 20 | # Step 1: Generate synthetic data 21 | np.random.seed(42) 22 | X = np.sort(6 * np.random.rand(100, 1) - 3, axis=0) # Values between -3 and 3 23 | y = 0.5 * X**2 + X + 2 + np.random.randn(100, 1) * 0.5 # Quadratic + noise 24 | 25 | # Convert to DataFrame 26 | data = pd.DataFrame({'X': X.flatten(), 'y': y.flatten()}) 27 | 28 | # Step 2: Split data 29 | X_train, X_test, y_train, y_test = train_test_split(X, y.flatten(), test_size=0.2, random_state=42) 30 | 31 | # Step 3: Create and train Polynomial Regression model 32 | degree = 2 33 | polyreg = make_pipeline(PolynomialFeatures(degree), LinearRegression()) 34 | polyreg.fit(X_train, y_train) 35 | 36 | # Step 4: Make predictions 37 | X_test_sorted = np.sort(X_test, axis=0) 38 | y_pred = polyreg.predict(X_test) 39 | y_pred_plot = polyreg.predict(X_test_sorted) 40 | 41 | # Evaluate performance 42 | mse = mean_squared_error(y_test, y_pred) 43 | r2 = r2_score(y_test, y_pred) 44 | 45 | print(f'Mean Squared Error: {mse:.2f}') 46 | print(f'R² Score: {r2:.2f}') 47 | 48 | # Step 5: Visualize results 49 | plt.figure(figsize=(10, 6)) 50 | plt.scatter(X_train, y_train, color='blue', label='Training data') 51 | plt.scatter(X_test, y_test, color='green', label='Testing data') 52 | plt.plot(X_test_sorted, y_pred_plot, color='red', label='Polynomial fit (degree=2)') 53 | plt.xlabel('X') 54 | plt.ylabel('y') 55 | plt.title('Polynomial Regression') 56 | plt.legend() 57 | plt.grid(True) 58 | plt.savefig('polynomial_regression.png') 59 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/04 Ensemble Methods/01 Bagging/02 Random Forest/RandomForest.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import make_classification 5 | from sklearn.ensemble import RandomForestClassifier 6 | from sklearn.metrics import accuracy_score, classification_report 7 | from sklearn.model_selection import train_test_split 8 | 9 | # Random Forest 10 | # This script demonstrates Random Forest classification. 11 | 12 | # Tasks: 13 | # 1. Generate synthetic classification data. 14 | # 2. Split data into training and testing sets. 15 | # 3. Train a Random Forest Classifier. 16 | # 4. Make predictions and evaluate performance (accuracy, classification report). 17 | # 5. Visualize decision boundary. 
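# Aside: one practical benefit of Random Forests is the built-in impurity-based feature ranking.
# Illustrative sketch, assuming the model fitted in Step 3 below:
# importances = pd.Series(model.feature_importances_, index=['Feature_1', 'Feature_2'])
# print(importances.sort_values(ascending=False))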
18 | 19 | # Step 1: Generate synthetic data 20 | X, y = make_classification(n_samples=100, n_features=2, n_classes=2, n_clusters_per_class=1, random_state=42) 21 | 22 | # Convert to DataFrame 23 | data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2']) 24 | data['Target'] = y 25 | 26 | # Step 2: Split data 27 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 28 | 29 | # Step 3: Train Random Forest model 30 | model = RandomForestClassifier(n_estimators=100, random_state=42) 31 | model.fit(X_train, y_train) 32 | 33 | # Step 4: Make predictions 34 | y_pred = model.predict(X_test) 35 | 36 | # Evaluate performance 37 | accuracy = accuracy_score(y_test, y_pred) 38 | print(f'Accuracy: {accuracy:.2f}') 39 | print('Classification Report:') 40 | print(classification_report(y_test, y_pred)) 41 | 42 | # Step 5: Visualize decision boundary 43 | x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1 44 | y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1 45 | xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1)) 46 | Z = model.predict(np.c_[xx.ravel(), yy.ravel()]) 47 | Z = Z.reshape(xx.shape) 48 | 49 | plt.figure(figsize=(10, 6)) 50 | plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm') 51 | plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', label='Training data') 52 | plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, marker='x', label='Testing data') 53 | plt.xlabel('Feature 1') 54 | plt.ylabel('Feature 2') 55 | plt.title('Random Forest: Decision Boundary') 56 | plt.legend() 57 | plt.grid(True) 58 | plt.savefig('random_forest.png') 59 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/01 Supervised Learning/02 Classification/02 Decision Trees/DecisionTrees.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import make_classification 5 | from sklearn.tree import DecisionTreeClassifier 6 | from sklearn.metrics import accuracy_score, classification_report 7 | from sklearn.model_selection import train_test_split 8 | 9 | # Decision Trees 10 | # This script demonstrates Decision Tree classification. 11 | 12 | # Tasks: 13 | # 1. Generate synthetic classification data. 14 | # 2. Split data into training and testing sets. 15 | # 3. Train a Decision Tree Classifier. 16 | # 4. Make predictions and evaluate performance (accuracy, classification report). 17 | # 5. Visualize decision boundary. 
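# Aside: beyond the decision boundary, the tree structure itself can be drawn with sklearn's plot_tree.
# Illustrative sketch, assuming the model fitted in Step 3 below (the output file name is a placeholder):
# from sklearn.tree import plot_tree
# plt.figure(figsize=(10, 6))
# plot_tree(model, feature_names=['Feature_1', 'Feature_2'], class_names=['0', '1'], filled=True)
# plt.savefig('decision_tree_structure.png')
# plt.close()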
18 | 19 | # Step 1: Generate synthetic data 20 | X, y = make_classification(n_samples=100, n_features=2, n_classes=2, n_clusters_per_class=1, random_state=42) 21 | 22 | # Convert to DataFrame 23 | data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2']) 24 | data['Target'] = y 25 | 26 | # Step 2: Split data 27 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 28 | 29 | # Step 3: Train Decision Tree model 30 | model = DecisionTreeClassifier(max_depth=3, random_state=42) 31 | model.fit(X_train, y_train) 32 | 33 | # Step 4: Make predictions 34 | y_pred = model.predict(X_test) 35 | 36 | # Evaluate performance 37 | accuracy = accuracy_score(y_test, y_pred) 38 | print(f'Accuracy: {accuracy:.2f}') 39 | print('Classification Report:') 40 | print(classification_report(y_test, y_pred)) 41 | 42 | # Step 5: Visualize decision boundary 43 | x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1 44 | y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1 45 | xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1)) 46 | Z = model.predict(np.c_[xx.ravel(), yy.ravel()]) 47 | Z = Z.reshape(xx.shape) 48 | 49 | plt.figure(figsize=(10, 6)) 50 | plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm') 51 | plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', label='Training data') 52 | plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, marker='x', label='Testing data') 53 | plt.xlabel('Feature 1') 54 | plt.ylabel('Feature 2') 55 | plt.title('Decision Tree: Decision Boundary') 56 | plt.legend() 57 | plt.grid(True) 58 | plt.savefig('decision_tree.png') 59 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/01 Supervised Learning/01 Regression/01 Linear Regression/LinearRegression.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.linear_model import LinearRegression 5 | from sklearn.metrics import mean_squared_error, r2_score 6 | from sklearn.model_selection import train_test_split 7 | 8 | # Linear Regression 9 | # This script demonstrates Linear Regression using a synthetic dataset to predict house prices based on size. 10 | 11 | # Tasks: 12 | # 1. Generate synthetic data for house sizes and prices. 13 | # 2. Split data into training and testing sets. 14 | # 3. Train a Linear Regression model. 15 | # 4. Make predictions and evaluate performance (MSE, R²). 16 | # 5. Visualize the regression line and predictions. 
17 | 18 | # Step 1: Generate synthetic data 19 | np.random.seed(42) 20 | house_sizes = np.random.rand(100, 1) * 200 # Size in square feet (0-200) 21 | prices = 50 + 3 * house_sizes + np.random.randn(100, 1) * 10 # Price = 50 + 3*size + noise 22 | 23 | # Convert to DataFrame for clarity 24 | data = pd.DataFrame({'Size': house_sizes.flatten(), 'Price': prices.flatten()}) 25 | 26 | # Step 2: Split data 27 | X = data[['Size']] 28 | y = data['Price'] 29 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 30 | 31 | # Step 3: Train Linear Regression model 32 | model = LinearRegression() 33 | model.fit(X_train, y_train) 34 | 35 | # Step 4: Make predictions 36 | y_pred = model.predict(X_test) 37 | 38 | # Evaluate performance 39 | mse = mean_squared_error(y_test, y_pred) 40 | r2 = r2_score(y_test, y_pred) 41 | 42 | print(f'Mean Squared Error: {mse:.2f}') 43 | print(f'R² Score: {r2:.2f}') 44 | print(f'Coefficients: {model.coef_[0]:.2f}') 45 | print(f'Intercept: {model.intercept_:.2f}') 46 | 47 | # Step 5: Visualize results 48 | plt.figure(figsize=(10, 6)) 49 | plt.scatter(X_train, y_train, color='blue', label='Training data') 50 | plt.scatter(X_test, y_test, color='green', label='Testing data') 51 | plt.plot(X_test, y_pred, color='red', label='Regression line') 52 | plt.xlabel('House Size (sq ft)') 53 | plt.ylabel('Price ($1000)') 54 | plt.title('Linear Regression: House Size vs Price') 55 | plt.legend() 56 | plt.grid(True) 57 | plt.savefig('linear_regression.png') 58 | plt.close() 59 | 60 | # Save the plot for reference 61 | # The plot shows the regression line fitting the data, with training and testing points. -------------------------------------------------------------------------------- /Machine Learning Foundations/01 Supervised Learning/02 Classification/01 Logistic Regression/LogisticRegression.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import make_classification 5 | from sklearn.linear_model import LogisticRegression 6 | from sklearn.metrics import accuracy_score, classification_report 7 | from sklearn.model_selection import train_test_split 8 | 9 | # Logistic Regression 10 | # This script demonstrates Logistic Regression for binary classification. 11 | 12 | # Tasks: 13 | # 1. Generate synthetic classification data. 14 | # 2. Split data into training and testing sets. 15 | # 3. Train a Logistic Regression model. 16 | # 4. Make predictions and evaluate performance (accuracy, classification report). 17 | # 5. Visualize decision boundary.
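# A quick aside (not part of the original tasks): logistic regression maps the linear score
# z = w·x + b to a probability with the sigmoid p = 1 / (1 + exp(-z)); class 1 is predicted
# when p >= 0.5, i.e. when z >= 0. A minimal sketch of that mapping:
def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(f'sigmoid(0) = {_sigmoid(0):.2f}, sigmoid(2) = {_sigmoid(2):.2f}')  # 0.50 and roughly 0.88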
18 | 19 | # Step 1: Generate synthetic data 20 | X, y = make_classification(n_samples=100, n_features=2, n_classes=2, n_clusters_per_class=1, random_state=42) 21 | 22 | # Convert to DataFrame 23 | data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2']) 24 | data['Target'] = y 25 | 26 | # Step 2: Split data 27 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 28 | 29 | # Step 3: Train Logistic Regression model 30 | model = LogisticRegression(random_state=42) 31 | model.fit(X_train, y_train) 32 | 33 | # Step 4: Make predictions 34 | y_pred = model.predict(X_test) 35 | 36 | # Evaluate performance 37 | accuracy = accuracy_score(y_test, y_pred) 38 | print(f'Accuracy: {accuracy:.2f}') 39 | print('Classification Report:') 40 | print(classification_report(y_test, y_pred)) 41 | 42 | # Step 5: Visualize decision boundary 43 | x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1 44 | y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1 45 | xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1)) 46 | Z = model.predict(np.c_[xx.ravel(), yy.ravel()]) 47 | Z = Z.reshape(xx.shape) 48 | 49 | plt.figure(figsize=(10, 6)) 50 | plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm') 51 | plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', label='Training data') 52 | plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, marker='x', label='Testing data') 53 | plt.xlabel('Feature 1') 54 | plt.ylabel('Feature 2') 55 | plt.title('Logistic Regression: Decision Boundary') 56 | plt.legend() 57 | plt.grid(True) 58 | plt.savefig('logistic_regression.png') 59 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/02 Feature Engineering/01 Feature Selection/FeatureSelection.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import load_iris 5 | from sklearn.feature_selection import SelectKBest, f_classif 6 | from sklearn.linear_model import LogisticRegression 7 | from sklearn.metrics import accuracy_score 8 | from sklearn.model_selection import train_test_split 9 | import seaborn as sns 10 | 11 | # Feature Selection 12 | # This script demonstrates feature selection using SelectKBest on the Iris dataset. 13 | 14 | # Tasks: 15 | # 1. Load the Iris dataset. 16 | # 2. Apply SelectKBest to select top features. 17 | # 3. Train a Logistic Regression model on selected features. 18 | # 4. Evaluate model performance (accuracy). 19 | # 5. Visualize feature importance scores. 
20 | 21 | # Step 1: Load data 22 | iris = load_iris() 23 | X = iris.data 24 | y = iris.target 25 | data = pd.DataFrame(X, columns=iris.feature_names) 26 | 27 | # Step 2: Apply feature selection 28 | selector = SelectKBest(score_func=f_classif, k=2) 29 | X_selected = selector.fit_transform(X, y) 30 | selected_features = [iris.feature_names[i] for i in selector.get_support(indices=True)] 31 | feature_scores = selector.scores_ 32 | 33 | # Step 3: Train Logistic Regression 34 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 35 | X_train_selected, X_test_selected = train_test_split(X_selected, test_size=0.2, random_state=42) 36 | 37 | # Full features 38 | model_full = LogisticRegression(random_state=42) 39 | model_full.fit(X_train, y_train) 40 | y_pred_full = model_full.predict(X_test) 41 | acc_full = accuracy_score(y_test, y_pred_full) 42 | 43 | # Selected features 44 | model_selected = LogisticRegression(random_state=42) 45 | model_selected.fit(X_train_selected, y_train) 46 | y_pred_selected = model_selected.predict(X_test_selected) 47 | acc_selected = accuracy_score(y_test, y_pred_selected) 48 | 49 | print(f'Accuracy (Full Features): {acc_full:.2f}') 50 | print(f'Accuracy (Selected Features): {acc_selected:.2f}') 51 | print(f'Selected Features: {selected_features}') 52 | 53 | # Step 4: Visualize feature scores 54 | plt.figure(figsize=(10, 6)) 55 | sns.barplot(x=feature_scores, y=iris.feature_names, palette='viridis') 56 | plt.xlabel('Feature Score (f_classif)') 57 | plt.ylabel('Feature') 58 | plt.title('Feature Importance Scores') 59 | plt.grid(True) 60 | plt.savefig('feature_selection.png') 61 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/02 Feature Engineering/02 Polynomial Features/PolynomialFeatures.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.preprocessing import PolynomialFeatures 5 | from sklearn.linear_model import LinearRegression 6 | from sklearn.metrics import mean_squared_error 7 | from sklearn.model_selection import train_test_split 8 | import seaborn as sns 9 | 10 | # Polynomial Features 11 | # This script demonstrates adding polynomial features to improve regression. 12 | 13 | # Tasks: 14 | # 1. Generate synthetic non-linear data. 15 | # 2. Apply PolynomialFeatures to create polynomial terms. 16 | # 3. Train Linear Regression models with and without polynomial features. 17 | # 4. Evaluate performance (MSE). 18 | # 5. Visualize regression fits. 
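# A quick aside (not part of the original tasks): for a single feature x, PolynomialFeatures(degree=3)
# with its default include_bias=True expands x into [1, x, x^2, x^3]. A minimal check:
_demo = PolynomialFeatures(degree=3).fit_transform(np.array([[2.0]]))
print(f'Polynomial expansion of [2.0]: {_demo.ravel().tolist()}')  # expected [1.0, 2.0, 4.0, 8.0]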
19 | 20 | # Step 1: Generate synthetic data 21 | np.random.seed(42) 22 | X = np.sort(5 * np.random.rand(100, 1), axis=0) 23 | y = np.sin(X).ravel() + np.random.randn(100) * 0.1 24 | 25 | # Step 2: Apply PolynomialFeatures 26 | poly = PolynomialFeatures(degree=3) 27 | X_poly = poly.fit_transform(X) 28 | 29 | # Step 3: Train Linear Regression models 30 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 31 | X_train_poly, X_test_poly = train_test_split(X_poly, test_size=0.2, random_state=42) 32 | 33 | # Linear model 34 | model_linear = LinearRegression() 35 | model_linear.fit(X_train, y_train) 36 | y_pred_linear = model_linear.predict(X_test) 37 | mse_linear = mean_squared_error(y_test, y_pred_linear) 38 | 39 | # Polynomial model 40 | model_poly = LinearRegression() 41 | model_poly.fit(X_train_poly, y_train) 42 | y_pred_poly = model_poly.predict(X_test_poly) 43 | mse_poly = mean_squared_error(y_test, y_pred_poly) 44 | 45 | print(f'MSE (Linear): {mse_linear:.2f}') 46 | print(f'MSE (Polynomial): {mse_poly:.2f}') 47 | 48 | # Step 4: Visualize regression fits 49 | X_plot = np.linspace(X.min(), X.max(), 100).reshape(-1, 1) 50 | y_plot_linear = model_linear.predict(X_plot) 51 | y_plot_poly = model_poly.predict(poly.transform(X_plot)) 52 | 53 | plt.figure(figsize=(10, 6)) 54 | plt.scatter(X, y, color='blue', label='Data') 55 | plt.plot(X_plot, y_plot_linear, color='green', label='Linear Fit') 56 | plt.plot(X_plot, y_plot_poly, color='red', label='Polynomial Fit (degree=3)') 57 | plt.xlabel('X') 58 | plt.ylabel('y') 59 | plt.title('Polynomial Features: Linear vs Polynomial Regression') 60 | plt.legend() 61 | plt.grid(True) 62 | plt.savefig('polynomial_features.png') 63 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/04 Ensemble Methods/02 Boosting/01 AdaBoost/AdaBoost.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import make_classification 5 | from sklearn.ensemble import AdaBoostClassifier 6 | from sklearn.tree import DecisionTreeClassifier 7 | from sklearn.metrics import accuracy_score, classification_report 8 | from sklearn.model_selection import train_test_split 9 | import seaborn as sns 10 | 11 | # AdaBoost 12 | # This script demonstrates AdaBoost classification. 13 | 14 | # Tasks: 15 | # 1. Generate synthetic classification data. 16 | # 2. Split data into training and testing sets. 17 | # 3. Train an AdaBoost Classifier with Decision Trees. 18 | # 4. Make predictions and evaluate performance (accuracy, classification report). 19 | # 5. Visualize decision boundary. 
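# A quick aside (not part of the original tasks): in the textbook binary AdaBoost derivation, each
# weak learner receives a vote alpha = 0.5 * ln((1 - err) / err) and misclassified samples are
# re-weighted upward, so the next stump concentrates on them (scikit-learn's SAMME update follows
# the same idea up to constant factors). A minimal sketch with a hypothetical 20% weighted error:
_err = 0.2
print(f'Vote for a stump with 20% weighted error: {0.5 * np.log((1 - _err) / _err):.2f}')  # ~0.69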
20 | 21 | # Step 1: Generate synthetic data 22 | np.random.seed(42) 23 | X, y = make_classification(n_samples=300, n_features=2, n_classes=2, n_clusters_per_class=1, random_state=42) 24 | 25 | # Convert to DataFrame 26 | data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2']) 27 | data['Target'] = y 28 | 29 | # Step 2: Split data 30 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 31 | 32 | # Step 3: Train AdaBoost Classifier 33 | base_estimator = DecisionTreeClassifier(max_depth=1) # Weak learner 34 | adaboost = AdaBoostClassifier(base_estimator=base_estimator, n_estimators=50, random_state=42) 35 | adaboost.fit(X_train, y_train) 36 | 37 | # Step 4: Make predictions and evaluate 38 | y_pred = adaboost.predict(X_test) 39 | accuracy = accuracy_score(y_test, y_pred) 40 | print(f'Accuracy: {accuracy:.2f}') 41 | print('Classification Report:') 42 | print(classification_report(y_test, y_pred)) 43 | 44 | # Step 5: Visualize decision boundary 45 | x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1 46 | y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1 47 | xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1)) 48 | Z = adaboost.predict(np.c_[xx.ravel(), yy.ravel()]) 49 | Z = Z.reshape(xx.shape) 50 | 51 | plt.figure(figsize=(10, 6)) 52 | plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm') 53 | plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', label='Training Data') 54 | plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, marker='x', label='Testing Data') 55 | plt.xlabel('Feature 1') 56 | plt.ylabel('Feature 2') 57 | plt.title('AdaBoost: Decision Boundary') 58 | plt.legend() 59 | plt.grid(True) 60 | plt.savefig('adaboost.png') 61 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/02 Feature Engineering/04 Binning/Binning.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.preprocessing import KBinsDiscretizer 5 | from sklearn.linear_model import LogisticRegression 6 | from sklearn.metrics import accuracy_score 7 | from sklearn.model_selection import train_test_split 8 | import seaborn as sns 9 | 10 | # Binning 11 | # This script demonstrates binning continuous features. 12 | 13 | # Tasks: 14 | # 1. Generate synthetic data with continuous features. 15 | # 2. Apply KBinsDiscretizer to bin a feature. 16 | # 3. Train a Logistic Regression model with binned features. 17 | # 4. Evaluate performance (accuracy). 18 | # 5. Visualize binned feature distribution. 
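# A quick aside (not part of the original tasks): with strategy='uniform', KBinsDiscretizer cuts the
# observed range into equal-width bins and (with encode='ordinal') replaces each value by its bin
# index. A minimal check on a toy column spanning 0-9 with 3 bins (edges at 0, 3, 6, 9):
_demo = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform').fit_transform(
    np.array([[0.0], [4.0], [5.0], [9.0]]))
print(f'Binned toy values: {_demo.ravel().tolist()}')  # expected [0.0, 1.0, 1.0, 2.0]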
19 | 20 | # Step 1: Generate synthetic data 21 | np.random.seed(42) 22 | data = pd.DataFrame({ 23 | 'Age': np.random.randint(20, 80, 100), 24 | 'Income': np.random.randint(20000, 120000, 100), 25 | 'Target': np.random.choice([0, 1], 100) 26 | }) 27 | 28 | # Step 2: Apply binning 29 | binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform') 30 | data['Age_Binned'] = binner.fit_transform(data[['Age']]) 31 | 32 | # Step 3: Train Logistic Regression 33 | X = data[['Age', 'Income']] 34 | X_binned = data[['Age_Binned', 'Income']] 35 | y = data['Target'] 36 | 37 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 38 | X_train_binned, X_test_binned = train_test_split(X_binned, test_size=0.2, random_state=42) 39 | 40 | # Original features 41 | model_orig = LogisticRegression(random_state=42) 42 | model_orig.fit(X_train, y_train) 43 | y_pred_orig = model_orig.predict(X_test) 44 | acc_orig = accuracy_score(y_test, y_pred_orig) 45 | 46 | # Binned features 47 | model_binned = LogisticRegression(random_state=42) 48 | model_binned.fit(X_train_binned, y_train) 49 | y_pred_binned = model_binned.predict(X_test_binned) 50 | acc_binned = accuracy_score(y_test, y_pred_binned) 51 | 52 | print(f'Accuracy (Original): {acc_orig:.2f}') 53 | print(f'Accuracy (Binned): {acc_binned:.2f}') 54 | 55 | # Step 4: Visualize binned feature distribution 56 | plt.figure(figsize=(10, 6)) 57 | sns.histplot(data=data, x='Age', color='blue', alpha=0.5, label='Original Age', bins=20) 58 | sns.histplot(data=data, x='Age_Binned', color='red', alpha=0.5, label='Binned Age', bins=5) 59 | plt.xlabel('Age / Binned Age') 60 | plt.ylabel('Count') 61 | plt.title('Binning: Original vs Binned Age Distribution') 62 | plt.legend() 63 | plt.grid(True) 64 | plt.savefig('binning.png') 65 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/01 Supervised Learning/02 Classification/06 Support Vector Machines (SVM)/SVM.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import make_classification 5 | from sklearn.svm import SVC 6 | from sklearn.metrics import accuracy_score, classification_report 7 | from sklearn.model_selection import train_test_split 8 | 9 | # Support Vector Machines (SVM) 10 | # This script demonstrates SVM classification with a linear kernel. 11 | 12 | # Tasks: 13 | # 1. Generate synthetic classification data. 14 | # 2. Split data into training and testing sets. 15 | # 3. Train an SVM Classifier. 16 | # 4. Make predictions and evaluate performance (accuracy, classification report). 17 | # 5. Visualize decision boundary and support vectors. 
18 | 19 | # Step 1: Generate synthetic data 20 | X, y = make_classification(n_samples=100, n_features=2, n_classes=2, n_clusters_per_class=1, random_state=42) 21 | 22 | # Convert to DataFrame 23 | data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2']) 24 | data['Target'] = y 25 | 26 | # Step 2: Split data 27 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 28 | 29 | # Step 3: Train SVM model 30 | model = SVC(kernel='linear', random_state=42) 31 | model.fit(X_train, y_train) 32 | 33 | # Step 4: Make predictions 34 | y_pred = model.predict(X_test) 35 | 36 | # Evaluate performance 37 | accuracy = accuracy_score(y_test, y_pred) 38 | print(f'Accuracy: {accuracy:.2f}') 39 | print('Classification Report:') 40 | print(classification_report(y_test, y_pred)) 41 | 42 | # Step 5: Visualize decision boundary and support vectors 43 | x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1 44 | y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1 45 | xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1)) 46 | Z = model.predict(np.c_[xx.ravel(), yy.ravel()]) 47 | Z = Z.reshape(xx.shape) 48 | 49 | plt.figure(figsize=(10, 6)) 50 | plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm') 51 | plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', label='Training data') 52 | plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, marker='x', label='Testing data') 53 | # Highlight support vectors 54 | plt.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1], s=100, facecolors='none', edgecolors='k', label='Support Vectors') 55 | plt.xlabel('Feature 1') 56 | plt.ylabel('Feature 2') 57 | plt.title('SVM: Decision Boundary with Support Vectors') 58 | plt.legend() 59 | plt.grid(True) 60 | plt.savefig('svm.png') 61 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/04 Ensemble Methods/01 Bagging/01 Bootstrap Aggregating/BootstrapAggregating.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import make_classification 5 | from sklearn.ensemble import BaggingClassifier 6 | from sklearn.tree import DecisionTreeClassifier 7 | from sklearn.metrics import accuracy_score, classification_report 8 | from sklearn.model_selection import train_test_split 9 | import seaborn as sns 10 | 11 | # Bootstrap Aggregating (Bagging) 12 | # This script demonstrates Bagging using Decision Trees as base estimators. 13 | 14 | # Tasks: 15 | # 1. Generate synthetic classification data. 16 | # 2. Split data into training and testing sets. 17 | # 3. Train a Bagging Classifier with Decision Trees. 18 | # 4. Make predictions and evaluate performance (accuracy, classification report). 19 | # 5. Visualize decision boundary. 
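# A quick aside (not part of the original tasks): a bootstrap sample draws n rows from the training
# set with replacement, so some rows repeat and roughly a third are left out of any single sample
# (the "out-of-bag" rows). A minimal sketch of one such draw over 10 row indices:
_rng = np.random.default_rng(0)
_boot_idx = _rng.integers(0, 10, size=10)  # indices of one bootstrap sample
print(f'Bootstrap indices: {_boot_idx.tolist()} ({len(set(_boot_idx.tolist()))} unique rows used)')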
20 | 21 | # Step 1: Generate synthetic data 22 | np.random.seed(42) 23 | X, y = make_classification(n_samples=300, n_features=2, n_classes=2, n_clusters_per_class=1, random_state=42) 24 | 25 | # Convert to DataFrame 26 | data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2']) 27 | data['Target'] = y 28 | 29 | # Step 2: Split data 30 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 31 | 32 | # Step 3: Train Bagging Classifier 33 | base_estimator = DecisionTreeClassifier(max_depth=3) 34 | bagging = BaggingClassifier(base_estimator=base_estimator, n_estimators=50, random_state=42) 35 | bagging.fit(X_train, y_train) 36 | 37 | # Step 4: Make predictions and evaluate 38 | y_pred = bagging.predict(X_test) 39 | accuracy = accuracy_score(y_test, y_pred) 40 | print(f'Accuracy: {accuracy:.2f}') 41 | print('Classification Report:') 42 | print(classification_report(y_test, y_pred)) 43 | 44 | # Step 5: Visualize decision boundary 45 | x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1 46 | y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1 47 | xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1)) 48 | Z = bagging.predict(np.c_[xx.ravel(), yy.ravel()]) 49 | Z = Z.reshape(xx.shape) 50 | 51 | plt.figure(figsize=(10, 6)) 52 | plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm') 53 | plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', label='Training Data') 54 | plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, marker='x', label='Testing Data') 55 | plt.xlabel('Feature 1') 56 | plt.ylabel('Feature 2') 57 | plt.title('Bagging Classifier: Decision Boundary') 58 | plt.legend() 59 | plt.grid(True) 60 | plt.savefig('bagging_classifier.png') 61 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/02 Feature Engineering/03 Interaction Terms/InteractionTerms.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.preprocessing import PolynomialFeatures 5 | from sklearn.linear_model import LinearRegression 6 | from sklearn.metrics import mean_squared_error 7 | from sklearn.model_selection import train_test_split 8 | import seaborn as sns 9 | 10 | # Interaction Terms 11 | # This script demonstrates adding interaction terms to capture feature interactions. 12 | 13 | # Tasks: 14 | # 1. Generate synthetic data with interacting features. 15 | # 2. Apply PolynomialFeatures to include interaction terms. 16 | # 3. Train Linear Regression models with and without interaction terms. 17 | # 4. Evaluate performance (MSE). 18 | # 5. Visualize model predictions. 
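# A quick aside (not part of the original tasks): with interaction_only=True and include_bias=False,
# PolynomialFeatures turns [a, b] into [a, b, a*b], adding only the cross term (no squared terms).
_demo = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit_transform(
    np.array([[2.0, 3.0]]))
print(f'Interaction expansion of [2.0, 3.0]: {_demo.ravel().tolist()}')  # expected [2.0, 3.0, 6.0]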
19 | 20 | # Step 1: Generate synthetic data 21 | np.random.seed(42) 22 | X = np.random.rand(100, 2) 23 | y = 2 * X[:, 0] + 3 * X[:, 1] + 5 * X[:, 0] * X[:, 1] + np.random.randn(100) * 0.1 24 | 25 | data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2']) 26 | data['Target'] = y 27 | 28 | # Step 2: Apply PolynomialFeatures for interaction terms 29 | poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False) 30 | X_interaction = poly.fit_transform(X) 31 | 32 | # Step 3: Train Linear Regression models 33 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 34 | X_train_inter, X_test_inter = train_test_split(X_interaction, test_size=0.2, random_state=42) 35 | 36 | # Linear model (no interactions) 37 | model_linear = LinearRegression() 38 | model_linear.fit(X_train, y_train) 39 | y_pred_linear = model_linear.predict(X_test) 40 | mse_linear = mean_squared_error(y_test, y_pred_linear) 41 | 42 | # Interaction model 43 | model_inter = LinearRegression() 44 | model_inter.fit(X_train_inter, y_train) 45 | y_pred_inter = model_inter.predict(X_test_inter) 46 | mse_inter = mean_squared_error(y_test, y_pred_inter) 47 | 48 | print(f'MSE (Linear): {mse_linear:.2f}') 49 | print(f'MSE (Interaction Terms): {mse_inter:.2f}') 50 | 51 | # Step 4: Visualize actual vs predicted 52 | plt.figure(figsize=(10, 6)) 53 | plt.scatter(y_test, y_pred_linear, color='green', label='Linear Predictions') 54 | plt.scatter(y_test, y_pred_inter, color='red', label='Interaction Predictions') 55 | plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'b--', label='Perfect Prediction') 56 | plt.xlabel('Actual Values') 57 | plt.ylabel('Predicted Values') 58 | plt.title('Interaction Terms: Actual vs Predicted') 59 | plt.legend() 60 | plt.grid(True) 61 | plt.savefig('interaction_terms.png') 62 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/05 Deployment/02 API Integration/APIIntegration.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from sklearn.datasets import load_iris 4 | from sklearn.linear_model import LogisticRegression 5 | from sklearn.model_selection import train_test_split 6 | import joblib 7 | from flask import Flask, request, jsonify 8 | import matplotlib.pyplot as plt 9 | import seaborn as sns 10 | 11 | # API Integration 12 | # This script demonstrates creating a Flask API for a trained model. 13 | 14 | # Tasks: 15 | # 1. Load the Iris dataset and train a Logistic Regression model. 16 | # 2. Save the model using joblib. 17 | # 3. Create a Flask API to serve predictions. 18 | # 4. Test the API with sample data. 19 | # 5. Visualize API predictions. 
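# A hedged usage sketch (aside, not part of the original tasks): once the Flask app defined below
# is running locally, a client could call it roughly like this. It is left commented out so the
# script still runs without a live server:
#
#   import requests
#   resp = requests.post('http://localhost:5000/predict', json={'features': [5.1, 3.5, 1.4, 0.2]})
#   print(resp.json())  # e.g. {'prediction': 0, 'class': 'setosa'}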
20 | 21 | # Step 1: Load data and train model 22 | iris = load_iris() 23 | X = iris.data 24 | y = iris.target 25 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 26 | model = LogisticRegression(random_state=42) 27 | model.fit(X_train, y_train) 28 | 29 | # Step 2: Save the model 30 | joblib.dump(model, 'logistic_model_api.pkl') 31 | 32 | # Step 3: Create Flask API 33 | app = Flask(__name__) 34 | model = joblib.load('logistic_model_api.pkl') 35 | 36 | @app.route('/predict', methods=['POST']) 37 | def predict(): 38 | data = request.get_json(force=True) 39 | features = np.array(data['features']).reshape(1, -1) 40 | prediction = model.predict(features) 41 | return jsonify({'prediction': int(prediction[0]), 'class': iris.target_names[prediction[0]]}) 42 | 43 | # Note: The Flask app would be run with app.run(debug=True) in a real environment. 44 | # For demonstration, we'll simulate API testing. 45 | 46 | # Step 4: Test the API (simulated) 47 | sample_data = X_test[:5] 48 | predictions = model.predict(sample_data) 49 | class_names = [iris.target_names[pred] for pred in predictions] 50 | 51 | print('Sample Predictions:') 52 | for i, (features, pred, name) in enumerate(zip(sample_data, predictions, class_names)): 53 | print(f'Sample {i+1}: Features={features.tolist()}, Prediction={pred}, Class={name}') 54 | 55 | # Step 5: Visualize predictions 56 | plt.figure(figsize=(10, 6)) 57 | sns.scatterplot(x=X_test[:5, 0], y=X_test[:5, 1], hue=class_names, style=iris.target_names[y_test[:5]], s=100) 58 | plt.xlabel(iris.feature_names[0]) 59 | plt.ylabel(iris.feature_names[1]) 60 | plt.title('API Integration: Simulated API Predictions') 61 | plt.grid(True) 62 | plt.savefig('api_integration.png') 63 | plt.close() 64 | 65 | # To run the Flask API, use: app.run(debug=True) 66 | # Then send POST requests to http://localhost:5000/predict with JSON like {"features": [5.1, 3.5, 1.4, 0.2]} -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/04 Model Evaluation/02 Overfitting/Overfitting.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.preprocessing import PolynomialFeatures 5 | from sklearn.linear_model import LinearRegression 6 | from sklearn.metrics import mean_squared_error 7 | from sklearn.model_selection import train_test_split 8 | 9 | # Overfitting 10 | # This script demonstrates overfitting using a high-degree polynomial regression. 11 | 12 | # Tasks: 13 | # 1. Generate synthetic non-linear data. 14 | # 2. Train a high-degree polynomial regression model. 15 | # 3. Compare with a simpler model. 16 | # 4. Evaluate training and testing errors. 17 | # 5. Visualize overfitting. 
18 | 19 | # Step 1: Generate synthetic data 20 | np.random.seed(42) 21 | X = np.sort(5 * np.random.rand(100, 1), axis=0) 22 | y = np.sin(X).ravel() + np.random.randn(100) * 0.1 23 | 24 | # Step 2: Split data 25 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 26 | 27 | # Step 3: Train models 28 | # Simple model (degree 3) 29 | poly_simple = PolynomialFeatures(degree=3) 30 | X_train_simple = poly_simple.fit_transform(X_train) 31 | X_test_simple = poly_simple.transform(X_test) 32 | model_simple = LinearRegression() 33 | model_simple.fit(X_train_simple, y_train) 34 | y_pred_simple = model_simple.predict(X_test_simple) 35 | mse_simple = mean_squared_error(y_test, y_pred_simple) 36 | 37 | # Overfit model (degree 15) 38 | poly_overfit = PolynomialFeatures(degree=15) 39 | X_train_overfit = poly_overfit.fit_transform(X_train) 40 | X_test_overfit = poly_overfit.transform(X_test) 41 | model_overfit = LinearRegression() 42 | model_overfit.fit(X_train_overfit, y_train) 43 | y_pred_overfit = model_overfit.predict(X_test_overfit) 44 | mse_overfit = mean_squared_error(y_test, y_pred_overfit) 45 | 46 | print(f'MSE (Simple, degree=3): {mse_simple:.2f}') 47 | print(f'MSE (Overfit, degree=15): {mse_overfit:.2f}') 48 | 49 | # Step 4: Visualize overfitting 50 | X_plot = np.linspace(X.min(), X.max(), 100).reshape(-1, 1) 51 | y_plot_simple = model_simple.predict(poly_simple.transform(X_plot)) 52 | y_plot_overfit = model_overfit.predict(poly_overfit.transform(X_plot)) 53 | 54 | plt.figure(figsize=(10, 6)) 55 | plt.scatter(X_train, y_train, color='blue', label='Training Data') 56 | plt.scatter(X_test, y_test, color='green', label='Testing Data') 57 | plt.plot(X_plot, y_plot_simple, color='red', label='Simple Model (degree=3)') 58 | plt.plot(X_plot, y_plot_overfit, color='purple', label='Overfit Model (degree=15)') 59 | plt.xlabel('X') 60 | plt.ylabel('y') 61 | plt.title('Overfitting: Simple vs Overfit Model') 62 | plt.legend() 63 | plt.grid(True) 64 | plt.savefig('overfitting.png') 65 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/01 Supervised Learning/02 Classification/03 Random Forest/RandomForest.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import make_classification 5 | from sklearn.ensemble import RandomForestClassifier 6 | from sklearn.metrics import accuracy_score, classification_report 7 | from sklearn.model_selection import train_test_split 8 | import seaborn as sns 9 | 10 | # Random Forest 11 | # This script demonstrates Random Forest classification. 12 | 13 | # Tasks: 14 | # 1. Generate synthetic classification data. 15 | # 2. Split data into training and testing sets. 16 | # 3. Train a Random Forest Classifier. 17 | # 4. Make predictions and evaluate performance (accuracy, classification report). 18 | # 5. Visualize decision boundary and feature importance. 
19 | 20 | # Step 1: Generate synthetic data 21 | np.random.seed(42) 22 | X, y = make_classification(n_samples=300, n_features=2, n_classes=2, n_clusters_per_class=1, random_state=42) 23 | 24 | # Convert to DataFrame 25 | data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2']) 26 | data['Target'] = y 27 | 28 | # Step 2: Split data 29 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 30 | 31 | # Step 3: Train Random Forest Classifier 32 | rf = RandomForestClassifier(n_estimators=100, random_state=42) 33 | rf.fit(X_train, y_train) 34 | 35 | # Step 4: Make predictions and evaluate 36 | y_pred = rf.predict(X_test) 37 | accuracy = accuracy_score(y_test, y_pred) 38 | print(f'Accuracy: {accuracy:.2f}') 39 | print('Classification Report:') 40 | print(classification_report(y_test, y_pred)) 41 | 42 | # Step 5: Visualize decision boundary and feature importance 43 | plt.figure(figsize=(12, 5)) 44 | 45 | # Decision boundary 46 | plt.subplot(1, 2, 1) 47 | x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1 48 | y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1 49 | xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1)) 50 | Z = rf.predict(np.c_[xx.ravel(), yy.ravel()]) 51 | Z = Z.reshape(xx.shape) 52 | plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm') 53 | plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', label='Training Data') 54 | plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, marker='x', label='Testing Data') 55 | plt.xlabel('Feature 1') 56 | plt.ylabel('Feature 2') 57 | plt.title('Random Forest: Decision Boundary') 58 | plt.legend() 59 | 60 | # Feature importance 61 | plt.subplot(1, 2, 2) 62 | sns.barplot(x=rf.feature_importances_, y=['Feature_1', 'Feature_2'], palette='viridis') 63 | plt.xlabel('Feature Importance') 64 | plt.title('Random Forest: Feature Importance') 65 | plt.tight_layout() 66 | plt.savefig('random_forest.png') 67 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/04 Model Evaluation/03 Underfitting/Underfitting.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.preprocessing import PolynomialFeatures 5 | from sklearn.linear_model import LinearRegression 6 | from sklearn.metrics import mean_squared_error 7 | from sklearn.model_selection import train_test_split 8 | 9 | # Underfitting 10 | # This script demonstrates underfitting using a low-degree polynomial regression. 11 | 12 | # Tasks: 13 | # 1. Generate synthetic non-linear data. 14 | # 2. Train a low-degree polynomial regression model. 15 | # 3. Compare with a better-fitting model. 16 | # 4. Evaluate training and testing errors. 17 | # 5. Visualize underfitting. 
18 | 19 | # Step 1: Generate synthetic data 20 | np.random.seed(42) 21 | X = np.sort(5 * np.random.rand(100, 1), axis=0) 22 | y = np.sin(X).ravel() + np.random.randn(100) * 0.1 23 | 24 | # Step 2: Split data 25 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 26 | 27 | # Step 3: Train models 28 | # Underfit model (degree 1) 29 | poly_underfit = PolynomialFeatures(degree=1) 30 | X_train_underfit = poly_underfit.fit_transform(X_train) 31 | X_test_underfit = poly_underfit.transform(X_test) 32 | model_underfit = LinearRegression() 33 | model_underfit.fit(X_train_underfit, y_train) 34 | y_pred_underfit = model_underfit.predict(X_test_underfit) 35 | mse_underfit = mean_squared_error(y_test, y_pred_underfit) 36 | 37 | # Better model (degree 3) 38 | poly_better = PolynomialFeatures(degree=3) 39 | X_train_better = poly_better.fit_transform(X_train) 40 | X_test_better = poly_better.transform(X_test) 41 | model_better = LinearRegression() 42 | model_better.fit(X_train_better, y_train) 43 | y_pred_better = model_better.predict(X_test_better) 44 | mse_better = mean_squared_error(y_test, y_pred_better) 45 | 46 | print(f'MSE (Underfit, degree=1): {mse_underfit:.2f}') 47 | print(f'MSE (Better, degree=3): {mse_better:.2f}') 48 | 49 | # Step 4: Visualize underfitting 50 | X_plot = np.linspace(X.min(), X.max(), 100).reshape(-1, 1) 51 | y_plot_underfit = model_underfit.predict(poly_underfit.transform(X_plot)) 52 | y_plot_better = model_better.predict(poly_better.transform(X_plot)) 53 | 54 | plt.figure(figsize=(10, 6)) 55 | plt.scatter(X_train, y_train, color='blue', label='Training Data') 56 | plt.scatter(X_test, y_test, color='green', label='Testing Data') 57 | plt.plot(X_plot, y_plot_underfit, color='red', label='Underfit Model (degree=1)') 58 | plt.plot(X_plot, y_plot_better, color='purple', label='Better Model (degree=3)') 59 | plt.xlabel('X') 60 | plt.ylabel('y') 61 | plt.title('Underfitting: Underfit vs Better Model') 62 | plt.legend() 63 | plt.grid(True) 64 | plt.savefig('underfitting.png') 65 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/04 Ensemble Methods/02 Boosting/02 Gradient Boosting/GradientBoosting.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import make_classification 5 | from sklearn.ensemble import GradientBoostingClassifier 6 | from sklearn.metrics import accuracy_score, classification_report 7 | from sklearn.model_selection import train_test_split 8 | import seaborn as sns 9 | 10 | # Gradient Boosting 11 | # This script demonstrates Gradient Boosting classification. 12 | 13 | # Tasks: 14 | # 1. Generate synthetic classification data. 15 | # 2. Split data into training and testing sets. 16 | # 3. Train a Gradient Boosting Classifier. 17 | # 4. Make predictions and evaluate performance (accuracy, classification report). 18 | # 5. Visualize decision boundary and feature importance. 
19 | 20 | # Step 1: Generate synthetic data 21 | np.random.seed(42) 22 | X, y = make_classification(n_samples=300, n_features=2, n_classes=2, n_clusters_per_class=1, random_state=42) 23 | 24 | # Convert to DataFrame 25 | data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2']) 26 | data['Target'] = y 27 | 28 | # Step 2: Split data 29 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 30 | 31 | # Step 3: Train Gradient Boosting Classifier 32 | gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42) 33 | gb.fit(X_train, y_train) 34 | 35 | # Step 4: Make predictions and evaluate 36 | y_pred = gb.predict(X_test) 37 | accuracy = accuracy_score(y_test, y_pred) 38 | print(f'Accuracy: {accuracy:.2f}') 39 | print('Classification Report:') 40 | print(classification_report(y_test, y_pred)) 41 | 42 | # Step 5: Visualize decision boundary and feature importance 43 | plt.figure(figsize=(12, 5)) 44 | 45 | # Decision boundary 46 | plt.subplot(1, 2, 1) 47 | x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1 48 | y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1 49 | xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1)) 50 | Z = gb.predict(np.c_[xx.ravel(), yy.ravel()]) 51 | Z = Z.reshape(xx.shape) 52 | plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm') 53 | plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', label='Training Data') 54 | plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, marker='x', label='Testing Data') 55 | plt.xlabel('Feature 1') 56 | plt.ylabel('Feature 2') 57 | plt.title('Gradient Boosting: Decision Boundary') 58 | plt.legend() 59 | 60 | # Feature importance 61 | plt.subplot(1, 2, 2) 62 | sns.barplot(x=gb.feature_importances_, y=['Feature_1', 'Feature_2'], palette='viridis') 63 | plt.xlabel('Feature Importance') 64 | plt.title('Gradient Boosting: Feature Importance') 65 | plt.tight_layout() 66 | plt.savefig('gradient_boosting.png') 67 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/01 Supervised Learning/03 Evaluation Metrics/02 Classification Metrics/ClassificationMetrics.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import make_classification 5 | from sklearn.linear_model import LogisticRegression 6 | from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc 7 | from sklearn.model_selection import train_test_split 8 | import seaborn as sns 9 | 10 | # Classification Evaluation Metrics 11 | # This script demonstrates Accuracy, Precision, Recall, F1 Score, Confusion Matrix, ROC Curve, and AUC. 12 | 13 | # Tasks: 14 | # 1. Generate synthetic classification data. 15 | # 2. Split data into training and testing sets. 16 | # 3. Train a Logistic Regression model. 17 | # 4. Calculate classification metrics. 18 | # 5. Visualize confusion matrix and ROC curve. 
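# A quick reference (aside, not part of the original tasks). With scikit-learn's binary confusion
# matrix laid out as [[TN, FP], [FN, TP]], the metrics computed below reduce to:
#   Accuracy  = (TP + TN) / (TP + TN + FP + FN)
#   Precision = TP / (TP + FP)
#   Recall    = TP / (TP + FN)
#   F1        = 2 * Precision * Recall / (Precision + Recall)
# AUC is the area under the ROC curve, i.e. the probability that a randomly chosen positive sample
# is ranked above a randomly chosen negative one.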
19 | 20 | # Step 1: Generate synthetic data 21 | X, y = make_classification(n_samples=100, n_features=2, n_classes=2, n_clusters_per_class=1, random_state=42) 22 | 23 | # Convert to DataFrame 24 | data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2']) 25 | data['Target'] = y 26 | 27 | # Step 2: Split data 28 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 29 | 30 | # Step 3: Train Logistic Regression model 31 | model = LogisticRegression(random_state=42) 32 | model.fit(X_train, y_train) 33 | 34 | # Step 4: Make predictions and calculate metrics 35 | y_pred = model.predict(X_test) 36 | y_proba = model.predict_proba(X_test)[:, 1] # Probability for positive class 37 | 38 | accuracy = accuracy_score(y_test, y_pred) 39 | precision = precision_score(y_test, y_pred) 40 | recall = recall_score(y_test, y_pred) 41 | f1 = f1_score(y_test, y_pred) 42 | cm = confusion_matrix(y_test, y_pred) 43 | fpr, tpr, _ = roc_curve(y_test, y_proba) 44 | roc_auc = auc(fpr, tpr) 45 | 46 | print(f'Accuracy: {accuracy:.2f}') 47 | print(f'Precision: {precision:.2f}') 48 | print(f'Recall: {recall:.2f}') 49 | print(f'F1 Score: {f1:.2f}') 50 | print('Confusion Matrix:') 51 | print(cm) 52 | print(f'AUC Score: {roc_auc:.2f}') 53 | 54 | # Step 5: Visualize confusion matrix and ROC curve 55 | plt.figure(figsize=(12, 5)) 56 | 57 | # Confusion Matrix 58 | plt.subplot(1, 2, 1) 59 | sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False) 60 | plt.xlabel('Predicted') 61 | plt.ylabel('Actual') 62 | plt.title('Confusion Matrix') 63 | 64 | # ROC Curve 65 | plt.subplot(1, 2, 2) 66 | plt.plot(fpr, tpr, color='blue', label=f'ROC Curve (AUC = {roc_auc:.2f})') 67 | plt.plot([0, 1], [0, 1], color='red', linestyle='--', label='Random') 68 | plt.xlabel('False Positive Rate') 69 | plt.ylabel('True Positive Rate') 70 | plt.title('ROC Curve') 71 | plt.legend() 72 | plt.grid(True) 73 | 74 | plt.tight_layout() 75 | plt.savefig('classification_metrics.png') 76 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/01 Data Preprocessing/03 Standardization/Standardization.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import load_iris 5 | from sklearn.preprocessing import StandardScaler 6 | from sklearn.linear_model import LogisticRegression 7 | from sklearn.metrics import accuracy_score 8 | from sklearn.model_selection import train_test_split 9 | import seaborn as sns 10 | 11 | # Standardization Demonstration 12 | # This script focuses on Standardization techniques using StandardScaler on the Iris dataset. 13 | 14 | # Tasks: 15 | # 1. Load and explore the Iris dataset. 16 | # 2. Apply StandardScaler for standardization (zero mean, unit variance). 17 | # 3. Train a Logistic Regression model on standardized data. 18 | # 4. Compare performance with raw data. 19 | # 5. Visualize the effect of standardization. 
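# A quick aside (not part of the original tasks): StandardScaler applies z = (x - mean) / std per
# feature, using statistics learned from the data it is fit on. A minimal check on a toy column:
_toy = np.array([[1.0], [2.0], [3.0]])
print(StandardScaler().fit_transform(_toy).ravel())  # roughly [-1.22, 0.0, 1.22]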
20 | 21 | # Step 1: Load data 22 | iris = load_iris() 23 | X = iris.data 24 | y = iris.target 25 | data = pd.DataFrame(X, columns=iris.feature_names) 26 | 27 | # Explore raw data 28 | print("Raw Data Statistics:") 29 | print(data.describe()) 30 | 31 | # Step 2: Apply Standardization 32 | standard_scaler = StandardScaler() 33 | X_standardized = standard_scaler.fit_transform(X) 34 | 35 | # Step 3: Train Logistic Regression 36 | X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 37 | X_train_std, X_test_std = train_test_split(X_standardized, test_size=0.2, random_state=42) 38 | 39 | # Raw data model 40 | model_raw = LogisticRegression(random_state=42, max_iter=200) 41 | model_raw.fit(X_train_raw, y_train) 42 | y_pred_raw = model_raw.predict(X_test_raw) 43 | acc_raw = accuracy_score(y_test, y_pred_raw) 44 | 45 | # Standardized data model 46 | model_std = LogisticRegression(random_state=42, max_iter=200) 47 | model_std.fit(X_train_std, y_train) 48 | y_pred_std = model_std.predict(X_test_std) 49 | acc_std = accuracy_score(y_test, y_pred_std) 50 | 51 | # Step 4: Print results 52 | print(f'\nAccuracy (Raw Data): {acc_raw:.2f}') 53 | print(f'Accuracy (Standardized): {acc_std:.2f}') 54 | 55 | # Step 5: Visualize 56 | plt.figure(figsize=(10, 6)) 57 | 58 | # Raw data 59 | plt.subplot(2, 1, 1) 60 | sns.boxplot(data=data) 61 | plt.title('Raw Features') 62 | plt.xticks(rotation=45) 63 | 64 | # Standardized data 65 | plt.subplot(2, 1, 2) 66 | sns.boxplot(data=pd.DataFrame(X_standardized, columns=iris.feature_names)) 67 | plt.title('Standardized Features (StandardScaler)') 68 | plt.xticks(rotation=45) 69 | 70 | plt.tight_layout() 71 | plt.savefig('standardization_effect.png') 72 | plt.close() 73 | 74 | # Additional: Check mean and variance after standardization 75 | standardized_data = pd.DataFrame(X_standardized, columns=iris.feature_names) 76 | print("\nStandardized Data Statistics (Mean ~ 0, Std ~ 1):") 77 | print(standardized_data.describe()) 78 | 79 | print("\nStandardization complete. Check 'standardization_effect.png' for visualization.") -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/01 Data Preprocessing/01 Feature Scaling/FeatureScaling.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import load_iris 5 | from sklearn.preprocessing import MinMaxScaler, StandardScaler 6 | from sklearn.linear_model import LogisticRegression 7 | from sklearn.metrics import accuracy_score 8 | from sklearn.model_selection import train_test_split 9 | import seaborn as sns 10 | 11 | # Feature Scaling 12 | # This script demonstrates Normalization and Standardization on the Iris dataset. 13 | 14 | # Tasks: 15 | # 1. Load the Iris dataset. 16 | # 2. Apply Normalization (MinMaxScaler) and Standardization (StandardScaler). 17 | # 3. Train a Logistic Regression model on scaled data. 18 | # 4. Compare model performance (accuracy). 19 | # 5. Visualize feature distributions before and after scaling. 
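# A quick aside (not part of the original tasks): MinMaxScaler maps each feature to
# x' = (x - min) / (max - min), so the column minimum becomes 0 and the maximum becomes 1, while
# StandardScaler centers each feature to zero mean and unit variance. A minimal MinMaxScaler check:
_toy = np.array([[10.0], [15.0], [20.0]])
print(MinMaxScaler().fit_transform(_toy).ravel())  # expected [0.0, 0.5, 1.0]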
20 | 21 | # Step 1: Load data 22 | iris = load_iris() 23 | X = iris.data 24 | y = iris.target 25 | data = pd.DataFrame(X, columns=iris.feature_names) 26 | 27 | # Step 2: Apply scaling 28 | # Normalization (MinMaxScaler) 29 | minmax_scaler = MinMaxScaler() 30 | X_normalized = minmax_scaler.fit_transform(X) 31 | 32 | # Standardization (StandardScaler) 33 | standard_scaler = StandardScaler() 34 | X_standardized = standard_scaler.fit_transform(X) 35 | 36 | # Step 3: Train Logistic Regression on each dataset 37 | X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 38 | X_train_norm, X_test_norm = train_test_split(X_normalized, test_size=0.2, random_state=42) 39 | X_train_std, X_test_std = train_test_split(X_standardized, test_size=0.2, random_state=42) 40 | 41 | # Raw data 42 | model_raw = LogisticRegression(random_state=42) 43 | model_raw.fit(X_train_raw, y_train) 44 | y_pred_raw = model_raw.predict(X_test_raw) 45 | acc_raw = accuracy_score(y_test, y_pred_raw) 46 | 47 | # Normalized data 48 | model_norm = LogisticRegression(random_state=42) 49 | model_norm.fit(X_train_norm, y_train) 50 | y_pred_norm = model_norm.predict(X_test_norm) 51 | acc_norm = accuracy_score(y_test, y_pred_norm) 52 | 53 | # Standardized data 54 | model_std = LogisticRegression(random_state=42) 55 | model_std.fit(X_train_std, y_train) 56 | y_pred_std = model_std.predict(X_test_std) 57 | acc_std = accuracy_score(y_test, y_pred_std) 58 | 59 | print(f'Accuracy (Raw): {acc_raw:.2f}') 60 | print(f'Accuracy (Normalized): {acc_norm:.2f}') 61 | print(f'Accuracy (Standardized): {acc_std:.2f}') 62 | 63 | # Step 4: Visualize feature distributions 64 | plt.figure(figsize=(12, 8)) 65 | 66 | # Raw data 67 | plt.subplot(3, 1, 1) 68 | sns.boxplot(data=data) 69 | plt.title('Raw Features') 70 | plt.xticks(rotation=45) 71 | 72 | # Normalized data 73 | plt.subplot(3, 1, 2) 74 | sns.boxplot(data=pd.DataFrame(X_normalized, columns=iris.feature_names)) 75 | plt.title('Normalized Features (MinMaxScaler)') 76 | plt.xticks(rotation=45) 77 | 78 | # Standardized data 79 | plt.subplot(3, 1, 3) 80 | sns.boxplot(data=pd.DataFrame(X_standardized, columns=iris.feature_names)) 81 | plt.title('Standardized Features (StandardScaler)') 82 | plt.xticks(rotation=45) 83 | 84 | plt.tight_layout() 85 | plt.savefig('feature_scaling.png') 86 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/01 Data Preprocessing/06 Encoding Categorical Variables/EncodingCategoricalVariables.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import matplotlib.pyplot as plt 3 | from sklearn.datasets import load_iris 4 | from sklearn.preprocessing import LabelEncoder, OneHotEncoder 5 | from sklearn.linear_model import LogisticRegression 6 | from sklearn.metrics import accuracy_score 7 | from sklearn.model_selection import train_test_split 8 | import seaborn as sns 9 | 10 | # Encoding Categorical Variables 11 | # This script demonstrates Label Encoding and One-Hot Encoding. 12 | 13 | # Tasks: 14 | # 1. Create a synthetic dataset with categorical variables. 15 | # 2. Apply Label Encoding and One-Hot Encoding. 16 | # 3. Train a Logistic Regression model on encoded data. 17 | # 4. Compare model performance (accuracy). 18 | # 5. Visualize the effect of encoding. 
19 | import numpy as np  # numpy is needed for the synthetic data below 20 | # Step 1: Create synthetic dataset 21 | np.random.seed(42) 22 | data = pd.DataFrame({ 23 | 'Age': np.random.randint(20, 60, 100), 24 | 'Income': np.random.randint(30000, 100000, 100), 25 | 'Category': np.random.choice(['Low', 'Medium', 'High'], 100), 26 | 'Target': np.random.choice([0, 1], 100) 27 | }) 28 | 29 | # Step 2: Apply encoding 30 | # Label Encoding 31 | label_encoder = LabelEncoder() 32 | data['Category_Label'] = label_encoder.fit_transform(data['Category']) 33 | 34 | # One-Hot Encoding 35 | one_hot_encoder = OneHotEncoder(sparse=False) 36 | category_ohe = one_hot_encoder.fit_transform(data[['Category']]) 37 | category_ohe_df = pd.DataFrame(category_ohe, columns=one_hot_encoder.get_feature_names_out(['Category'])) 38 | 39 | data_encoded = pd.concat([data[['Age', 'Income']], category_ohe_df, data['Target']], axis=1) 40 | 41 | # Step 3: Train Logistic Regression 42 | X_label = data[['Age', 'Income', 'Category_Label']] 43 | X_ohe = data_encoded.drop('Target', axis=1) 44 | y = data['Target'] 45 | 46 | X_train_label, X_test_label, y_train, y_test = train_test_split(X_label, y, test_size=0.2, random_state=42) 47 | X_train_ohe, X_test_ohe = train_test_split(X_ohe, test_size=0.2, random_state=42) 48 | 49 | # Label encoded model 50 | model_label = LogisticRegression(random_state=42) 51 | model_label.fit(X_train_label, y_train) 52 | y_pred_label = model_label.predict(X_test_label) 53 | acc_label = accuracy_score(y_test, y_pred_label) 54 | 55 | # One-Hot encoded model 56 | model_ohe = LogisticRegression(random_state=42) 57 | model_ohe.fit(X_train_ohe, y_train) 58 | y_pred_ohe = model_ohe.predict(X_test_ohe) 59 | acc_ohe = accuracy_score(y_test, y_pred_ohe) 60 | 61 | print(f'Accuracy (Label Encoded): {acc_label:.2f}') 62 | print(f'Accuracy (One-Hot Encoded): {acc_ohe:.2f}') 63 | 64 | # Step 4: Visualize data distribution 65 | plt.figure(figsize=(12, 5)) 66 | 67 | # Label encoded 68 | plt.subplot(1, 2, 1) 69 | sns.countplot(x='Category_Label', hue='Target', data=data) 70 | plt.title('Label Encoded Categories') 71 | plt.xlabel('Category (Encoded)') 72 | 73 | # One-Hot encoded (show distribution of original categories) 74 | plt.subplot(1, 2, 2) 75 | sns.countplot(x='Category', hue='Target', data=data) 76 | plt.title('Original Categories') 77 | plt.xlabel('Category') 78 | 79 | plt.tight_layout() 80 | plt.savefig('encoding_categorical_variables.png') 81 | plt.close() -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/01 Data Preprocessing/02 Normalization/Normalization.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import load_iris 5 | from sklearn.preprocessing import MinMaxScaler 6 | from sklearn.linear_model import LogisticRegression 7 | from sklearn.metrics import accuracy_score 8 | from sklearn.model_selection import train_test_split 9 | import seaborn as sns 10 | 11 | # Normalization Demonstration 12 | # This script focuses on Normalization techniques using MinMaxScaler on the Iris dataset. 13 | 14 | # Tasks: 15 | # 1. Load and explore the Iris dataset. 16 | # 2. Apply MinMaxScaler for normalization (scales data to [0, 1] or custom range). 17 | # 3. Train a Logistic Regression model on normalized data. 18 | # 4. Compare performance with raw data. 19 | # 5. Visualize the effect of normalization.
20 | 21 | # Step 1: Load data 22 | iris = load_iris() 23 | X = iris.data 24 | y = iris.target 25 | data = pd.DataFrame(X, columns=iris.feature_names) 26 | 27 | # Explore raw data 28 | print("Raw Data Statistics:") 29 | print(data.describe()) 30 | 31 | # Step 2: Apply Normalization 32 | # Default range [0, 1] 33 | minmax_scaler = MinMaxScaler() 34 | X_normalized = minmax_scaler.fit_transform(X) 35 | 36 | # Custom range example [-1, 1] 37 | minmax_scaler_custom = MinMaxScaler(feature_range=(-1, 1)) 38 | X_normalized_custom = minmax_scaler_custom.fit_transform(X) 39 | 40 | # Step 3: Train Logistic Regression 41 | X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 42 | X_train_norm, X_test_norm = train_test_split(X_normalized, test_size=0.2, random_state=42) 43 | 44 | # Raw data model 45 | model_raw = LogisticRegression(random_state=42, max_iter=200) 46 | model_raw.fit(X_train_raw, y_train) 47 | y_pred_raw = model_raw.predict(X_test_raw) 48 | acc_raw = accuracy_score(y_test, y_pred_raw) 49 | 50 | # Normalized data model 51 | model_norm = LogisticRegression(random_state=42, max_iter=200) 52 | model_norm.fit(X_train_norm, y_train) 53 | y_pred_norm = model_norm.predict(X_test_norm) 54 | acc_norm = accuracy_score(y_test, y_pred_norm) 55 | 56 | # Step 4: Print results 57 | print(f'\nAccuracy (Raw Data): {acc_raw:.2f}') 58 | print(f'Accuracy (Normalized [0,1]): {acc_norm:.2f}') 59 | 60 | # Step 5: Visualize 61 | plt.figure(figsize=(10, 6)) 62 | 63 | # Raw data 64 | plt.subplot(2, 1, 1) 65 | sns.boxplot(data=data) 66 | plt.title('Raw Features') 67 | plt.xticks(rotation=45) 68 | 69 | # Normalized data [0, 1] 70 | plt.subplot(2, 1, 2) 71 | sns.boxplot(data=pd.DataFrame(X_normalized, columns=iris.feature_names)) 72 | plt.title('Normalized Features (MinMaxScaler [0,1])') 73 | plt.xticks(rotation=45) 74 | 75 | plt.tight_layout() 76 | plt.savefig('normalization_effect.png') 77 | plt.close() 78 | 79 | # Additional: Show custom range effect 80 | plt.figure(figsize=(6, 4)) 81 | sns.boxplot(data=pd.DataFrame(X_normalized_custom, columns=iris.feature_names)) 82 | plt.title('Normalized Features (MinMaxScaler [-1,1])') 83 | plt.xticks(rotation=45) 84 | plt.tight_layout() 85 | plt.savefig('normalization_custom_range.png') 86 | plt.close() 87 | 88 | print("\nNormalization complete. Check 'normalization_effect.png' and 'normalization_custom_range.png' for visualizations.") -------------------------------------------------------------------------------- /Machine Learning Foundations/03 ML Pipelines/01 Data Preprocessing/04 Handling Missing Values/HandlingMissingValues.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | from sklearn.datasets import load_iris 5 | from sklearn.impute import SimpleImputer 6 | from sklearn.linear_model import LogisticRegression 7 | from sklearn.metrics import accuracy_score 8 | from sklearn.model_selection import train_test_split 9 | import seaborn as sns 10 | 11 | # Handling Missing Values 12 | # This script demonstrates techniques to handle missing values in a dataset. 13 | 14 | # Tasks: 15 | # 1. Load the Iris dataset and introduce synthetic missing values. 16 | # 2. Apply mean imputation and median imputation. 17 | # 3. Train a Logistic Regression model on imputed data. 18 | # 4. Compare model performance (accuracy). 19 | # 5. Visualize data distribution before and after imputation. 
20 | 21 | # Step 1: Load data and introduce missing values 22 | iris = load_iris() 23 | X = iris.data 24 | y = iris.target 25 | data = pd.DataFrame(X, columns=iris.feature_names) 26 | 27 | # Introduce 10% missing values randomly 28 | np.random.seed(42) 29 | mask = np.random.rand(*X.shape) < 0.1 30 | X_with_missing = X.copy() 31 | X_with_missing[mask] = np.nan 32 | data_missing = pd.DataFrame(X_with_missing, columns=iris.feature_names) 33 | 34 | # Step 2: Apply imputation 35 | # Mean imputation 36 | mean_imputer = SimpleImputer(strategy='mean') 37 | X_mean_imputed = mean_imputer.fit_transform(X_with_missing) 38 | 39 | # Median imputation 40 | median_imputer = SimpleImputer(strategy='median') 41 | X_median_imputed = median_imputer.fit_transform(X_with_missing) 42 | 43 | # Step 3: Train Logistic Regression on each dataset 44 | X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 45 | X_train_mean, X_test_mean = train_test_split(X_mean_imputed, test_size=0.2, random_state=42) 46 | X_train_median, X_test_median = train_test_split(X_median_imputed, test_size=0.2, random_state=42) 47 | 48 | # Raw data (no missing values) 49 | model_raw = LogisticRegression(random_state=42) 50 | model_raw.fit(X_train_raw, y_train) 51 | y_pred_raw = model_raw.predict(X_test_raw) 52 | acc_raw = accuracy_score(y_test, y_pred_raw) 53 | 54 | # Mean imputed data 55 | model_mean = LogisticRegression(random_state=42) 56 | model_mean.fit(X_train_mean, y_train) 57 | y_pred_mean = model_mean.predict(X_test_mean) 58 | acc_mean = accuracy_score(y_test, y_pred_mean) 59 | 60 | # Median imputed data 61 | model_median = LogisticRegression(random_state=42) 62 | model_median.fit(X_train_median, y_train) 63 | y_pred_median = model_median.predict(X_test_median) 64 | acc_median = accuracy_score(y_test, y_pred_median) 65 | 66 | print(f'Accuracy (Raw): {acc_raw:.2f}') 67 | print(f'Accuracy (Mean Imputed): {acc_mean:.2f}') 68 | print(f'Accuracy (Median Imputed): {acc_median:.2f}') 69 | 70 | # Step 4: Visualize data distribution 71 | plt.figure(figsize=(12, 8)) 72 | 73 | # Original data with missing values 74 | plt.subplot(3, 1, 1) 75 | sns.boxplot(data=data_missing) 76 | plt.title('Data with Missing Values') 77 | plt.xticks(rotation=45) 78 | 79 | # Mean imputed data 80 | plt.subplot(3, 1, 2) 81 | sns.boxplot(data=pd.DataFrame(X_mean_imputed, columns=iris.feature_names)) 82 | plt.title('Mean Imputed Data') 83 | plt.xticks(rotation=45) 84 | 85 | # Median imputed data 86 | plt.subplot(3, 1, 3) 87 | sns.boxplot(data=pd.DataFrame(X_median_imputed, columns=iris.feature_names)) 88 | plt.title('Median Imputed Data') 89 | plt.xticks(rotation=45) 90 | 91 | plt.tight_layout() 92 | plt.savefig('handling_missing_values.png') 93 | plt.close() -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 🤖 Machine Learning Interview Preparation 2 | 3 |
4 | Machine Learning 5 | scikit-learn 6 |
7 | 8 |

Your comprehensive guide to mastering Machine Learning for AI/ML interviews

9 | 10 | --- 11 | 12 | ## 📖 Introduction 13 | 14 | Welcome to my Machine Learning prep for AI/ML interviews! 🚀 This repository is your go-to guide for mastering ML, the heart of AI, with hands-on practice and interview-focused insights. From core concepts to advanced techniques, it’s designed to help you excel in technical interviews and AI projects with clarity and confidence. 15 | 16 | ## 🌟 What’s Inside? 17 | 18 | - **Algorithms Mastery**: Conquer regression, classification, clustering, and more to ace coding tests. 19 | - **Pipelines Expertise**: Master preprocessing, evaluation, and deployment workflows. 20 | - **Hands-on Practice**: Implement ML algorithms with detailed solutions to sharpen your edge. 21 | - **Interview Question Bank**: Dive into ML theory with clear, concise answers. 22 | - **Performance Optimization**: Learn tips for building efficient, interview-ready models. 23 | 24 | ## 🔍 Who Is This For? 25 | 26 | - Data Scientists prepping for technical interviews. 27 | - Machine Learning Engineers strengthening ML foundations. 28 | - AI Researchers enhancing algorithm skills. 29 | - Software Engineers transitioning to AI/ML roles. 30 | - Anyone mastering ML for data-driven applications. 31 | 32 | ## 🗺️ Comprehensive Learning Roadmap 33 | 34 | --- 35 | 36 | ### 🤖 Machine Learning Foundations 37 | 38 | #### 📈 Supervised Learning 39 | - Regression 40 | - Linear Regression 41 | - Polynomial Regression 42 | - Ridge Regression 43 | - Lasso Regression 44 | - Classification 45 | - Logistic Regression 46 | - Decision Trees 47 | - Random Forest 48 | - Naive Bayes 49 | - K-Nearest Neighbors (KNN) 50 | - Support Vector Machines (SVM) 51 | - Evaluation Metrics 52 | - Regression Metrics 53 | - Mean Squared Error 54 | - Mean Absolute Error 55 | - R² Score 56 | - Classification Metrics 57 | - Accuracy 58 | - Precision 59 | - Recall 60 | - F1 Score 61 | - Confusion Matrix 62 | - ROC Curve 63 | - AUC Score 64 | 65 | #### 📊 Unsupervised Learning 66 | - Clustering 67 | - K-Means Clustering 68 | - Hierarchical Clustering 69 | - DBSCAN 70 | - Mean Shift 71 | - Dimensionality Reduction 72 | - Principal Component Analysis (PCA) 73 | - t-Distributed Stochastic Neighbor Embedding (t-SNE) 74 | - Linear Discriminant Analysis (LDA) 75 | - Association Rules 76 | - Apriori Algorithm 77 | - FP-Growth 78 | 79 | #### 🛠️ ML Pipelines 80 | - Data Preprocessing 81 | - Feature Scaling 82 | - Normalization 83 | - Standardization 84 | - Handling Missing Values 85 | - Outlier Detection 86 | - Encoding Categorical Variables 87 | - Feature Engineering 88 | - Feature Selection 89 | - Polynomial Features 90 | - Interaction Terms 91 | - Binning 92 | - Model Selection 93 | - Train-Test Split 94 | - K-Fold Cross-Validation 95 | - Stratified K-Fold 96 | - Grid Search 97 | - Random Search 98 | - Model Evaluation 99 | - Bias-Variance Tradeoff 100 | - Overfitting 101 | - Underfitting 102 | - Deployment 103 | - Model Serialization 104 | - API Integration 105 | 106 | #### 🎯 Ensemble Methods 107 | - Bagging 108 | - Bootstrap Aggregating 109 | - Random Forest 110 | - Boosting 111 | - AdaBoost 112 | - Gradient Boosting 113 | 114 | --- 115 | 116 | ## 💡 Why Master Machine Learning for AI/ML? 117 | 118 | Machine Learning fuels AI innovation, and here’s why: 119 | 1. **Versatility**: Powers predictive modeling and data insights. 120 | 2. **Industry Demand**: A core skill for 6 LPA+ AI/ML roles. 121 | 3. **Real-World Impact**: Solves complex problems across domains. 122 | 4. 
**Evolving Field**: Continuous learning with cutting-edge tools. 123 | 5. **Community Support**: Backed by a thriving network of experts. 124 | 125 | This repo is my toolkit for mastering ML for technical interviews and AI/ML careers—let’s build that expertise together! 126 | 127 | ## 📆 Study Plan 128 | 129 | - **Week 1-2**: Supervised Learning Basics 130 | - **Week 3-4**: Unsupervised Learning and Evaluation 131 | - **Week 5-6**: ML Pipelines and Preprocessing 132 | - **Week 7-8**: Feature Engineering and Model Selection 133 | - **Week 9-10**: Ensemble Methods and Deployment 134 | - **Week 11-12**: Interview Practice and Optimization 135 | 136 | ## 🤝 Contributions 137 | 138 | Love to collaborate? Here’s how! 🌟 139 | 1. Fork the repository. 140 | 2. Create a feature branch (`git checkout -b feature/amazing-addition`). 141 | 3. Commit your changes (`git commit -m 'Add some amazing content'`). 142 | 4. Push to the branch (`git push origin feature/amazing-addition`). 143 | 5. Open a Pull Request. 144 | 145 | --- 146 | 147 |
148 |

Happy Learning and Good Luck with Your Interviews! ✨

149 |
-------------------------------------------------------------------------------- /Machine Learning Interview Questions with Ans/README.md: -------------------------------------------------------------------------------- 1 | # Machine Learning Interview Questions for AI/ML Roles 2 | 3 | ## Supervised Learning 4 | Supervised learning trains models on labeled data to predict outcomes, crucial for tasks like sales forecasting or spam detection. 5 | 6 | ### Basic 7 | 1. **What is supervised learning, and how does it differ from unsupervised learning?** 8 | Supervised learning uses labeled data to predict outputs (e.g., classifying emails), while unsupervised learning finds patterns without labels (e.g., clustering customers). In AI/ML, supervised learning is chosen when the goal is prediction with known targets. 9 | 10 | 2. **What are the two main types of supervised learning tasks?** 11 | - **Regression**: Predicts continuous outputs (e.g., house prices). 12 | - **Classification**: Predicts categorical outputs (e.g., spam or not). These define the task type in ML workflows. 13 | 14 | 3. **What are some common evaluation metrics for regression models?** 15 | - **Mean Squared Error (MSE)**: Measures average squared error. 16 | - **R² Score**: Indicates variance explained by the model. 17 | - **Mean Absolute Error (MAE)**: Average absolute error. In AI/ML, these assess prediction accuracy for continuous data like sales forecasts. 18 | 19 | 4. **What is classification, and what are some common classification algorithms?** 20 | Classification predicts discrete labels (e.g., positive/negative sentiment). Algorithms include Logistic Regression, Decision Trees, and SVMs—used in tasks like fraud detection or medical diagnosis. 21 | 22 | ### Intermediate 23 | 5. **Explain the concept of linear regression. What is its goal?** 24 | Linear regression models the relationship between features and a continuous target using a straight line (y = mx + b). Its goal is to minimize prediction error, e.g., predicting house prices based on size and location. 25 | 26 | 6. **What is the difference between L1 and L2 regularization in regression?** 27 | - **L1 (Lasso)**: Adds absolute value of coefficients to the loss, promoting sparsity (some weights become zero). 28 | - **L2 (Ridge)**: Adds squared coefficients, shrinking weights evenly. In ML, they prevent overfitting in models like regression for noisy datasets. 29 | 30 | 7. **How does a decision tree make predictions? What are its pros and cons?** 31 | It splits data based on feature thresholds, predicting via leaf nodes (e.g., customer segmentation). 32 | - **Pros**: Interpretable, handles non-linear data. 33 | - **Cons**: Prone to overfitting, sensitive to small changes. 34 | 35 | 8. **What is the confusion matrix, and how is it used to evaluate classification models?** 36 | A confusion matrix shows true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). It evaluates accuracy, precision, and recall—key for tasks like diagnosing diseases. 37 | 38 | 9. **What is the Naive Bayes theorem, and why is it called "naive"?** 39 | Naive Bayes uses Bayes’ theorem (P(A|B) = P(B|A)P(A)/P(B)) to predict class probabilities, assuming feature independence (hence "naive"). It’s effective for text classification despite this simplification. 40 | 41 | ### Advanced 42 | 10. 
**How does polynomial regression capture non-linear relationships?** 43 | It extends linear regression by adding polynomial terms (e.g., x²), fitting curves to data like stock price trends. In ML, it balances flexibility and complexity for non-linear patterns. 44 | 45 | 11. **Explain the bias-variance tradeoff in model complexity.** 46 | - **Bias**: Error from overly simple models (underfitting). 47 | - **Variance**: Error from sensitivity to training data (overfitting). 48 | In AI/ML, optimal complexity minimizes total error for robust predictions. 49 | 50 | 12. **How does Random Forest improve upon decision trees?** 51 | Random Forest builds multiple trees on random data subsets and features, averaging predictions. It reduces overfitting and variance, enhancing accuracy for tasks like credit risk assessment. 52 | 53 | 13. **Explain the ROC curve and AUC score. Why are they useful?** 54 | The ROC curve plots true positive rate vs. false positive rate at various thresholds. AUC (Area Under Curve) summarizes performance—higher is better. They’re vital for imbalanced data, e.g., fraud detection. 55 | 56 | 14. **How does the K-Nearest Neighbors algorithm work, and what are its strengths and weaknesses?** 57 | KNN predicts by finding the K closest training points (e.g., via Euclidean distance) and voting/averaging. 58 | - **Strengths**: Simple, captures local patterns. 59 | - **Weaknesses**: Slow on large datasets, sensitive to irrelevant features. 60 | 61 | 15. **Explain support vectors in SVM and their role in finding the decision boundary.** 62 | Support vectors are data points nearest the decision boundary in SVM. They define the maximum-margin hyperplane, optimizing separation for high-dimensional tasks like image recognition. 63 | 64 | ## Unsupervised Learning 65 | Unsupervised learning uncovers patterns in unlabeled data, key for clustering or dimensionality reduction. 66 | 67 | ### Basic 68 | 16. **What is unsupervised learning, and what are its main applications?** 69 | Unsupervised learning finds structure without labels, used for clustering (e.g., market segmentation) and dimensionality reduction (e.g., feature compression). 70 | 71 | 17. **What is clustering, and why is it used?** 72 | Clustering groups similar data points (e.g., customers by behavior). It’s used for insights, anomaly detection, or preprocessing in ML workflows. 73 | 74 | 18. **What is dimensionality reduction, and why is it important?** 75 | It reduces feature count while retaining key information, speeding up modeling and enabling visualization (e.g., compressing image data). 76 | 77 | ### Intermediate 78 | 19. **Explain how K-Means clustering works. How do you choose the number of clusters?** 79 | K-Means assigns points to K clusters by minimizing distance to centroids, iteratively updating them. The elbow method (plotting inertia vs. K) helps select K—used for customer grouping. 80 | 81 | 20. **What is Principal Component Analysis (PCA), and how does it reduce dimensionality?** 82 | PCA transforms data into principal components (directions of max variance), retaining top ones. It compresses features for efficient modeling, e.g., in image processing. 83 | 84 | 21. **Explain the Apriori algorithm and its key steps.** 85 | Apriori finds frequent itemsets for association rules (e.g., in recommendation systems): 86 | - Identify frequent items. 87 | - Generate candidate itemsets. 88 | - Prune infrequent ones iteratively. 89 | 90 | 22. 
**What is Linear Discriminant Analysis (LDA), and when is it used?** 91 | LDA maximizes class separability for dimensionality reduction, unlike PCA’s variance focus. It’s used in supervised tasks (e.g., face recognition) with labeled data. 92 | 93 | ### Advanced 94 | 23. **What is hierarchical clustering, and how does it differ from K-Means?** 95 | Hierarchical clustering builds a tree of clusters (agglomerative or divisive), offering flexibility without preset K. It’s ideal for gene expression analysis, unlike K-Means’ fixed clusters. 96 | 97 | 24. **How does DBSCAN handle clusters of varying density?** 98 | DBSCAN groups points by density (core points, border points, noise), excelling at irregular clusters and anomaly detection (e.g., outlier identification). 99 | 100 | 25. **What is t-SNE, and how does it differ from PCA?** 101 | t-SNE reduces dimensions for visualization by preserving local structure (e.g., word embeddings), while PCA focuses on global variance. t-SNE is non-linear and computationally intensive. 102 | 103 | 26. **How does FP-Growth improve upon the Apriori algorithm?** 104 | FP-Growth uses a tree structure (FP-tree) to mine frequent patterns without candidate generation, making it faster for large transactional datasets in ML. 105 | 106 | ## ML Pipelines 107 | ML pipelines streamline data preprocessing to deployment for reproducible, scalable models. 108 | 109 | ### Basic 110 | 27. **What is a machine learning pipeline, and why is it important?** 111 | A pipeline sequences data preprocessing, modeling, and evaluation steps. It ensures consistency and scalability in ML workflows, e.g., automating sales prediction. 112 | 113 | 28. **What are some common data preprocessing techniques?** 114 | - **Normalization/Scaling**: Adjusts feature ranges. 115 | - **Encoding**: Converts categories to numbers. 116 | - **Imputation**: Fills missing values. These prepare raw data for modeling. 117 | 118 | 29. **What is the purpose of splitting data into training and testing sets?** 119 | It trains the model on one subset and evaluates generalization on another, ensuring real-world performance (e.g., 80/20 split). 120 | 121 | 30. **What is overfitting, and how can it be detected?** 122 | Overfitting occurs when a model learns noise, not patterns, performing well on training data but poorly on test data. It’s detected by comparing train vs. test accuracy. 123 | 124 | ### Intermediate 125 | 31. **Explain feature scaling and its importance in machine learning.** 126 | Feature scaling (e.g., standardization) normalizes feature ranges, ensuring equal contribution. It’s critical for distance-based algorithms like KNN or gradient descent. 127 | 128 | 32. **How do you handle missing values in a dataset?** 129 | - **Imputation**: Fill with mean/median/mode or predictive models. 130 | - **Deletion**: Remove rows/columns. In ML, this maintains data integrity for accurate predictions. 131 | 132 | 33. **What is encoding, and why is it necessary for categorical variables?** 133 | Encoding (e.g., one-hot, label encoding) converts categories to numbers, enabling algorithms to process them—essential for tasks like sentiment analysis. 134 | 135 | 34. **Explain K-Fold Cross-Validation and its advantages.** 136 | K-Fold splits data into K subsets, training on K-1 and testing on 1, rotating through all. It provides robust performance estimates, reducing overfitting risk. 137 | 138 | 35. 
**How does Grid Search help in hyperparameter tuning?**
139 | Grid Search tests all combinations of hyperparameters (e.g., learning rate), selecting the best via cross-validation—optimizing models like SVMs.
140 | 
141 | ### Advanced
142 | 36. **What is feature engineering, and how does it improve model performance?**
143 | Feature engineering creates new features (e.g., interaction terms) or transforms existing ones, enhancing predictive power for tasks like sales forecasting.
144 | 
145 | 37. **How do you create polynomial features, and when might you use them?**
146 | Polynomial features add terms like x² or xy (e.g., via Scikit-learn’s `PolynomialFeatures`). They’re used for non-linear patterns, like financial modeling.
147 | 
148 | 38. **What is stratified K-Fold, and when is it used?**
149 | Stratified K-Fold ensures class proportions in each fold match the dataset, critical for imbalanced data (e.g., rare disease detection).
150 | 
151 | 39. **What is Random Search, and how does it compare to Grid Search?**
152 | Random Search samples hyperparameter combinations randomly, often faster and equally effective for large spaces vs. Grid Search’s exhaustive approach.
153 | 
154 | 40. **How can you address overfitting in machine learning models?**
155 | - **Regularization**: Penalizes complexity (L1/L2).
156 | - **Cross-Validation**: Ensures generalization.
157 | - **More Data**: Reduces noise impact. These improve robustness in ML.
158 | 
159 | 41. **How do you serialize a machine learning model?**
160 | Serialization saves a trained model (e.g., using Python’s `pickle` or `joblib`) to disk for reuse:
```python
import pickle
from sklearn.linear_model import LinearRegression

model = LinearRegression()  # in practice, fit the model on training data before saving
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
```
167 | 
168 | 42. **Explain how to integrate a machine learning model into an API.**
169 | Load a serialized model into a web framework (e.g., Flask), exposing an endpoint:
```python
import pickle
from flask import Flask, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    data = request.json
    return {"prediction": float(model.predict([data["features"]])[0])}
```
180 | 
181 | ## Ensemble Methods
182 | Ensemble methods combine models for better accuracy and robustness, critical in high-stakes predictions.
183 | 
184 | ### Basic
185 | 43. **What are ensemble methods in machine learning?**
186 | Ensemble methods combine multiple models (e.g., trees) to improve predictions, leveraging diversity for tasks like medical diagnosis.
187 | 
188 | 44. **What is bagging, and how does it work?**
189 | Bagging (Bootstrap Aggregating) trains models on random data subsets, averaging predictions (e.g., Random Forest). It reduces variance in ML.
190 | 
191 | 45. **What is boosting, and how does it differ from bagging?**
192 | Boosting trains models sequentially, focusing on errors, unlike bagging’s parallel approach. It reduces bias for better accuracy.
193 | 
194 | ### Intermediate
195 | 46. **Explain Bootstrap Aggregating (Bagging) and its role in reducing variance.**
196 | Bagging samples data with replacement, training diverse models and aggregating outputs. It stabilizes predictions, e.g., in financial modeling.
197 | 
198 | 47. **Explain AdaBoost and its working principle.**
199 | AdaBoost assigns weights to misclassified samples, iteratively improving weak learners (e.g., stumps) by focusing on errors—effective for classification.
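A minimal scikit-learn sketch of the idea (the dataset, the 100-estimator setting, and the 80/20 split below are illustrative choices, not taken from this repository's scripts):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each boosting round fits a weak learner (a depth-1 tree by default) and re-weights
# the training samples so that previously misclassified points get more attention.
clf = AdaBoostClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")
```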
200 | 
201 | 48. **Compare and contrast bagging and boosting.**
202 | - **Bagging**: Parallel, reduces variance (e.g., Random Forest).
203 | - **Boosting**: Sequential, reduces bias (e.g., AdaBoost).
204 | In ML, choose based on overfitting vs. underfitting needs.
205 | 
206 | ### Advanced
207 | 49. **How does Random Forest use bagging to improve model performance?**
208 | Random Forest applies bagging to decision trees with random feature subsets, averaging outputs for robust predictions—used in risk assessment.
209 | 
210 | 50. **What is Gradient Boosting, and how does it build upon previous models?**
211 | Gradient Boosting minimizes a loss function (e.g., MSE) by adding models that correct residuals of prior ones—powers tools like XGBoost.
212 | 
213 | 51. **Why are ensemble methods often more accurate than individual models?**
214 | They reduce errors via diversity (bagging) or error correction (boosting), improving generalization for complex tasks.
215 | 
216 | ## Additional Questions
217 | 52. **How can generative AI be used for data augmentation in supervised learning?**
218 | Generative AI (e.g., GANs) creates synthetic data (e.g., images) to expand training sets, improving model robustness in tasks like image classification.
219 | 
220 | 53. **What is AutoML, and how does it benefit ML pipelines?**
221 | AutoML automates model selection, tuning, and preprocessing (e.g., Google AutoML), speeding up development and democratizing ML.
222 | 
223 | 54. **How do you handle imbalanced datasets in classification tasks?**
224 | - **SMOTE**: Oversamples minority class.
225 | - **Class Weights**: Adjusts loss function.
226 | - **Resampling**: Balances classes. These ensure fairness, e.g., in fraud detection.
227 | 
228 | 55. **Write a Python function to implement linear regression using Scikit-learn.**
229 | Demonstrates practical model building:
```python
from sklearn.linear_model import LinearRegression

def fit_linear_regression(X, y):
    model = LinearRegression()
    model.fit(X, y)
    return model

# Example usage
X = [[1], [2], [3]]
y = [2, 4, 6]
model = fit_linear_regression(X, y)
print(model.predict([[4]]))  # Predicts ~8
```
--------------------------------------------------------------------------------