└── linear_regression.ipynb

/linear_regression.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Titanic Dataset: Survival Analysis and Death Prediction\n",
    "\n",
    "This notebook demonstrates a workflow for analyzing the Titanic dataset to predict whether a passenger died. It includes data loading, preprocessing tailored to the Titanic dataset, training a classification model (Logistic Regression), evaluating it with appropriate classification metrics, and visualizing the results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Import Libraries\n",
    "\n",
    "This section imports all the Python libraries needed for the workflow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Common libraries for data manipulation, machine learning, and plotting.\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Load and Prepare Data\n",
    "\n",
    "This section loads the dataset. To adapt this notebook to your own data, replace the seaborn loader below with code that reads your dataset (e.g., from a CSV file)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- CUSTOMIZE THIS SECTION ---\n",
    "# To use your own data, replace the seaborn loader below, e.g.:\n",
    "# df = pd.read_csv('your_data.csv')   # path to your data file\n",
    "# X = df[['YourFeatureColumn']]       # your feature column(s)\n",
    "# y = df['YourTargetColumn']          # your target column\n",
    "\n",
    "# Load the Titanic dataset bundled with seaborn.\n",
    "df = sns.load_dataset('titanic')\n",
    "\n",
    "print(\"Titanic DataFrame head (first 5 rows):\")\n",
    "print(df.head())"
   ]
  },
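  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick, optional sanity check before deeper exploration, the sketch below prints the dataset's shape and the raw class balance of `survived` (the column this notebook later derives its target from). The column name is that of seaborn's bundled Titanic dataset; adjust it if you load different data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: dataset size and raw class balance.\n",
    "# 'survived' is seaborn's column name (0 = died, 1 = survived).\n",
    "print(\"Shape (rows, columns):\", df.shape)\n",
    "print(\"\\nClass balance of 'survived':\")\n",
    "print(df['survived'].value_counts(normalize=True))"
   ]
  },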
\n", 61 | "# Remove or comment out when using your own data.\n", 62 | "# X_np = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])\n", 63 | "# y_np = np.array([2, 4, 5, 4, 6, 8, 9, 11, 10, 12])\n", 64 | "\n", 65 | "# Convert to pandas DataFrames\n", 66 | "# X = pd.DataFrame(X_np, columns=['Feature'])\n", 67 | "# y = pd.DataFrame(y_np, columns=['Target'])\n", 68 | "\n", 69 | "# Load the Titanic dataset from Seaborn\n", 70 | "df = sns.load_dataset('titanic')\n", 71 | "\n", 72 | "print(\"Titanic DataFrame head (First 5 rows):\")\n", 73 | "print(df.head())\n", 74 | "# print(\"X head (First 5 rows of features):\")\n", 75 | "# print(X.head())\n", 76 | "# print(\"\\ny head (First 5 rows of target variable):\")\n", 77 | "# print(y.head())" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "## 2.1. Initial Data Exploration\n", 85 | "Let's take a first look at the dataset's structure, data types, summary statistics, and missing values." 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "print(\"--- Dataset Info ---\")\n", 95 | "df.info()\n", 96 | "print(\"\\n--- Dataset Description (Numerical Features) ---\")\n", 97 | "print(df.describe())\n", 98 | "print(\"\\n--- Dataset Description (Categorical Features) ---\")\n", 99 | "print(df.describe(include=['object', 'category']))\n", 100 | "print(\"\\n--- Missing Values per Column ---\")\n", 101 | "print(df.isnull().sum())\n", 102 | "print(\"\\n--- First 5 Rows (already printed during load but good for quick check) ---\")\n", 103 | "print(df.head())" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "### 2.2. Preprocess Data for Death Prediction\n", 111 | "This section covers the necessary preprocessing steps for the Titanic dataset to prepare it for predicting death. This includes defining the target variable, feature selection, handling missing values, and encoding categorical features." 
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2. Preprocess Data for Death Prediction\n",
    "\n",
    "This section prepares the Titanic dataset for predicting death: defining the target variable, selecting features, handling missing values, and encoding categorical features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define target variable 'Died' (1 if the passenger did not survive, 0 otherwise).\n",
    "df['Died'] = 1 - df['survived']\n",
    "\n",
    "# Drop columns that leak the target or are redundant:\n",
    "# - 'survived' and 'alive' encode the target directly\n",
    "# - 'deck' has too many missing values\n",
    "# - 'embark_town' duplicates 'embarked'\n",
    "# - 'who', 'adult_male', 'class' are redundant with 'sex', 'age', 'pclass'\n",
    "df.drop(['survived', 'alive', 'deck', 'embark_town', 'who', 'adult_male', 'class'], axis=1, inplace=True)\n",
    "\n",
    "# Impute 'age' with the median and 'embarked' with the mode.\n",
    "df['age'] = df['age'].fillna(df['age'].median())\n",
    "df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])\n",
    "\n",
    "# One-hot encode 'sex' and 'embarked'; drop_first=True helps avoid multicollinearity.\n",
    "df = pd.get_dummies(df, columns=['sex', 'embarked'], drop_first=True)\n",
    "\n",
    "# Define features (X) and target (y).\n",
    "X = df.drop('Died', axis=1)\n",
    "y = df['Died']\n",
    "\n",
    "print(\"--- Preprocessing Complete ---\")\n",
    "print(\"\\nSelected features (X head):\")\n",
    "print(X.head())\n",
    "print(\"\\nTarget variable (y head):\")\n",
    "print(y.head())\n",
    "print(\"\\nMissing values in X after preprocessing:\")\n",
    "print(X.isnull().sum())\n",
    "print(\"\\nX shape:\", X.shape)\n",
    "print(\"y shape:\", y.shape)"
   ]
  },
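  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The preprocessing above mutates `df` in place and fits the imputation values on the full dataset before the train/test split. As an illustration, here is a minimal alternative sketch using scikit-learn's `Pipeline` and `ColumnTransformer` (standard scikit-learn APIs; this sketch is self-contained and not used by the rest of the notebook), where imputation and encoding are fitted on training folds only, avoiding leakage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A minimal leak-free preprocessing sketch using scikit-learn Pipelines.\n",
    "# It re-loads the raw data so it does not interfere with this notebook's df.\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "\n",
    "raw = sns.load_dataset('titanic')\n",
    "numeric_cols = ['pclass', 'age', 'sibsp', 'parch', 'fare']\n",
    "categorical_cols = ['sex', 'embarked']\n",
    "\n",
    "preprocess = ColumnTransformer([\n",
    "    # Median-impute numeric columns; mode-impute then one-hot encode categoricals.\n",
    "    ('num', SimpleImputer(strategy='median'), numeric_cols),\n",
    "    ('cat', Pipeline([\n",
    "        ('impute', SimpleImputer(strategy='most_frequent')),\n",
    "        ('onehot', OneHotEncoder(handle_unknown='ignore')),\n",
    "    ]), categorical_cols),\n",
    "])\n",
    "\n",
    "pipe = Pipeline([\n",
    "    ('prep', preprocess),\n",
    "    ('model', LogisticRegression(random_state=42, solver='liblinear')),\n",
    "])\n",
    "\n",
    "# 5-fold cross-validated accuracy on the 'Died' target (1 - survived).\n",
    "scores = cross_val_score(pipe, raw[numeric_cols + categorical_cols], 1 - raw['survived'], cv=5)\n",
    "print('Cross-validated accuracy: %.4f (+/- %.4f)' % (scores.mean(), scores.std()))"
   ]
  },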
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Split Data into Training and Testing Sets\n",
    "\n",
    "The preprocessed data (features `X` and target `y`, where `y` represents 'Died') is now split into training and testing sets. The training set is used to fit the classification model; the testing set is used to evaluate its performance on unseen data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# `test_size` sets the proportion of data held out for testing (0.2 = 20%).\n",
    "# `random_state` makes the split reproducible: the same seed gives the same split.\n",
    "# For imbalanced targets, consider also passing stratify=y to preserve class proportions.\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
    "\n",
    "print(\"X_train shape (features for training):\", X_train.shape)\n",
    "print(\"X_test shape (features for testing):\", X_test.shape)\n",
    "print(\"y_train shape (target for training):\", y_train.shape)\n",
    "print(\"y_test shape (target for testing):\", y_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Train the Logistic Regression Model\n",
    "\n",
    "This section trains the Logistic Regression model on the training data (`X_train`, `y_train`). Logistic Regression is a natural choice here because this is a binary classification problem (predicting whether a passenger died or survived)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize the LogisticRegression model; its hyperparameters can be tuned further.\n",
    "model = LogisticRegression(random_state=42, solver='liblinear')\n",
    "\n",
    "# Fit the model to the training data.\n",
    "model.fit(X_train, y_train)\n",
    "\n",
    "print(\"Logistic Regression model trained successfully.\")\n",
    "# intercept_ is an array: one value for binary classification, one per class otherwise.\n",
    "print(\"Intercept (bias term):\", model.intercept_)\n",
    "# coef_ holds one weight per feature, for the positive class ('Died' = 1).\n",
    "print(\"Coefficients (weights for each feature):\", model.coef_)"
   ]
  },
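  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Logistic regression coefficients are on the log-odds scale, so they are easier to read after exponentiation. The optional sketch below (using only the NumPy and pandas imports already in this notebook) pairs each feature with `exp(coef)`, the multiplicative change in the odds of death per unit increase of that feature, holding the others fixed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: interpret coefficients as odds ratios (exp of the log-odds weights).\n",
    "# Values > 1 increase the predicted odds of death; values < 1 decrease them.\n",
    "# Note: features are unscaled, so magnitudes are not directly comparable across features.\n",
    "odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)\n",
    "print(odds_ratios.sort_values(ascending=False))"
   ]
  },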
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Make Predictions\n",
    "\n",
    "Once trained, the Logistic Regression model can make predictions on new, unseen data (here, `X_test`). Each prediction indicates whether the model thinks a passenger died (1) or survived (0)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use the trained model to predict target values for the test set.\n",
    "y_pred = model.predict(X_test)\n",
    "\n",
    "# Display the first few predictions next to the actual values for a quick comparison.\n",
    "print(\"First 5 predictions:\", y_pred[:5])\n",
    "print(\"First 5 actual values:\", y_test.values[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Evaluate the Classification Model\n",
    "\n",
    "Model evaluation shows how well the classifier performs. We use Accuracy, Precision, Recall, F1-score, and the Confusion Matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Accuracy: proportion of correct predictions.\n",
    "accuracy = accuracy_score(y_test, y_pred)\n",
    "# Precision: proportion of predicted positives that are actually positive.\n",
    "precision = precision_score(y_test, y_pred)  # for class 1 (Died)\n",
    "# Recall: proportion of actual positives that are identified correctly.\n",
    "recall = recall_score(y_test, y_pred)  # for class 1 (Died)\n",
    "# F1-score: harmonic mean of Precision and Recall.\n",
    "f1 = f1_score(y_test, y_pred)  # for class 1 (Died)\n",
    "# Confusion Matrix: counts of true/false positives and negatives.\n",
    "cm = confusion_matrix(y_test, y_pred)\n",
    "\n",
    "print(f\"Accuracy: {accuracy:.4f}\")\n",
    "print(f\"Precision (for Died=1): {precision:.4f}\")\n",
    "print(f\"Recall (for Died=1): {recall:.4f}\")\n",
    "print(f\"F1-score (for Died=1): {f1:.4f}\")\n",
    "print(\"\\nConfusion Matrix:\")\n",
    "print(cm)\n",
    "\n",
    "# Interpretation:\n",
    "# Accuracy: overall, how often is the classifier correct? (TP+TN)/total\n",
    "# Precision: when it predicts someone died, how often is it correct? TP/(TP+FP)\n",
    "# Recall (Sensitivity): of all the people who actually died, how many did it identify? TP/(TP+FN)\n",
    "# F1-score: harmonic mean of precision and recall; useful when classes are imbalanced.\n",
    "# Confusion Matrix layout:\n",
    "# [[TN, FP],\n",
    "#  [FN, TP]]\n",
    "# TN = True Negatives  (correctly predicted Survived)\n",
    "# FP = False Positives (incorrectly predicted Died - Type I error)\n",
    "# FN = False Negatives (incorrectly predicted Survived - Type II error)\n",
    "# TP = True Positives  (correctly predicted Died)"
   ]
  },
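  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a complement to the individual metrics above, here is an optional sketch using two more standard scikit-learn helpers: `classification_report`, which summarizes precision, recall, and F1 for both classes at once, and `roc_auc_score` on the predicted probabilities, which measures ranking quality independently of the default 0.5 decision threshold."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: consolidated per-class report and a threshold-independent score.\n",
    "from sklearn.metrics import classification_report, roc_auc_score\n",
    "\n",
    "print(classification_report(y_test, y_pred, target_names=['Survived (0)', 'Died (1)']))\n",
    "\n",
    "# predict_proba returns [P(Survived), P(Died)] per row; take the 'Died' column.\n",
    "y_proba = model.predict_proba(X_test)[:, 1]\n",
    "print(f'ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}')"
   ]
  },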
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Visualize Classification Results\n",
    "\n",
    "Visualizing the results can provide further insight. For classification, a heatmap of the confusion matrix is a common and useful visualization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize the confusion matrix as a heatmap.\n",
    "plt.figure(figsize=(8, 6))\n",
    "sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,\n",
    "            xticklabels=['Predicted Survived (0)', 'Predicted Died (1)'],\n",
    "            yticklabels=['Actual Survived (0)', 'Actual Died (1)'])\n",
    "plt.xlabel('Predicted Label')\n",
    "plt.ylabel('True Label')\n",
    "plt.title('Confusion Matrix Heatmap')\n",
    "plt.show()\n",
    "# Reading the heatmap:\n",
    "# Top-left (TN): correctly predicted Survived.\n",
    "# Top-right (FP): incorrectly predicted Died.\n",
    "# Bottom-left (FN): incorrectly predicted Survived.\n",
    "# Bottom-right (TP): correctly predicted Died."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
--------------------------------------------------------------------------------