├── eda_stop_counts.png
├── eda_arrival_hist.png
├── eda_stop_locations.png
├── top_routes_by_trips.png
├── correlation_stop_coords.png
├── trip_duration_boxplot.png
├── trip_duration_outliers.png
├── README.md
├── Project_Code
└── mainproject.py

/eda_stop_counts.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ali1Azam/Delhi-Transport-Analysis/HEAD/eda_stop_counts.png
--------------------------------------------------------------------------------
/eda_arrival_hist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ali1Azam/Delhi-Transport-Analysis/HEAD/eda_arrival_hist.png
--------------------------------------------------------------------------------
/eda_stop_locations.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ali1Azam/Delhi-Transport-Analysis/HEAD/eda_stop_locations.png
--------------------------------------------------------------------------------
/top_routes_by_trips.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ali1Azam/Delhi-Transport-Analysis/HEAD/top_routes_by_trips.png
--------------------------------------------------------------------------------
/correlation_stop_coords.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ali1Azam/Delhi-Transport-Analysis/HEAD/correlation_stop_coords.png
--------------------------------------------------------------------------------
/trip_duration_boxplot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ali1Azam/Delhi-Transport-Analysis/HEAD/trip_duration_boxplot.png
--------------------------------------------------------------------------------
/trip_duration_outliers.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ali1Azam/Delhi-Transport-Analysis/HEAD/trip_duration_outliers.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# 🚌 Delhi-Transport-Analysis: Static Public Transport Analysis of Delhi (GTFS)

This project analyzes Delhi's static public transport data, published in the GTFS (General Transit Feed Specification) format. It provides insights into how services operate, stop coverage, route popularity, and time-based patterns, all powered by Python.

---

## 📁 Dataset Source

- **Format:** GTFS (General Transit Feed Specification)
- **Files used:** `routes.txt`, `trips.txt`, `stop_times.txt`, `stops.txt`, `calendar.txt`
- **Source:** [Delhi Open Transit Data](https://otd.delhi.gov.in/data/static/)

---

## 🎯 Project Objectives

1. **Exploratory Data Analysis (EDA)**
   - Data overview, structure, and relationships across GTFS files
2. **Summary Statistics**
   - Descriptive insights for trips, stops, and services
3. **Average Stop Sequence Length per Trip**
   - Understanding trip-length variability
4. **Histogram of Arrival Times**
   - Identifying peak-time patterns
5. **Distribution of Stop Locations**
   - Geospatial spread of the city's bus stop network
6. **Route Popularity**
   - Top 20 most frequently operated bus routes

---

## 🧠 Learning Experience

This project helped me:
- Work with real-world public transit data formats
- Use Python libraries such as `pandas`, `numpy`, `matplotlib`, and `seaborn` for data manipulation and visualization
- Clean and preprocess large, messy datasets effectively
- Visualize route-level and trip-level patterns to understand urban mobility
- Strengthen my understanding of public infrastructure data

---

## 📊 Tools & Libraries

- `Python`
- `Pandas`
- `NumPy`
- `Matplotlib`
- `Seaborn`

---

## 📁 Folder Structure

📦 project-root/
┣ 📂 data/
┃ ┗ 📜 routes.txt, trips.txt, stops.txt, stop_times.txt, calendar.txt
┣ 📂 figures/
┃ ┗ 📊 All exported graphs (PNG format)
┣ 📜 mainproject.py
┗ 📜 README.md

--------------------------------------------------------------------------------
/Project_Code:
--------------------------------------------------------------------------------
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

DATA_DIR = r"D:\SEM4\Python Data Science\Delhi-Transport-Analysis\Data\Raw"

routes = pd.read_csv(os.path.join(DATA_DIR, "routes.txt"))
trips = pd.read_csv(os.path.join(DATA_DIR, "trips.txt"))
stop_times = pd.read_csv(os.path.join(DATA_DIR, "stop_times.txt"))
stops = pd.read_csv(os.path.join(DATA_DIR, "stops.txt"))
calendar = pd.read_csv(os.path.join(DATA_DIR, "calendar.txt"))

# All charts are written into 'figures/'; create it up front so savefig cannot fail
os.makedirs('figures', exist_ok=True)


# === Objective 1: Data Cleaning & Management ===

# Drop rows with missing arrival or departure time
stop_times.dropna(subset=['arrival_time', 'departure_time'], inplace=True)

# Keep only rows where the time looks like HH:MM:SS or HH:MM
stop_times = stop_times[
    stop_times['arrival_time'].str.contains(':') &
    stop_times['departure_time'].str.contains(':')
]

# Convert time strings to minutes from midnight. Note that GTFS allows hours
# beyond 24 (e.g. 25:10:00) for trips running past midnight, so the hour is
# deliberately not taken modulo 24.
def time_to_minutes(t):
    try:
        parts = t.split(':')
        return int(parts[0]) * 60 + int(parts[1])
    except (AttributeError, IndexError, ValueError):
        return np.nan

stop_times['arrival_minutes'] = stop_times['arrival_time'].apply(time_to_minutes)
stop_times['departure_minutes'] = stop_times['departure_time'].apply(time_to_minutes)

# Drop rows where the conversion failed
stop_times.dropna(subset=['arrival_minutes', 'departure_minutes'], inplace=True)

print("Data cleaned successfully! Shape after cleaning:", stop_times.shape)


# Remove rows missing critical fields such as route_id or route_short_name
routes.dropna(subset=['route_id', 'route_short_name'], inplace=True)

# Ensure route_id is a string (IDs are joined across files, so dtypes must match)
routes['route_id'] = routes['route_id'].astype(str)

print("Cleaned 'routes' — shape:", routes.shape)


# Drop rows with missing trip_id or route_id
trips.dropna(subset=['trip_id', 'route_id'], inplace=True)

# Convert IDs to string for consistency
trips['trip_id'] = trips['trip_id'].astype(str)
trips['route_id'] = trips['route_id'].astype(str)

print("Cleaned 'trips' — shape:", trips.shape)


# Drop rows with a missing service_id, then drop services not active on any day
calendar.dropna(subset=['service_id'], inplace=True)

day_cols = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday']
calendar = calendar[calendar[day_cols].sum(axis=1) > 0]

# Ensure service_id is a string
calendar['service_id'] = calendar['service_id'].astype(str)

print("Cleaned 'calendar' — shape:", calendar.shape)


# Drop rows with missing stop_id, stop_name, or coordinates
stops.dropna(subset=['stop_id', 'stop_name', 'stop_lat', 'stop_lon'], inplace=True)

# Convert stop_id to string and make sure coordinates are float
stops['stop_id'] = stops['stop_id'].astype(str)
stops['stop_lat'] = stops['stop_lat'].astype(float)
stops['stop_lon'] = stops['stop_lon'].astype(float)

print("Cleaned 'stops' — shape:", stops.shape)


# === Objective 2: Exploratory Data Analysis (EDA) ===

sns.set(style='whitegrid')

# --- 1. Distribution of Stop Locations (Latitude & Longitude) ---
plt.figure(figsize=(8, 6))
sns.scatterplot(data=stops, x='stop_lon', y='stop_lat', s=10, alpha=0.5)
plt.title('Bus Stops in Delhi (Longitude vs Latitude)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.tight_layout()
plt.savefig('figures/eda_stop_locations.png')
plt.close()

# --- 2. Histogram of Arrival Times ---
plt.figure(figsize=(10, 4))
sns.histplot(stop_times['arrival_minutes'].dropna(), bins=48, kde=True, color='teal')
plt.title('Distribution of Bus Arrival Times')
plt.xlabel('Minutes from Midnight')
plt.ylabel('Frequency')
plt.tight_layout()
plt.savefig('figures/eda_arrival_hist.png')
plt.close()

# --- 3. Number of Stops per Trip ---
sequence_stats = stop_times.groupby('trip_id')['stop_sequence'].count()

plt.figure(figsize=(8, 4))
sns.histplot(sequence_stats, bins=30, kde=True, color='darkorange')
plt.title('Number of Stops per Trip')
plt.xlabel('Stop Count')
plt.ylabel('Number of Trips')
plt.tight_layout()
plt.savefig('figures/eda_stop_counts.png')
plt.close()

print("EDA completed! Visualizations saved in the 'figures' folder.")


# === Objective 3: Summary Statistics ===

# 1. Summary of 'routes'
print("\n🔹 ROUTES Summary Statistics:")
print(routes.describe(include='all'))

# 2. Summary of 'trips'
print("\n🔹 TRIPS Summary Statistics:")
print(trips.describe(include='all'))

# Total number of unique trips
unique_trips = trips['trip_id'].nunique()
print(f"\nTotal Unique Trips: {unique_trips}")

# 3. Summary of 'calendar' — count services active on each day
print("\n🔹 CALENDAR Service Availability by Day:")
for day in day_cols:
    count = (calendar[day] == 1).sum()
    print(f"{day.capitalize()}: {count} services")

# 4. Summary of 'stops'
print("\n🔹 STOPS Summary Statistics:")
print(stops[['stop_lat', 'stop_lon']].describe())

# Count unique stops
print(f"\nTotal Unique Stops: {stops['stop_id'].nunique()}")

# Check for missing values in each dataset
print("\n🔹 Missing Values Check:")
print("Routes:\n", routes.isnull().sum())
print("Trips:\n", trips.isnull().sum())
print("Calendar:\n", calendar.isnull().sum())
print("Stops:\n", stops.isnull().sum())


# === Objective 4: Correlation & Covariance ===

# Drop rows where every calendar day column is NaN
calendar_days = calendar[day_cols].dropna(how='all')

# Drop rows with NaN in latitude or longitude
stops_coords = stops[['stop_lat', 'stop_lon']].dropna()

# ==== CALENDAR CORRELATION ====
if calendar_days.shape[0] >= 2:
    calendar_corr = calendar_days.corr()

    if not calendar_corr.isna().all().all():
        plt.figure(figsize=(8, 6))
        sns.heatmap(calendar_corr, annot=True, cmap='coolwarm', fmt='.2f')
        plt.title('Correlation between Service Days (Calendar)')
        plt.tight_layout()
        plt.savefig('figures/correlation_calendar.png')
        plt.close()
    else:
        print("All values in calendar correlation are NaN.")
else:
    print("Not enough valid rows in calendar data for correlation.")

# ==== CALENDAR COVARIANCE ====
if calendar_days.shape[0] >= 2:
    calendar_cov = calendar_days.cov()

    if not calendar_cov.isna().all().all():
        plt.figure(figsize=(8, 6))
        sns.heatmap(calendar_cov, annot=True, cmap='YlGnBu', fmt='.2f')
        plt.title('Covariance between Service Days (Calendar)')
        plt.tight_layout()
        plt.savefig('figures/covariance_calendar.png')
        plt.close()
    else:
        print("All values in calendar covariance are NaN.")
else:
    print("Not enough valid rows in calendar data for covariance.")

# ==== STOPS LAT/LON CORRELATION ====
if stops_coords.shape[0] >= 2:
    stops_corr = stops_coords.corr()

    if not stops_corr.isna().all().all():
        plt.figure(figsize=(5, 4))
        sns.heatmap(stops_corr, annot=True, cmap='magma', fmt='.2f')
        plt.title('Correlation between Stop Latitude and Longitude')
        plt.tight_layout()
        plt.savefig('figures/correlation_stop_coords.png')
        plt.close()
    else:
        print("All values in stop coordinates correlation are NaN.")
else:
    print("Not enough valid stop coordinates for correlation.")


# === Objective 5: Outlier Detection (Trip Duration Outliers) ===

# === Step 1: Compute Trip Durations ===

# Duration = last arrival minus first departure per trip. errors='coerce'
# turns malformed time strings into NaT, which dropna() removes below.
stop_times['arrival_time'] = pd.to_timedelta(stop_times['arrival_time'], errors='coerce')
stop_times['departure_time'] = pd.to_timedelta(stop_times['departure_time'], errors='coerce')

trip_durations = stop_times.groupby('trip_id').agg(
    start_time=('departure_time', 'min'),
    end_time=('arrival_time', 'max')
).dropna()

trip_durations['duration_min'] = (trip_durations['end_time'] - trip_durations['start_time']).dt.total_seconds() / 60

# Filter out unrealistic durations (non-positive, or longer than ~8 hours)
trip_durations = trip_durations[(trip_durations['duration_min'] > 0) & (trip_durations['duration_min'] < 500)]

# === Step 2: Detect Outliers using the 1.5 × IQR Rule ===

Q1 = trip_durations['duration_min'].quantile(0.25)
Q3 = trip_durations['duration_min'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = trip_durations[(trip_durations['duration_min'] < lower_bound) | (trip_durations['duration_min'] > upper_bound)]
non_outliers = trip_durations[(trip_durations['duration_min'] >= lower_bound) & (trip_durations['duration_min'] <= upper_bound)]

print(f"Total Trips Analyzed: {len(trip_durations)}")
print(f"Outliers Detected: {len(outliers)}")

# === Step 3: Visualize Outliers ===

# Boxplot
plt.figure(figsize=(8, 4))
sns.boxplot(x=trip_durations['duration_min'], color='skyblue')
plt.title('Boxplot of Trip Durations (in Minutes)')
plt.xlabel('Duration (minutes)')
plt.tight_layout()
plt.savefig('figures/trip_duration_boxplot.png')
plt.close()

# Histogram with outliers highlighted
plt.figure(figsize=(10, 6))
sns.histplot(non_outliers['duration_min'], bins=50, color='green', label='Normal Trips', kde=True)
sns.histplot(outliers['duration_min'], bins=50, color='red', label='Outliers', kde=True)
plt.title('Distribution of Trip Durations')
plt.xlabel('Duration (minutes)')
plt.ylabel('Number of Trips')
plt.legend()
plt.tight_layout()
plt.savefig('figures/trip_duration_outliers.png')
plt.close()


# === Objective 6: Route Popularity ===

# === Step 1: Count trips per route_id ===
route_trip_counts = trips['route_id'].value_counts().reset_index()
route_trip_counts.columns = ['route_id', 'num_trips']

# === Step 2: Select the top 20 routes (value_counts() is already sorted descending) ===
top_routes = route_trip_counts.head(20)

# === Step 3: Plot by route_id rather than route_short_name (IDs are guaranteed unique) ===
plt.figure(figsize=(12, 6))
sns.barplot(data=top_routes, x='route_id', y='num_trips', color='skyblue')
plt.title('Top 20 Most Popular Route IDs by Number of Trips')
plt.xlabel('Route ID')
plt.ylabel('Number of Trips')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('figures/top_routes_by_trips.png')
plt.show()

--------------------------------------------------------------------------------
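A possible refinement to the route-popularity step: the chart labels its bars by `route_id`, which is unambiguous but not reader-friendly. The sketch below is a hypothetical helper, not part of the repository's script; it joins the per-route trip counts with `routes.txt` to attach `route_short_name` labels (column names follow the GTFS spec), using a left join so routes missing a name still appear.

```python
import pandas as pd

def top_routes_with_names(trips: pd.DataFrame, routes: pd.DataFrame, n: int = 20) -> pd.DataFrame:
    """Count trips per route_id and attach route_short_name where available."""
    counts = trips["route_id"].value_counts().reset_index()
    # Positional rename works across pandas versions, where the reset column
    # may be called either 'route_id'/'count' or 'index'/'route_id'
    counts.columns = ["route_id", "num_trips"]
    # Left-join so routes absent from routes.txt still appear (name = NaN)
    merged = counts.merge(
        routes[["route_id", "route_short_name"]], on="route_id", how="left"
    )
    return merged.head(n)  # value_counts() already sorts descending
```

The returned frame can be passed straight to `sns.barplot(..., x='route_short_name', y='num_trips')` in place of `top_routes`, falling back to `route_id` for rows where the name is missing.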