├── eda_stop_counts.png
├── eda_arrival_hist.png
├── eda_stop_locations.png
├── top_routes_by_trips.png
├── correlation_stop_coords.png
├── trip_duration_boxplot.png
├── trip_duration_outliers.png
├── README.md
├── Project_Code
└── mainproject.py

/eda_stop_counts.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ali1Azam/Delhi-Transport-Analysis/HEAD/eda_stop_counts.png
--------------------------------------------------------------------------------
/eda_arrival_hist.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ali1Azam/Delhi-Transport-Analysis/HEAD/eda_arrival_hist.png
--------------------------------------------------------------------------------
/eda_stop_locations.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ali1Azam/Delhi-Transport-Analysis/HEAD/eda_stop_locations.png
--------------------------------------------------------------------------------
/top_routes_by_trips.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ali1Azam/Delhi-Transport-Analysis/HEAD/top_routes_by_trips.png
--------------------------------------------------------------------------------
/correlation_stop_coords.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ali1Azam/Delhi-Transport-Analysis/HEAD/correlation_stop_coords.png
--------------------------------------------------------------------------------
/trip_duration_boxplot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ali1Azam/Delhi-Transport-Analysis/HEAD/trip_duration_boxplot.png
--------------------------------------------------------------------------------
/trip_duration_outliers.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ali1Azam/Delhi-Transport-Analysis/HEAD/trip_duration_outliers.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# 🚌 Delhi-Transport-Analysis: Static Public Transport Analysis of Delhi (GTFS)

This project analyzes Delhi's static public transport data, published in the GTFS (General Transit Feed Specification) format. It provides insights into how services operate, stop coverage, route popularity, and time-based patterns, all powered by Python.

---

## 📁 Dataset Source

- **Format:** GTFS (General Transit Feed Specification)
- **Files used:** `routes.txt`, `trips.txt`, `stop_times.txt`, `stops.txt`, `calendar.txt`
- **Source:** [Delhi Open Transit Data](https://otd.delhi.gov.in/data/static/)

---

## 🎯 Project Objectives

1. **Exploratory Data Analysis (EDA)**
   - Data overview, structure, and relationships across GTFS files
2. **Summary Statistics**
   - Descriptive insights for trips, stops, and services
3. **Average Stop Sequence Length per Trip**
   - Understanding trip-length variability
4. **Histogram of Arrival Times**
   - Identifying peak-time patterns
5. **Distribution of Stop Locations**
   - Geospatial spread of the city's bus stop network
6. **Route Popularity**
   - Top 20 most frequently operated bus routes

---

## 🧠 Learning Experience

This project helped me:
- Work with real-world public transit data formats
- Use Python libraries such as `pandas`, `numpy`, `matplotlib`, and `seaborn` for data manipulation and visualization
- Clean and preprocess large, messy datasets effectively
- Visualize route-level and trip-level patterns to understand urban mobility
- Strengthen my understanding of public infrastructure data

---

## 📊 Tools & Libraries

- `Python`
- `Pandas`
- `NumPy`
- `Matplotlib`
- `Seaborn`

---

## 📁 Folder Structure

📦 project-root/
┣ 📂 data/
┃ ┗ 📜 routes.txt, trips.txt, stops.txt, stop_times.txt, calendar.txt
┣ 📂 figures/
┃ ┗ 📊 All exported graphs (PNG format)
┣ 📜 mainproject.py
┗ 📜 README.md

--------------------------------------------------------------------------------
/Project_Code:
--------------------------------------------------------------------------------
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

DATA_DIR = r"D:\SEM4\Python Data Science\Delhi-Transport-Analysis\Data\Raw"

routes = pd.read_csv(os.path.join(DATA_DIR, "routes.txt"))
trips = pd.read_csv(os.path.join(DATA_DIR, "trips.txt"))
stop_times = pd.read_csv(os.path.join(DATA_DIR, "stop_times.txt"))
stops = pd.read_csv(os.path.join(DATA_DIR, "stops.txt"))
calendar = pd.read_csv(os.path.join(DATA_DIR, "calendar.txt"))

# All charts are written into 'figures/'; create it up front so savefig cannot fail
os.makedirs('figures', exist_ok=True)


# === Objective 1: Data Cleaning & Management ===

# Drop rows with missing arrival or departure time
stop_times.dropna(subset=['arrival_time', 'departure_time'], inplace=True)

# Keep only rows where the time looks like HH:MM:SS or HH:MM
stop_times = stop_times[
    stop_times['arrival_time'].str.contains(':') &
    stop_times['departure_time'].str.contains(':')
]

# Convert time strings to minutes from midnight. Note that GTFS allows hours
# beyond 24 (e.g. 25:10:00) for trips running past midnight, so the hour is
# deliberately not taken modulo 24.
def time_to_minutes(t):
    try:
        parts = t.split(':')
        return int(parts[0]) * 60 + int(parts[1])
    except (AttributeError, IndexError, ValueError):
        return np.nan

stop_times['arrival_minutes'] = stop_times['arrival_time'].apply(time_to_minutes)
stop_times['departure_minutes'] = stop_times['departure_time'].apply(time_to_minutes)

# Drop rows where the conversion failed
stop_times.dropna(subset=['arrival_minutes', 'departure_minutes'], inplace=True)

print("Data cleaned successfully! Shape after cleaning:", stop_times.shape)


# Remove rows missing critical fields such as route_id or route_short_name
routes.dropna(subset=['route_id', 'route_short_name'], inplace=True)

# Ensure route_id is a string (IDs are joined across files, so dtypes must match)
routes['route_id'] = routes['route_id'].astype(str)

print("Cleaned 'routes' — shape:", routes.shape)


# Drop rows with missing trip_id or route_id
trips.dropna(subset=['trip_id', 'route_id'], inplace=True)

# Convert IDs to string for consistency
trips['trip_id'] = trips['trip_id'].astype(str)
trips['route_id'] = trips['route_id'].astype(str)

print("Cleaned 'trips' — shape:", trips.shape)


# Drop rows with a missing service_id, then drop services not active on any day
calendar.dropna(subset=['service_id'], inplace=True)

day_cols = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday']
calendar = calendar[calendar[day_cols].sum(axis=1) > 0]

# Ensure service_id is a string
calendar['service_id'] = calendar['service_id'].astype(str)

print("Cleaned 'calendar' — shape:", calendar.shape)


# Drop rows with missing stop_id, stop_name, or coordinates
stops.dropna(subset=['stop_id', 'stop_name', 'stop_lat', 'stop_lon'], inplace=True)

# Convert stop_id to string and make sure coordinates are float
stops['stop_id'] = stops['stop_id'].astype(str)
stops['stop_lat'] = stops['stop_lat'].astype(float)
stops['stop_lon'] = stops['stop_lon'].astype(float)

print("Cleaned 'stops' — shape:", stops.shape)


# === Objective 2: Exploratory Data Analysis (EDA) ===

sns.set(style='whitegrid')

# --- 1. Distribution of Stop Locations (Latitude & Longitude) ---
plt.figure(figsize=(8, 6))
sns.scatterplot(data=stops, x='stop_lon', y='stop_lat', s=10, alpha=0.5)
plt.title('Bus Stops in Delhi (Longitude vs Latitude)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.tight_layout()
plt.savefig('figures/eda_stop_locations.png')
plt.close()

# --- 2. Histogram of Arrival Times ---
plt.figure(figsize=(10, 4))
sns.histplot(stop_times['arrival_minutes'].dropna(), bins=48, kde=True, color='teal')
plt.title('Distribution of Bus Arrival Times')
plt.xlabel('Minutes from Midnight')
plt.ylabel('Frequency')
plt.tight_layout()
plt.savefig('figures/eda_arrival_hist.png')
plt.close()

# --- 3. Number of Stops per Trip ---
sequence_stats = stop_times.groupby('trip_id')['stop_sequence'].count()

plt.figure(figsize=(8, 4))
sns.histplot(sequence_stats, bins=30, kde=True, color='darkorange')
plt.title('Number of Stops per Trip')
plt.xlabel('Stop Count')
plt.ylabel('Number of Trips')
plt.tight_layout()
plt.savefig('figures/eda_stop_counts.png')
plt.close()

print("EDA completed! Visualizations saved in the 'figures' folder.")


# === Objective 3: Summary Statistics ===

# 1. Summary of 'routes'
print("\n🔹 ROUTES Summary Statistics:")
print(routes.describe(include='all'))

# 2. Summary of 'trips'
print("\n🔹 TRIPS Summary Statistics:")
print(trips.describe(include='all'))

# Total number of unique trips
unique_trips = trips['trip_id'].nunique()
print(f"\nTotal Unique Trips: {unique_trips}")

# 3. Summary of 'calendar' — count services active on each day
print("\n🔹 CALENDAR Service Availability by Day:")
for day in day_cols:
    count = (calendar[day] == 1).sum()
    print(f"{day.capitalize()}: {count} services")

# 4. Summary of 'stops'
print("\n🔹 STOPS Summary Statistics:")
print(stops[['stop_lat', 'stop_lon']].describe())

# Count unique stops
print(f"\nTotal Unique Stops: {stops['stop_id'].nunique()}")

# Check for missing values in each dataset
print("\n🔹 Missing Values Check:")
print("Routes:\n", routes.isnull().sum())
print("Trips:\n", trips.isnull().sum())
print("Calendar:\n", calendar.isnull().sum())
print("Stops:\n", stops.isnull().sum())


# === Objective 4: Correlation & Covariance ===

# Drop rows where every calendar day column is NaN
calendar_days = calendar[day_cols].dropna(how='all')

# Drop rows with NaN in latitude or longitude
stops_coords = stops[['stop_lat', 'stop_lon']].dropna()

# ==== CALENDAR CORRELATION ====
if calendar_days.shape[0] >= 2:
    calendar_corr = calendar_days.corr()

    if not calendar_corr.isna().all().all():
        plt.figure(figsize=(8, 6))
        sns.heatmap(calendar_corr, annot=True, cmap='coolwarm', fmt='.2f')
        plt.title('Correlation between Service Days (Calendar)')
        plt.tight_layout()
        plt.savefig('figures/correlation_calendar.png')
        plt.close()
    else:
        print("All values in calendar correlation are NaN.")
else:
    print("Not enough valid rows in calendar data for correlation.")

# ==== CALENDAR COVARIANCE ====
if calendar_days.shape[0] >= 2:
    calendar_cov = calendar_days.cov()

    if not calendar_cov.isna().all().all():
        plt.figure(figsize=(8, 6))
        sns.heatmap(calendar_cov, annot=True, cmap='YlGnBu', fmt='.2f')
        plt.title('Covariance between Service Days (Calendar)')
        plt.tight_layout()
        plt.savefig('figures/covariance_calendar.png')
        plt.close()
    else:
        print("All values in calendar covariance are NaN.")
else:
    print("Not enough valid rows in calendar data for covariance.")

# ==== STOPS LAT/LON CORRELATION ====
if stops_coords.shape[0] >= 2:
    stops_corr = stops_coords.corr()

    if not stops_corr.isna().all().all():
        plt.figure(figsize=(5, 4))
        sns.heatmap(stops_corr, annot=True, cmap='magma', fmt='.2f')
        plt.title('Correlation between Stop Latitude and Longitude')
        plt.tight_layout()
        plt.savefig('figures/correlation_stop_coords.png')
        plt.close()
    else:
        print("All values in stop coordinates correlation are NaN.")
else:
    print("Not enough valid stop coordinates for correlation.")


# === Objective 5: Outlier Detection (Trip Duration Outliers) ===

# === Step 1: Compute Trip Durations ===

# Duration = last arrival minus first departure per trip. errors='coerce'
# turns malformed time strings into NaT, which dropna() removes below.
stop_times['arrival_time'] = pd.to_timedelta(stop_times['arrival_time'], errors='coerce')
stop_times['departure_time'] = pd.to_timedelta(stop_times['departure_time'], errors='coerce')

trip_durations = stop_times.groupby('trip_id').agg(
    start_time=('departure_time', 'min'),
    end_time=('arrival_time', 'max')
).dropna()

trip_durations['duration_min'] = (trip_durations['end_time'] - trip_durations['start_time']).dt.total_seconds() / 60

# Filter out unrealistic durations (non-positive, or longer than ~8 hours)
trip_durations = trip_durations[(trip_durations['duration_min'] > 0) & (trip_durations['duration_min'] < 500)]

# === Step 2: Detect Outliers using the 1.5 × IQR Rule ===

Q1 = trip_durations['duration_min'].quantile(0.25)
Q3 = trip_durations['duration_min'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = trip_durations[(trip_durations['duration_min'] < lower_bound) | (trip_durations['duration_min'] > upper_bound)]
non_outliers = trip_durations[(trip_durations['duration_min'] >= lower_bound) & (trip_durations['duration_min'] <= upper_bound)]

print(f"Total Trips Analyzed: {len(trip_durations)}")
print(f"Outliers Detected: {len(outliers)}")

# === Step 3: Visualize Outliers ===

# Boxplot
plt.figure(figsize=(8, 4))
sns.boxplot(x=trip_durations['duration_min'], color='skyblue')
plt.title('Boxplot of Trip Durations (in Minutes)')
plt.xlabel('Duration (minutes)')
plt.tight_layout()
plt.savefig('figures/trip_duration_boxplot.png')
plt.close()

# Histogram with outliers highlighted
plt.figure(figsize=(10, 6))
sns.histplot(non_outliers['duration_min'], bins=50, color='green', label='Normal Trips', kde=True)
sns.histplot(outliers['duration_min'], bins=50, color='red', label='Outliers', kde=True)
plt.title('Distribution of Trip Durations')
plt.xlabel('Duration (minutes)')
plt.ylabel('Number of Trips')
plt.legend()
plt.tight_layout()
plt.savefig('figures/trip_duration_outliers.png')
plt.close()


# === Objective 6: Route Popularity ===

# === Step 1: Count trips per route_id ===
route_trip_counts = trips['route_id'].value_counts().reset_index()
route_trip_counts.columns = ['route_id', 'num_trips']

# === Step 2: Select the top 20 routes (value_counts() is already sorted descending) ===
top_routes = route_trip_counts.head(20)

# === Step 3: Plot by route_id rather than route_short_name (IDs are guaranteed unique) ===
plt.figure(figsize=(12, 6))
sns.barplot(data=top_routes, x='route_id', y='num_trips', color='skyblue')
plt.title('Top 20 Most Popular Route IDs by Number of Trips')
plt.xlabel('Route ID')
plt.ylabel('Number of Trips')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('figures/top_routes_by_trips.png')
plt.show()

--------------------------------------------------------------------------------
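A possible refinement to the route-popularity step: the chart labels its bars by `route_id`, which is unambiguous but not reader-friendly. The sketch below is a hypothetical helper, not part of the repository's script; it joins the per-route trip counts with `routes.txt` to attach `route_short_name` labels (column names follow the GTFS spec), using a left join so routes missing a name still appear.

```python
import pandas as pd

def top_routes_with_names(trips: pd.DataFrame, routes: pd.DataFrame, n: int = 20) -> pd.DataFrame:
    """Count trips per route_id and attach route_short_name where available."""
    counts = trips["route_id"].value_counts().reset_index()
    # Positional rename works across pandas versions, where the reset column
    # may be called either 'route_id'/'count' or 'index'/'route_id'
    counts.columns = ["route_id", "num_trips"]
    # Left-join so routes absent from routes.txt still appear (name = NaN)
    merged = counts.merge(
        routes[["route_id", "route_short_name"]], on="route_id", how="left"
    )
    return merged.head(n)  # value_counts() already sorts descending
```

The returned frame can be passed straight to `sns.barplot(..., x='route_short_name', y='num_trips')` in place of `top_routes`, falling back to `route_id` for rows where the name is missing.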