├── README.md
└── pythoncode.py

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Predictive Analysis and Visualization of Stolen Vehicle Patterns

## Project Overview
This project analyzes real-world vehicle theft data to uncover patterns, trends, and high-risk factors using Python. Key objectives include identifying frequently stolen vehicle types, colors, and model years, as well as geographical hotspots. The analysis leverages **Exploratory Data Analysis (EDA)** and visualization techniques to provide actionable insights for law enforcement, policymakers, and vehicle owners.

## Key Features
- **Location-Based Analysis**: Identifies states and cities with the highest theft incidents.
- **Time-Based Trends**: Examines seasonal and temporal patterns in vehicle thefts.
- **Vehicle Attributes**: Highlights vulnerable makes, types, and colors.
- **Statistical Insights**: Uses correlation matrices and boxplots to detect outliers and relationships.
- **Interactive Visualizations**: Bar charts, heatmaps, and boxplots to communicate findings effectively.

## Tools & Technologies
- **Python Libraries**: Pandas, NumPy, Matplotlib, Seaborn, SciPy, statsmodels.
- **Data Source**: [Motor Vehicle Thefts Dataset](https://mavenanalytics.io/data-playground?order=date_added%2Cdesc&search=Motor%20Vehicle%20Thefts).

## Future Scope
- Predictive modeling (e.g., Random Forest, LSTM) for theft risk forecasting.
- Interactive dashboards (Power BI/Tableau) for real-time monitoring.
- Integration with urban/traffic data to enhance the analysis.

## Repository Structure
- `data/`: Dataset used (`cleaned_stolen_vehicles.csv`).
- `pythoncode.py`: Python script with the data cleaning, statistics, and visualization code.
- `notebooks/`: Jupyter notebooks for EDA and analysis.
- `visualizations/`: Generated plots and charts.
- `README.md`: Project documentation (this file).

## How to Use
1. Clone the repository.
2. Install dependencies:
   ```bash
   pip install pandas numpy matplotlib seaborn scipy statsmodels
   ```
3. Run `pythoncode.py` or the Jupyter notebooks to reproduce the analysis.

## Acknowledgments
Special thanks to **Baljinder Kaur** (Lovely Professional University) for guidance, and to Maven Analytics for the dataset.

---

### Notes for GitHub
- Replace placeholder paths (e.g., `data/`, `notebooks/`) with your actual file structure.
- Add a `requirements.txt` file if needed for dependencies (an example is sketched below).
- Include screenshots of visualizations in the README for better engagement.
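### Example `requirements.txt`
A minimal sketch of the dependency file mentioned in the notes above. It simply lists the libraries that `pythoncode.py` imports; versions are left unpinned on the assumption that recent releases of each will work.

```text
pandas
numpy
matplotlib
seaborn
scipy
statsmodels
```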
--------------------------------------------------------------------------------
/pythoncode.py:
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import skew, kurtosis
from scipy.stats import norm  # needed by the custom z_test below
from statsmodels.stats.weightstats import ztest
from scipy.stats import ttest_ind, ttest_1samp, chi2_contingency
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv("cleaned_stolen_vehicles.csv", encoding='utf-8-sig')

# Dataset Overview
print("Dataset Overview:")
print("\nDataset Shape:", df.shape)
print("Column Names:\n", df.columns.tolist())
print("\nFirst 5 rows:\n", df.head())
print("\nMissing values:\n", df.isnull().sum())
print("\nInitial dataset info:\n")
print(df.info())

# Data cleaning
df = df.drop_duplicates()
# Fill numeric columns with the column mean
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
# Fill categorical columns with the mode
categorical_cols = df.select_dtypes(include='object').columns
for col in categorical_cols:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].mode()[0])
# Rename columns for consistency
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
# Convert 'date_stolen' to datetime
if 'date_stolen' in df.columns:
    df['date_stolen'] = pd.to_datetime(df['date_stolen'], errors='coerce')
# Final check
print("\nCleaned dataset info:\n")
print(df.info())
df.to_csv('cleaned_stolen_vehicles.csv', index=False)

# Numerical Summary
numerical = df.select_dtypes(include=[np.number])
print("\nNumerical Summary:")
print(numerical.describe().T)

# Advanced Stats: Skewness & Kurtosis
print("\nSkewness & Kurtosis:")
for col in numerical.columns:
    print(f"\n{col}")
    print(f"Skewness : {skew(df[col].dropna()):.2f}")
    print(f"Kurtosis : {kurtosis(df[col].dropna()):.2f}")

# Central Tendency + Spread
print("\nCentral Tendency + Spread:")
for col in numerical.columns:
    col_data = df[col].dropna()
    print(f"\n{col}")
    print(f"Mean    : {col_data.mean():.2f}")
    print(f"Median  : {col_data.median():.2f}")
    print(f"Mode    : {col_data.mode().iloc[0] if not col_data.mode().empty else 'N/A'}")
    print(f"Min     : {col_data.min()}")
    print(f"Max     : {col_data.max()}")
    print(f"Range   : {col_data.max() - col_data.min()}")
    print(f"Std Dev : {col_data.std():.2f}")

# Categorical Summary
categorical = df.select_dtypes(include=['object'])
print("\nCategorical Summary:")
for col in categorical.columns:
    print(f"\n{col}")
    print(f"Unique Values : {df[col].nunique()}")
    print(f"Top Value     : {df[col].mode()[0] if not df[col].mode().empty else 'N/A'}")
    print(f"Top Frequency : {df[col].value_counts().iloc[0] if not df[col].value_counts().empty else 'N/A'}")

# Missing Data Overview
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print("\nMissing Value Report:")
if not missing.empty:
    print(missing)
else:
    print("No missing values found!")

# Correlation Matrix (for numerical features)
print("\nCorrelation Matrix:\n")
print(numerical.corr().round(2))

# Outlier Detection using IQR Method (for model_year)
print("\nOutlier Detection (IQR Method) for Model Year:")
col = 'model_year'
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
print(f"\nColumn: {col}")
print(f"Outliers Detected: {outliers.shape[0]}")
print(f"Lower Bound: {lower_bound:.2f}, Upper Bound: {upper_bound:.2f}")

# ──────────────────────────────────────────────
# Statistical Tests
# ──────────────────────────────────────────────

# 7. Custom Z-Test implementation (one-sample, two-tailed)
def z_test(sample, popmean):
    sample_mean = np.mean(sample)
    sample_std = np.std(sample, ddof=1)
    n = len(sample)
    z = (sample_mean - popmean) / (sample_std / np.sqrt(n))
    p = 2 * (1 - norm.cdf(abs(z)))  # Two-tailed test
    return z, p
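# The z_test helper above is defined but never called in the original script.
# A minimal usage sketch follows; the hypothesized population mean of 2010 for
# model_year is purely an illustrative assumption, not a value taken from the data.
if 'model_year' in df.columns:
    sample = df['model_year'].dropna()
    z_stat, p_val = z_test(sample, popmean=2010)  # hypothetical population mean
    print("\nZ-Test: Is the mean model year different from 2010?")
    print(f"Z-statistic = {z_stat:.2f}")
    print(f"P-value = {p_val:.4f}")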
# 8. T-Test: Difference in model year between Trailers and Roadbikes
trailers = df[df['vehicle_type'] == 'Trailer']['model_year'].dropna()
roadbikes = df[df['vehicle_type'] == 'Roadbike']['model_year'].dropna()

if len(trailers) > 1 and len(roadbikes) > 1:  # Need at least 2 samples per group for a t-test
    t_stat, p_val = ttest_ind(trailers, roadbikes, equal_var=False)
    print("\nT-Test: Difference in model year between Trailers and Roadbikes")
    print(f"T-statistic = {t_stat:.2f}")
    print(f"P-value = {p_val:.4f}")

    if p_val < 0.05:
        print("Result: Statistically significant difference in means (reject H0)")
    else:
        print("Result: No statistically significant difference (fail to reject H0)")
else:
    print("\nInsufficient data for T-Test between Trailers and Roadbikes")

# 9. Chi-Square Test: Is vehicle_type independent of color?
if df['vehicle_type'].nunique() > 1 and df['color'].nunique() > 1:
    contingency_table = pd.crosstab(df['vehicle_type'], df['color'])
    if contingency_table.size > 0:  # Check that the table has data
        chi2, p_val, dof, expected = chi2_contingency(contingency_table)
        print("\nChi-Square Test: Is vehicle_type independent of color?")
        print(f"Chi2 Statistic = {chi2:.2f}")
        print(f"P-value = {p_val:.4f}")
    else:
        print("\nNot enough data for Chi-Square Test")
else:
    print("\nNot enough categories for Chi-Square Test")

# ──────────────────────────────────────────────
# Visualizations
# ──────────────────────────────────────────────

# 10. Line Plot - Theft Trends Over Time
plt.figure(figsize=(10, 5))
# Convert date_stolen to datetime if needed
df['date_stolen'] = pd.to_datetime(df['date_stolen'], errors='coerce')
df = df.sort_values('date_stolen')
df['date_stolen'].value_counts().sort_index().plot(kind='line', color='yellow')
plt.grid(True, linestyle='--', alpha=0.5)
plt.title("Daily Vehicle Thefts Over Time")
plt.xlabel("Date")
plt.ylabel("Number of Thefts")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
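# The daily counts above can be noisy; aggregating by month often shows the
# seasonal pattern mentioned in the README more clearly. This is an optional
# sketch, assuming date_stolen parsed cleanly; rows with unparsable dates are
# dropped before resampling.
monthly = (
    df.dropna(subset=['date_stolen'])
      .set_index('date_stolen')
      .resample('M')
      .size()
)
plt.figure(figsize=(10, 5))
monthly.plot(kind='line', marker='o', color='teal')
plt.title("Monthly Vehicle Thefts Over Time")
plt.xlabel("Month")
plt.ylabel("Number of Thefts")
plt.tight_layout()
plt.show()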
# 11. Bar Plot - Top 10 Most Stolen Vehicle Types
plt.figure(figsize=(10, 5))
df['vehicle_type'].value_counts().head(10).plot(kind='bar', color='grey')
plt.title("Top 10 Most Stolen Vehicle Types")
plt.xlabel("Vehicle Type")
plt.ylabel("Number of Thefts")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# 12. Histogram - Distribution of Model Years
plt.figure(figsize=(10, 6))
plt.hist(df['model_year'], bins=20, color='orange', edgecolor='red')
plt.xlabel("Model Year")
plt.ylabel("Frequency")
plt.title("Distribution of Stolen Vehicle Model Years")
plt.tight_layout()
plt.show()

# 13. Pie Chart - Proportion of Thefts by Vehicle Type (top 6 types)
vehicle_counts = df['vehicle_type'].value_counts().head(6)
plt.figure(figsize=(7, 4))
plt.pie(vehicle_counts, labels=vehicle_counts.index, autopct='%1.1f%%', startangle=140)
plt.title("Theft Distribution by Vehicle Type")
plt.axis('equal')
plt.tight_layout()
plt.show()

# 14. Box Plot - Model Year Distribution by Vehicle Type
plt.figure(figsize=(8, 6))
top_types = df['vehicle_type'].value_counts().head(10).index
sns.boxplot(x='vehicle_type', y='model_year', data=df[df['vehicle_type'].isin(top_types)], palette="Set3")
plt.title("Model Year Distribution by Vehicle Type")
plt.xlabel("Vehicle Type")
plt.ylabel("Model Year")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# 15. Heatmap - Correlation Matrix of Numerical Variables
plt.figure(figsize=(8, 6))
corr_matrix = df.select_dtypes(include='number').corr()
sns.heatmap(corr_matrix, annot=True, linewidths=0.5, cmap="Spectral")
plt.title("Correlation Heatmap of Numerical Features")
plt.tight_layout()
plt.show()

# 16. Count Plot - Top Vehicle Colors (bars drawn in the actual vehicle colors)
plt.figure(figsize=(10, 5))
top_colors = df['color'].value_counts().head(10).index
color_palette = {
    "Silver": "#C0C0C0",
    "White": "#FFFFFF",
    "Black": "#000000",
    "Blue": "#0000FF",
    "Red": "#FF0000",
    "Grey": "#808080",
    "Green": "#008000",
    "Gold": "#FFD700",
    "Brown": "#A52A2A",
    "Yellow": "#FFFF00"
}
# Fall back to a neutral grey for any color not covered by the mapping above
palette = [color_palette.get(color, "#999999") for color in top_colors]

sns.countplot(
    y='color',
    data=df[df['color'].isin(top_colors)],
    order=top_colors,
    palette=palette,
    edgecolor='black'
)

plt.title("Top 10 Colors of Stolen Vehicles (True Colors)")
plt.xlabel("Count")
plt.ylabel("Color")
plt.tight_layout()
plt.show()

# 17. Violin Plot - Model Year Distribution for Top Vehicle Types
top_types = df['vehicle_type'].value_counts().head(5).index
df_top_types = df[df['vehicle_type'].isin(top_types)]
plt.figure(figsize=(10, 6))
sns.violinplot(x='vehicle_type', y='model_year', data=df_top_types, palette='pastel')
plt.title("Model Year Distribution for Top Stolen Vehicle Types")
plt.xlabel("Vehicle Type")
plt.ylabel("Model Year")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
--------------------------------------------------------------------------------