├── README.md
└── pythoncode.py

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Predictive Analysis and Visualization of Stolen Vehicle Patterns

## Project Overview
This project analyzes real-world vehicle theft data to uncover patterns, trends, and high-risk factors using Python. Key objectives include identifying frequently stolen vehicle types, colors, and model years, as well as geographical hotspots. The analysis leverages **Exploratory Data Analysis (EDA)** and visualization techniques to provide actionable insights for law enforcement, policymakers, and vehicle owners.

## Key Features
- **Location-Based Analysis**: Identifies states and cities with the highest theft incidents.
- **Time-Based Trends**: Examines seasonal and temporal patterns in vehicle thefts.
- **Vehicle Attributes**: Highlights vulnerable makes, types, and colors.
- **Statistical Insights**: Uses correlation matrices and boxplots to detect outliers and relationships.
- **Interactive Visualizations**: Bar charts, heatmaps, and boxplots to communicate findings effectively.

## Tools & Technologies
- **Python Libraries**: Pandas, NumPy, Matplotlib, Seaborn, SciPy, statsmodels.
- **Data Source**: [Motor Vehicle Thefts Dataset](https://mavenanalytics.io/data-playground?order=date_added%2Cdesc&search=Motor%20Vehicle%20Thefts).

## Future Scope
- Predictive modeling (e.g., Random Forest, LSTM) for theft risk forecasting.
- Interactive dashboards (Power BI/Tableau) for real-time monitoring.
- Integration with urban/traffic data to enhance the analysis.

## Repository Structure
- `data/`: Dataset used (`cleaned_stolen_vehicles.csv`).
- `pythoncode.py`: Python script with the data cleaning, statistics, and visualization code.
- `notebooks/`: Jupyter notebooks for EDA and analysis.
- `visualizations/`: Generated plots and charts.
- `README.md`: Project documentation (this file).

## How to Use
1. Clone the repository.
2. Install dependencies:
   ```bash
   pip install pandas numpy matplotlib seaborn scipy statsmodels
   ```
3. Run `pythoncode.py` or the Jupyter notebooks to reproduce the analysis.

## Acknowledgments
Special thanks to **Baljinder Kaur** (Lovely Professional University) for guidance, and to Maven Analytics for the dataset.

---

### Notes for GitHub
- Replace placeholder paths (e.g., `data/`, `notebooks/`) with your actual file structure.
- Add a `requirements.txt` file if needed for dependencies (an example is sketched below).
- Include screenshots of visualizations in the README for better engagement.
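### Example `requirements.txt`
A minimal sketch of the dependency file mentioned in the notes above. It simply lists the libraries that `pythoncode.py` imports; versions are left unpinned on the assumption that recent releases of each will work.

```text
pandas
numpy
matplotlib
seaborn
scipy
statsmodels
```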
--------------------------------------------------------------------------------
/pythoncode.py:
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import skew, kurtosis
from scipy.stats import norm  # needed by the custom z_test below
from statsmodels.stats.weightstats import ztest
from scipy.stats import ttest_ind, ttest_1samp, chi2_contingency
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv("cleaned_stolen_vehicles.csv", encoding='utf-8-sig')

# Dataset Overview
print("Dataset Overview:")
print("\nDataset Shape:", df.shape)
print("Column Names:\n", df.columns.tolist())
print("\nFirst 5 rows:\n", df.head())
print("\nMissing values:\n", df.isnull().sum())
print("\nInitial dataset info:\n")
print(df.info())

# Data cleaning
df = df.drop_duplicates()
# Fill numeric columns with the column mean
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
# Fill categorical columns with the mode
categorical_cols = df.select_dtypes(include='object').columns
for col in categorical_cols:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].mode()[0])
# Rename columns for consistency
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
# Convert 'date_stolen' to datetime
if 'date_stolen' in df.columns:
    df['date_stolen'] = pd.to_datetime(df['date_stolen'], errors='coerce')
# Final check
print("\nCleaned dataset info:\n")
print(df.info())
df.to_csv('cleaned_stolen_vehicles.csv', index=False)

# Numerical Summary
numerical = df.select_dtypes(include=[np.number])
print("\nNumerical Summary:")
print(numerical.describe().T)

# Advanced Stats: Skewness & Kurtosis
print("\nSkewness & Kurtosis:")
for col in numerical.columns:
    print(f"\n{col}")
    print(f"Skewness : {skew(df[col].dropna()):.2f}")
    print(f"Kurtosis : {kurtosis(df[col].dropna()):.2f}")

# Central Tendency + Spread
print("\nCentral Tendency + Spread:")
for col in numerical.columns:
    col_data = df[col].dropna()
    print(f"\n{col}")
    print(f"Mean    : {col_data.mean():.2f}")
    print(f"Median  : {col_data.median():.2f}")
    print(f"Mode    : {col_data.mode().iloc[0] if not col_data.mode().empty else 'N/A'}")
    print(f"Min     : {col_data.min()}")
    print(f"Max     : {col_data.max()}")
    print(f"Range   : {col_data.max() - col_data.min()}")
    print(f"Std Dev : {col_data.std():.2f}")

# Categorical Summary
categorical = df.select_dtypes(include=['object'])
print("\nCategorical Summary:")
for col in categorical.columns:
    print(f"\n{col}")
    print(f"Unique Values : {df[col].nunique()}")
    print(f"Top Value     : {df[col].mode()[0] if not df[col].mode().empty else 'N/A'}")
    print(f"Top Frequency : {df[col].value_counts().iloc[0] if not df[col].value_counts().empty else 'N/A'}")

# Missing Data Overview
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print("\nMissing Value Report:")
if not missing.empty:
    print(missing)
else:
    print("No missing values found!")

# Correlation Matrix (for numerical features)
print("\nCorrelation Matrix:\n")
print(numerical.corr().round(2))

# Outlier Detection using IQR Method (for model_year)
print("\nOutlier Detection (IQR Method) for Model Year:")
col = 'model_year'
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
print(f"\nColumn: {col}")
print(f"Outliers Detected: {outliers.shape[0]}")
print(f"Lower Bound: {lower_bound:.2f}, Upper Bound: {upper_bound:.2f}")

# ──────────────────────────────────────────────
# Statistical Tests
# ──────────────────────────────────────────────

# 7. Custom Z-Test implementation (one-sample, two-tailed)
def z_test(sample, popmean):
    sample_mean = np.mean(sample)
    sample_std = np.std(sample, ddof=1)
    n = len(sample)
    z = (sample_mean - popmean) / (sample_std / np.sqrt(n))
    p = 2 * (1 - norm.cdf(abs(z)))  # Two-tailed test
    return z, p
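# The z_test helper above is defined but never called in the original script.
# A minimal usage sketch follows; the hypothesized population mean of 2010 for
# model_year is purely an illustrative assumption, not a value taken from the data.
if 'model_year' in df.columns:
    sample = df['model_year'].dropna()
    z_stat, p_val = z_test(sample, popmean=2010)  # hypothetical population mean
    print("\nZ-Test: Is the mean model year different from 2010?")
    print(f"Z-statistic = {z_stat:.2f}")
    print(f"P-value = {p_val:.4f}")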
# 8. T-Test: Difference in model year between Trailers and Roadbikes
trailers = df[df['vehicle_type'] == 'Trailer']['model_year'].dropna()
roadbikes = df[df['vehicle_type'] == 'Roadbike']['model_year'].dropna()

if len(trailers) > 1 and len(roadbikes) > 1:  # Need at least 2 samples per group for a t-test
    t_stat, p_val = ttest_ind(trailers, roadbikes, equal_var=False)
    print("\nT-Test: Difference in model year between Trailers and Roadbikes")
    print(f"T-statistic = {t_stat:.2f}")
    print(f"P-value = {p_val:.4f}")

    if p_val < 0.05:
        print("Result: Statistically significant difference in means (reject H0)")
    else:
        print("Result: No statistically significant difference (fail to reject H0)")
else:
    print("\nInsufficient data for T-Test between Trailers and Roadbikes")

# 9. Chi-Square Test: Is vehicle_type independent of color?
if df['vehicle_type'].nunique() > 1 and df['color'].nunique() > 1:
    contingency_table = pd.crosstab(df['vehicle_type'], df['color'])
    if contingency_table.size > 0:  # Check that the table has data
        chi2, p_val, dof, expected = chi2_contingency(contingency_table)
        print("\nChi-Square Test: Is vehicle_type independent of color?")
        print(f"Chi2 Statistic = {chi2:.2f}")
        print(f"P-value = {p_val:.4f}")
    else:
        print("\nNot enough data for Chi-Square Test")
else:
    print("\nNot enough categories for Chi-Square Test")

# ──────────────────────────────────────────────
# Visualizations
# ──────────────────────────────────────────────

# 10. Line Plot - Theft Trends Over Time
plt.figure(figsize=(10, 5))
# Convert date_stolen to datetime if needed
df['date_stolen'] = pd.to_datetime(df['date_stolen'], errors='coerce')
df = df.sort_values('date_stolen')
df['date_stolen'].value_counts().sort_index().plot(kind='line', color='yellow')
plt.grid(True, linestyle='--', alpha=0.5)
plt.title("Daily Vehicle Thefts Over Time")
plt.xlabel("Date")
plt.ylabel("Number of Thefts")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
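# The daily counts above can be noisy; aggregating by month often shows the
# seasonal pattern mentioned in the README more clearly. This is an optional
# sketch, assuming date_stolen parsed cleanly; rows with unparsable dates are
# dropped before resampling.
monthly = (
    df.dropna(subset=['date_stolen'])
      .set_index('date_stolen')
      .resample('M')
      .size()
)
plt.figure(figsize=(10, 5))
monthly.plot(kind='line', marker='o', color='teal')
plt.title("Monthly Vehicle Thefts Over Time")
plt.xlabel("Month")
plt.ylabel("Number of Thefts")
plt.tight_layout()
plt.show()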
# 11. Bar Plot - Top 10 Most Stolen Vehicle Types
plt.figure(figsize=(10, 5))
df['vehicle_type'].value_counts().head(10).plot(kind='bar', color='grey')
plt.title("Top 10 Most Stolen Vehicle Types")
plt.xlabel("Vehicle Type")
plt.ylabel("Number of Thefts")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# 12. Histogram - Distribution of Model Years
plt.figure(figsize=(10, 6))
plt.hist(df['model_year'], bins=20, color='orange', edgecolor='red')
plt.xlabel("Model Year")
plt.ylabel("Frequency")
plt.title("Distribution of Stolen Vehicle Model Years")
plt.tight_layout()
plt.show()

# 13. Pie Chart - Proportion of Thefts by Vehicle Type (top 6 types)
vehicle_counts = df['vehicle_type'].value_counts().head(6)
plt.figure(figsize=(7, 4))
plt.pie(vehicle_counts, labels=vehicle_counts.index, autopct='%1.1f%%', startangle=140)
plt.title("Theft Distribution by Vehicle Type")
plt.axis('equal')
plt.tight_layout()
plt.show()

# 14. Box Plot - Model Year Distribution by Vehicle Type
plt.figure(figsize=(8, 6))
top_types = df['vehicle_type'].value_counts().head(10).index
sns.boxplot(x='vehicle_type', y='model_year', data=df[df['vehicle_type'].isin(top_types)], palette="Set3")
plt.title("Model Year Distribution by Vehicle Type")
plt.xlabel("Vehicle Type")
plt.ylabel("Model Year")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# 15. Heatmap - Correlation Matrix of Numerical Variables
plt.figure(figsize=(8, 6))
corr_matrix = df.select_dtypes(include='number').corr()
sns.heatmap(corr_matrix, annot=True, linewidths=0.5, cmap="Spectral")
plt.title("Correlation Heatmap of Numerical Features")
plt.tight_layout()
plt.show()

# 16. Count Plot - Top Vehicle Colors (bars drawn in the actual vehicle colors)
plt.figure(figsize=(10, 5))
top_colors = df['color'].value_counts().head(10).index
color_palette = {
    "Silver": "#C0C0C0",
    "White": "#FFFFFF",
    "Black": "#000000",
    "Blue": "#0000FF",
    "Red": "#FF0000",
    "Grey": "#808080",
    "Green": "#008000",
    "Gold": "#FFD700",
    "Brown": "#A52A2A",
    "Yellow": "#FFFF00"
}
# Fall back to a neutral grey for any color not covered by the mapping above
palette = [color_palette.get(color, "#999999") for color in top_colors]

sns.countplot(
    y='color',
    data=df[df['color'].isin(top_colors)],
    order=top_colors,
    palette=palette,
    edgecolor='black'
)

plt.title("Top 10 Colors of Stolen Vehicles (True Colors)")
plt.xlabel("Count")
plt.ylabel("Color")
plt.tight_layout()
plt.show()

# 17. Violin Plot - Model Year Distribution for Top Vehicle Types
top_types = df['vehicle_type'].value_counts().head(5).index
df_top_types = df[df['vehicle_type'].isin(top_types)]
plt.figure(figsize=(10, 6))
sns.violinplot(x='vehicle_type', y='model_year', data=df_top_types, palette='pastel')
plt.title("Model Year Distribution for Top Stolen Vehicle Types")
plt.xlabel("Vehicle Type")
plt.ylabel("Model Year")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
--------------------------------------------------------------------------------