├── README.md
├── index.py
└── pollution_data.xlsx

/README.md:
--------------------------------------------------------------------------------
# Pollution Data Analysis and Visualization

This project provides a comprehensive exploratory data analysis (EDA) of pollution data using Python. The goal is to analyze pollutant levels across different cities and over time to identify trends, hotspots, and seasonal variation.

## 📁 Dataset

The dataset used in this analysis is `pollution_data.xlsx`, which includes information on:

- **City**
- **Pollutant ID**
- **Pollutant Average**
- **Last Update (Date & Time)**

## 🧪 Key Steps in Analysis

1. **Data Cleaning**
   - Stripped leading/trailing whitespace from column names.
   - Converted the date field to datetime format.
   - Removed rows with missing pollutant averages.

2. **Data Exploration**
   - Summary statistics
   - Missing-value counts
   - Duplicate detection
   - Top and bottom rows preview

3. **Trend Analysis**
   - Average pollutant levels over time, shown as a heatmap.
   - Identification of the 10 cities with the highest PM2.5 levels.
   - Pollutant comparison across major cities (Delhi, Mumbai, Chennai, Kolkata).
   - Seasonal variation of PM2.5 levels, shown as boxplots.

4. **City-wise Heatmap**
   - Heatmap of average pollutant levels across all cities.

## 📊 Visualizations

- **Heatmap** of pollutant levels over time
- **Bar chart** of the top 10 PM2.5 hotspots
- **Grouped bar chart** comparing pollutants across major cities
- **Monthly boxplot** showing seasonal PM2.5 trends
- **Heatmap** of city-wise average pollutant levels

## 🛠️ Tools & Libraries

- Python 3.x
- Pandas
- NumPy
- Matplotlib
- Seaborn

## ▶️ How to Run

1. Clone the repository or download the files.
2. Make sure `pollution_data.xlsx` is in the same directory as the script.
3. Install the required packages:

   ```bash
   pip install pandas numpy matplotlib seaborn
   ```

4. Run the script:

   ```bash
   python index.py
   ```
--------------------------------------------------------------------------------
/index.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


df = pd.read_excel("pollution_data.xlsx")

# Strip whitespace from column names for easier access
df.columns = df.columns.str.strip()

# Data summary
print("Missing values per column\n")
print(df.isnull().sum())

print("Top 5 rows of dataset\n")
print(df.head())

print("Last 5 rows of dataset\n")
print(df.tail())

print("Information about the dataset\n")
df.info()  # info() prints directly and returns None, so no print() wrapper

print("Available columns\n")
print(df.columns.tolist(), "\n")

# Checking for duplicates
duplicates = df.duplicated().sum()
print("Duplicate rows:", duplicates)

print("Statistical summary\n")
print(df.describe())


# Convert 'last_update' to datetime; unparseable values become NaT
df['last_update'] = pd.to_datetime(df['last_update'], errors='coerce')

# Drop rows with missing average values
df = df.dropna(subset=['pollutant_avg'])

# Group by date and pollutant to get mean levels over time
trend_df = df.groupby([df['last_update'].dt.date, 'pollutant_id'])['pollutant_avg'].mean().unstack()

# Check if data exists
print("Grouped data preview:")
print(trend_df.head())

# Plot heatmap, reusing trend_df instead of recomputing the same groupby
plt.figure(figsize=(12, 6))
sns.heatmap(trend_df.T, cmap="YlGnBu", linewidths=0.5, linecolor='gray')

plt.title("Pollutant Levels Over Time (Heatmap)")
plt.xlabel("Date")
plt.ylabel("Pollutant")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


# Group by city and pollutant, then take the mean
hotspot_df = df.groupby(['city', 'pollutant_id'])['pollutant_avg'].mean().unstack()

# Find the cities with the highest average PM2.5
top_cities = hotspot_df.sort_values(by='PM2.5', ascending=False).head(10)

# Plot
top_cities.plot(kind='bar', figsize=(12, 6), title="Top 10 Pollution Hotspots by PM2.5")
plt.ylabel("Average PM2.5 Level")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


# Compare average pollutant levels across selected major cities,
# reusing hotspot_df (it already holds mean levels per city and pollutant)
selected_cities = ['Delhi', 'Mumbai', 'Chennai', 'Kolkata']
hotspot_df.loc[selected_cities].plot(kind='bar', figsize=(10, 6))
plt.title("Pollutant Comparison Across Major Cities")
plt.ylabel("Average Level")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


# Extract month name from the datetime column
df['month'] = df['last_update'].dt.month_name()

# Boxplot: seasonal variation for PM2.5, with months in calendar order
# (month_name() alone would plot months in order of appearance)
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
plt.figure(figsize=(12, 6))
sns.boxplot(x='month', y='pollutant_avg', order=month_order,
            data=df[df['pollutant_id'] == 'PM2.5'])
plt.title("Monthly Variation in PM2.5 Levels")
plt.ylabel("PM2.5")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

###################################################

# Heatmap of average pollutant levels per city

plt.figure(figsize=(14, 8))
sns.heatmap(hotspot_df.fillna(0).T, cmap="Reds", annot=False)
plt.title("Heatmap: Average Pollutant Levels Across Cities")
plt.xlabel("City")
plt.ylabel("Pollutant")
plt.tight_layout()
plt.show()
--------------------------------------------------------------------------------
/pollution_data.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/snehakundu00/python_project/1810c972268778558fc900727eadb2828e962fd5/pollution_data.xlsx
--------------------------------------------------------------------------------
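The script's two core moves, the cleaning steps listed in the README and the groupby → mean → unstack pivot that feeds every chart, can be sketched on a tiny made-up frame. The column names match `index.py`, but the cities, dates, and values below are invented for illustration and are not taken from `pollution_data.xlsx`:

```python
import pandas as pd

# Synthetic stand-in for the Excel data (values are invented)
raw = pd.DataFrame({
    "  city ": ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "last_update": ["2024-01-05 10:00", "2024-01-05 10:00",
                    "2024-01-05 10:00", "not a date"],
    "pollutant_id": ["PM2.5", "NO2", "PM2.5", "NO2"],
    "pollutant_avg": [120.0, 40.0, 80.0, None],
})

# Cleaning: strip column names, coerce dates (bad ones become NaT),
# and drop rows with a missing average
raw.columns = raw.columns.str.strip()
raw["last_update"] = pd.to_datetime(raw["last_update"], errors="coerce")
clean = raw.dropna(subset=["pollutant_avg"])

# Pivot: one row per city, one column per pollutant, mean level in each cell
pivot = clean.groupby(["city", "pollutant_id"])["pollutant_avg"].mean().unstack()
print(pivot.loc["Delhi", "PM2.5"])  # 120.0
```

The real script applies the same pivot with dates or cities on the index; city/pollutant pairs absent from the data simply come out as NaN in the pivoted table.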