├── README.md
├── index.py
└── pollution_data.xlsx

/README.md:
--------------------------------------------------------------------------------
# Pollution Data Analysis and Visualization

This project provides a comprehensive exploratory data analysis (EDA) of pollution data using Python. The goal is to analyze pollutant levels across different cities and over time to identify trends, hotspots, and seasonal variation.

## 📁 Dataset

The dataset used in this analysis is `pollution_data.xlsx`, which includes information on:

- **City**
- **Pollutant ID**
- **Pollutant Average**
- **Last Update (Date & Time)**

## 🧪 Key Steps in Analysis

1. **Data Cleaning**
   - Stripped leading/trailing whitespace from column names.
   - Converted the date field to datetime format.
   - Removed rows with missing pollutant averages.

2. **Data Exploration**
   - Summary statistics
   - Missing-value counts
   - Duplicate detection
   - Top and bottom rows preview

3. **Trend Analysis**
   - Average pollutant levels over time, shown as a heatmap.
   - Identification of the 10 cities with the highest PM2.5 levels.
   - Pollutant comparison across major cities (Delhi, Mumbai, Chennai, Kolkata).
   - Seasonal variation of PM2.5 levels, shown as boxplots.

4. **City-wise Heatmap**
   - Heatmap of average pollutant levels across all cities.

## 📊 Visualizations

- **Heatmap** of pollutant levels over time
- **Bar chart** of the top 10 PM2.5 hotspots
- **Grouped bar chart** comparing pollutants across major cities
- **Monthly boxplot** showing seasonal PM2.5 trends
- **Heatmap** of city-wise average pollutant levels

## 🛠️ Tools & Libraries

- Python 3.x
- Pandas
- NumPy
- Matplotlib
- Seaborn

## ▶️ How to Run

1. Clone the repository or download the files.
2. Make sure `pollution_data.xlsx` is in the same directory as the script.
3. Install the required packages:

   ```bash
   pip install pandas numpy matplotlib seaborn
   ```

4. Run the script:

   ```bash
   python index.py
   ```
--------------------------------------------------------------------------------
/index.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


df = pd.read_excel("pollution_data.xlsx")

# Strip whitespace from column names for easier access
df.columns = df.columns.str.strip()

# Data summary
print("Missing values per column\n")
print(df.isnull().sum())

print("Top 5 rows of dataset\n")
print(df.head())

print("Last 5 rows of dataset\n")
print(df.tail())

print("Information about the dataset\n")
df.info()  # info() prints directly and returns None, so no print() wrapper

print("Available columns\n")
print(df.columns.tolist(), "\n")

# Checking for duplicates
duplicates = df.duplicated().sum()
print("Duplicate rows:", duplicates)

print("Statistical summary\n")
print(df.describe())


# Convert 'last_update' to datetime; unparseable values become NaT
df['last_update'] = pd.to_datetime(df['last_update'], errors='coerce')

# Drop rows with missing average values
df = df.dropna(subset=['pollutant_avg'])

# Group by date and pollutant to get mean levels over time
trend_df = df.groupby([df['last_update'].dt.date, 'pollutant_id'])['pollutant_avg'].mean().unstack()

# Check if data exists
print("Grouped data preview:")
print(trend_df.head())

# Plot heatmap, reusing trend_df instead of recomputing the same groupby
plt.figure(figsize=(12, 6))
sns.heatmap(trend_df.T, cmap="YlGnBu", linewidths=0.5, linecolor='gray')

plt.title("Pollutant Levels Over Time (Heatmap)")
plt.xlabel("Date")
plt.ylabel("Pollutant")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


# Group by city and pollutant, then take the mean
hotspot_df = df.groupby(['city', 'pollutant_id'])['pollutant_avg'].mean().unstack()

# Find the cities with the highest average PM2.5
top_cities = hotspot_df.sort_values(by='PM2.5', ascending=False).head(10)

# Plot
top_cities.plot(kind='bar', figsize=(12, 6), title="Top 10 Pollution Hotspots by PM2.5")
plt.ylabel("Average PM2.5 Level")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


# Compare average pollutant levels across selected major cities,
# reusing hotspot_df (it already holds mean levels per city and pollutant)
selected_cities = ['Delhi', 'Mumbai', 'Chennai', 'Kolkata']
hotspot_df.loc[selected_cities].plot(kind='bar', figsize=(10, 6))
plt.title("Pollutant Comparison Across Major Cities")
plt.ylabel("Average Level")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


# Extract month name from the datetime column
df['month'] = df['last_update'].dt.month_name()

# Boxplot: seasonal variation for PM2.5, with months in calendar order
# (month_name() alone would plot months in order of appearance)
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
plt.figure(figsize=(12, 6))
sns.boxplot(x='month', y='pollutant_avg', order=month_order,
            data=df[df['pollutant_id'] == 'PM2.5'])
plt.title("Monthly Variation in PM2.5 Levels")
plt.ylabel("PM2.5")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

###################################################

# Heatmap of average pollutant levels per city

plt.figure(figsize=(14, 8))
sns.heatmap(hotspot_df.fillna(0).T, cmap="Reds", annot=False)
plt.title("Heatmap: Average Pollutant Levels Across Cities")
plt.xlabel("City")
plt.ylabel("Pollutant")
plt.tight_layout()
plt.show()
--------------------------------------------------------------------------------
/pollution_data.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/snehakundu00/python_project/1810c972268778558fc900727eadb2828e962fd5/pollution_data.xlsx
--------------------------------------------------------------------------------
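The script's two core moves, the cleaning steps listed in the README and the groupby → mean → unstack pivot that feeds every chart, can be sketched on a tiny made-up frame. The column names match `index.py`, but the cities, dates, and values below are invented for illustration and are not taken from `pollution_data.xlsx`:

```python
import pandas as pd

# Synthetic stand-in for the Excel data (values are invented)
raw = pd.DataFrame({
    "  city ": ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "last_update": ["2024-01-05 10:00", "2024-01-05 10:00",
                    "2024-01-05 10:00", "not a date"],
    "pollutant_id": ["PM2.5", "NO2", "PM2.5", "NO2"],
    "pollutant_avg": [120.0, 40.0, 80.0, None],
})

# Cleaning: strip column names, coerce dates (bad ones become NaT),
# and drop rows with a missing average
raw.columns = raw.columns.str.strip()
raw["last_update"] = pd.to_datetime(raw["last_update"], errors="coerce")
clean = raw.dropna(subset=["pollutant_avg"])

# Pivot: one row per city, one column per pollutant, mean level in each cell
pivot = clean.groupby(["city", "pollutant_id"])["pollutant_avg"].mean().unstack()
print(pivot.loc["Delhi", "PM2.5"])  # 120.0
```

The real script applies the same pivot with dates or cities on the index; city/pollutant pairs absent from the data simply come out as NaN in the pivoted table.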