├── excel project (3).xlsx
├── EDA PYTHON PROJECT REPORT .doc
├── README.md
└── eda project python code.py


/excel project (3).xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PrabhkiratSingh123/EDA-PROJECT-RELATED-TO-Demographics-An-Exploratory-Analysis-of-Census-Data/HEAD/excel project (3).xlsx


--------------------------------------------------------------------------------
/EDA PYTHON PROJECT REPORT .doc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PrabhkiratSingh123/EDA-PROJECT-RELATED-TO-Demographics-An-Exploratory-Analysis-of-Census-Data/HEAD/EDA PYTHON PROJECT REPORT .doc


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# EDA-PROJECT-RELATED-TO-Demographics-An-Exploratory-Analysis-of-Census-Data
# 🧠 DATA SCIENCE TOOLBOX: PYTHON PROGRAMMING
## 📊 Demographics: An Exploratory Analysis of Census Data

**Project Semester:** January–April 2025
**Course Code:** INT375
**Program & Section:** B.Tech (CSE) - K23GD
**Submitted By:** Prabhkirat Singh (Reg. No: 12309872)
**Guided By:** Mrs. Baljinder Kaur, Discipline of CSE/IT, Lovely Professional University

## 📘 Project Overview

This project conducts an **Exploratory Data Analysis (EDA)** of the Indian Census dataset to uncover insights about the population based on factors such as:

- Gender distribution
- Urban vs. Rural residence
- Place of birth
- Age groups
- Correlation between demographic variables

The findings are visualized using Python libraries including **Pandas**, **Seaborn**, **Matplotlib**, and **NumPy**.
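
As a rough illustration of how the factors above can be derived, the sketch below cleans comma-formatted counts and aggregates them per column and per age group. The column names follow the project script; the rows are invented toy values (not census figures), and the `to_int` helper is a hypothetical name introduced only for this example.

```python
import pandas as pd

# Toy rows invented purely for illustration; these are NOT real census figures.
toy = pd.DataFrame({
    "Age-group": ["0-14", "0-14", "15-59", "15-59"],
    "Total Males": ["1,200", "800", "3,000", "2,500"],
    "Total Females": ["1,100", "900", "2,900", "2,400"],
})


def to_int(series: pd.Series) -> pd.Series:
    """Hypothetical helper: strip thousands separators and return int64 values."""
    cleaned = series.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned).astype("int64")


count_cols = ["Total Males", "Total Females"]
for col in count_cols:
    toy[col] = to_int(toy[col])

# Overall totals per gender, and each gender's percentage share.
gender_totals = toy[count_cols].sum()
gender_share = (gender_totals / gender_totals.sum() * 100).round(1)

# Totals per age group.
by_age = toy.groupby("Age-group")[count_cols].sum()

print(gender_totals)
print(gender_share)
print(by_age)
```

With the real dataset, the same pattern would presumably apply after loading the CSV with `pd.read_csv`; only the list of columns to clean and the grouping key would change.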

---

## 📂 Dataset Source

The dataset is sourced from the [Government of India's Census Portal](https://censusindia.gov.in/nada/index.php/catalog/10717) and contains the following key features:

- State, District, Area Name
- Age Group
- Birthplace
- Total / Male / Female population
- Rural and Urban segmentation

---

## 🔧 EDA Process

The following steps were performed:

- **Data Cleaning:** Handling nulls, fixing inconsistent entries
- **Transformation:** Filtering, type conversions (comma-formatted counts to integers)
- **Aggregation:** Using `groupby()` and `sum()`
- **Visualization:** Donut, pie, bar, treemap, line, violin, and heatmap plots

### 🛠 Libraries Used

- `pandas`
- `numpy`
- `matplotlib`
- `seaborn`
- `squarify`

---

## 📈 Key Analyses & Visualizations

### 1. Gender Distribution
- **Insight:** Slight male dominance, otherwise balanced
- **Visualization:** Donut chart

### 2. Urban vs Rural
- **Insight:** Rural population still forms the majority
- **Visualization:** Bar and pie charts

### 3. Top Birthplaces
- **Insight:** Majority born locally; Uttar Pradesh (UP) leads among birthplaces outside the state of enumeration
- **Visualization:** Treemap

### 4. Age Group Distribution
- **Insight:** Peak in working-age population; decline in senior groups
- **Visualization:** Line plot

### 5. Correlation Matrix
- **Insight:** Strong positive correlation among the population count columns (total, rural, and urban counts of persons, males, and females)
- **Visualization:** Heatmap

### 6. Population by Birthplace
- **Insight:** Birthplace categories contribute unevenly; some show a wider spread of population counts
- **Visualization:** Violin plot

---

## ✅ Conclusion

The project highlighted:

- Significant urban-rural population contrasts
- Balanced but slightly male-skewed gender ratios
- Insightful birthplace patterns suggesting internal migration
- Dominance of the working-age population

EDA proves to be a powerful step in making census data actionable for policy-making.

---

## 🚀 Future Scope

- Predictive Modeling for migration/urban growth
- Geo-mapping with tools like `GeoPandas`
- Interactive Dashboards (Plotly/Tableau)
- Time-series forecasting on census trends
- Machine Learning for clustering and classification

---

## 📚 References

- [Census India Dataset](https://censusindia.gov.in/nada/index.php/catalog/10717)
- [Seaborn](https://seaborn.pydata.org)
- [Matplotlib](https://matplotlib.org)
- [Pandas](https://pandas.pydata.org)

---


--------------------------------------------------------------------------------
/eda project python code.py:
--------------------------------------------------------------------------------
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import squarify

# Load dataset (the path is specific to the author's machine; adjust as needed)
df = pd.read_csv(r"C:\Users\ssard\OneDrive\Documents\Dataset for project.csv")

# Display dataset description
print("📝 Dataset Description:\n")
print(df.describe(include="all"))
print("\n📊 Columns:\n", df.columns)

# Hard-coded labels & sizes for the treemap
labels = [
    "Total Population\n67,151,764",
    "Born within India\n66,333,160",
    "Born in the place of enumeration\n38,688,772",
    "Within the state of enumeration\n40,578,120",
    "States in India\nbeyond the state of enumeration\n25,755,040",
    "Uttar Pradesh\n11,619,444"
]
sizes = [67151764, 66333160, 38688772, 40578120, 25755040, 11619444]
colors = ['#a2d4c5', '#ffffcc', '#ffb3b3', '#d5ccff', '#a3c2c2', '#ffcc99']

# Clean numeric columns: strip thousands separators, then convert to int64.
# astype(str) keeps this working if pandas already parsed a column as numeric;
# int64 avoids overflow when summing on platforms where int defaults to 32 bits.
cols_to_convert = [
    "Total Persons", "Total Males", "Total Females",
    "Rural Persons", "Rural Males", "Rural Females",
    "Urban Persons", "Urban Males", "Urban Females"
]
for col in cols_to_convert:
    cleaned = df[col].astype(str).str.replace(",", "", regex=False)
    df[col] = pd.to_numeric(cleaned).astype("int64")

# Aggregated data
# Note: "Birth place " (with a trailing space) is the column name used throughout.
gender_counts = df[["Total Males", "Total Females"]].sum()
urban_rural = df[["Urban Persons", "Rural Persons"]].sum()
birthplace_counts = df.groupby("Birth place ")["Total Persons"].sum().sort_values(ascending=False)
age_group_dist = df.groupby("Age-group")["Total Persons"].sum().reset_index()
top_birthplaces = birthplace_counts.head(6)
df_birth_top = df[df["Birth place "].isin(top_birthplaces.index)]

# Set seaborn style
sns.set(style="whitegrid")

# 1. Donut Chart - Gender Distribution
plt.figure(figsize=(6, 6))
plt.pie(gender_counts, labels=gender_counts.index, startangle=90,
        autopct="%1.1f%%", colors=["#5DADE2", "#F1948A"], wedgeprops=dict(width=0.4))
plt.title("Gender Distribution", fontsize=16)
plt.show()

# 2. Barplot - Urban vs Rural
plt.figure(figsize=(6, 5))
sns.barplot(x=urban_rural.index, y=urban_rural.values, palette="Accent")
plt.title("Urban vs Rural Population", fontsize=16)
plt.ylabel("Population")
plt.show()

# 3. Treemap - Top Birthplaces
plt.figure(figsize=(12, 6))
squarify.plot(sizes=sizes, label=labels, color=colors, pad=True, text_kwargs={'fontsize': 10})
plt.axis('off')
plt.title("Top 6 Birthplaces by Population", fontsize=16)
plt.show()

# 4. Lineplot - Age Group
plt.figure(figsize=(8, 5))
sns.lineplot(data=age_group_dist, x="Age-group", y="Total Persons", marker="o", color="#E67E22")
plt.title("Age Group Distribution", fontsize=16)
plt.xticks(rotation=45)
plt.show()

# 5. Heatmap - Correlation
plt.figure(figsize=(8, 6))
corr = df[cols_to_convert].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f", cbar_kws={"shrink": 0.8})
plt.title("Demographic Correlation", fontsize=16)
plt.show()

# 6. Violin Plot - Population by Birthplace (Top 6)
plt.figure(figsize=(10, 6))
sns.violinplot(data=df_birth_top, x="Birth place ", y="Total Persons", palette="pastel")
plt.title("Population Distribution in Top Birthplaces", fontsize=16)
plt.xlabel("Birthplace")
plt.ylabel("Total Persons")
plt.xticks(rotation=45)
plt.show()

# 7. Pie Chart - Urban vs Rural Persons
plt.figure(figsize=(6, 6))
plt.pie(urban_rural, labels=urban_rural.index, autopct='%1.1f%%',
        startangle=140, colors=["#82E0AA", "#F5B7B1"])
plt.title("Urban vs Rural Split", fontsize=16)
plt.show()

# 🔚 Combined Grid of 4 Charts (Donut, Bar, Line, Pie)
fig, axs = plt.subplots(2, 2, figsize=(14, 10))

# Gender Donut Chart
axs[0, 0].pie(gender_counts, labels=gender_counts.index, startangle=90,
              autopct="%1.1f%%", colors=["#5DADE2", "#F1948A"], wedgeprops=dict(width=0.4))
axs[0, 0].set_title("Gender Distribution")

# Urban vs Rural Barplot
sns.barplot(x=urban_rural.index, y=urban_rural.values, palette="Accent", ax=axs[0, 1])
axs[0, 1].set_title("Urban vs Rural Population")

# Age Group Line Plot
sns.lineplot(data=age_group_dist, x="Age-group", y="Total Persons", marker="o",
             color="#E67E22", ax=axs[1, 0])
axs[1, 0].set_title("Age Group Distribution")
axs[1, 0].tick_params(axis='x', rotation=45)

# Urban vs Rural Pie Chart
axs[1, 1].pie(urban_rural, labels=urban_rural.index, autopct='%1.1f%%',
              startangle=140, colors=["#82E0AA", "#F5B7B1"])
axs[1, 1].set_title("Urban vs Rural Split")

# Add the overall title first, then reserve space for it so it is not clipped.
# The emoji is dropped from the title because Matplotlib's default font lacks the glyph.
fig.suptitle("Combined Demographic Dashboard", fontsize=18)
fig.tight_layout(rect=(0, 0, 1, 0.96))
plt.show()

--------------------------------------------------------------------------------