├── Class5
├── suppression.png
├── types_of_pii.png
├── quasi_identifiers.png
├── README.MD
├── Differential Privacy Applications.txt
└── Intro to Privacy implications.ipynb
├── Class6
├── Key_Points.jpeg
├── Code_Example.txt
└── README.MD
├── Class1 Orientation
├── iCodeGuru_Introduction.pdf
└── README.MD
├── Class32
└── README.MD
├── Class31
└── README.MD
├── Class22
├── README.MD
├── column-transformer.ipynb
└── house_price_prediction.ipynb
├── Class3
└── README.MD
├── Class34
└── README.MD
├── Class7
└── README.MD
├── Class4
└── README.MD
├── Class28
└── README.MD
├── Class33
└── README.MD
├── Class12
├── README.MD
└── datasets
│ └── homelessness.csv
├── Class27
└── README.MD
├── Class20
└── README.MD
├── Class2
└── README.MD
├── Class23
└── README.MD
├── Class25
└── README.md
├── Class11
├── README.MD
└── datasets
│ └── homelessness.csv
├── Class29
└── README.MD
├── Class30
└── README.MD
├── Class19
├── README.MD
└── standardization_normalization.ipynb
├── Class26
└── README.md
├── Class18
└── README.MD
├── Class10
├── README.MD
└── datasets
│ └── homelessness.csv
├── Class24
└── README.MD
├── Class17
└── README.MD
├── Class8
├── README.MD
└── Code_Example.txt
├── Class9
└── README.MD
├── Class21
├── README.MD
└── placement.csv
├── Class16
└── README.MD
├── Class14
└── README.MD
├── Class15
├── README.MD
└── google_playstore_apps.ipynb
├── Class13
└── README.MD
└── README.md
/Class5/suppression.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ahmadjajja/Machine-Learning_and_its-privacy-implications/HEAD/Class5/suppression.png
--------------------------------------------------------------------------------
/Class5/types_of_pii.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ahmadjajja/Machine-Learning_and_its-privacy-implications/HEAD/Class5/types_of_pii.png
--------------------------------------------------------------------------------
/Class6/Key_Points.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ahmadjajja/Machine-Learning_and_its-privacy-implications/HEAD/Class6/Key_Points.jpeg
--------------------------------------------------------------------------------
/Class5/quasi_identifiers.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ahmadjajja/Machine-Learning_and_its-privacy-implications/HEAD/Class5/quasi_identifiers.png
--------------------------------------------------------------------------------
/Class1 Orientation/iCodeGuru_Introduction.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Ahmadjajja/Machine-Learning_and_its-privacy-implications/HEAD/Class1 Orientation/iCodeGuru_Introduction.pdf
--------------------------------------------------------------------------------
/Class32/README.MD:
--------------------------------------------------------------------------------
1 | ## _August 22, 2024 | Week 7| Day 4_
2 |
3 | **Contents covered include:**
4 |
5 | - Took a day off for independent exploration of this Kaggle competition dataset:
6 |
7 | [Kaggle Playground Series S4E8](https://www.kaggle.com/competitions/playground-series-s4e8/overview)
8 |
--------------------------------------------------------------------------------
/Class31/README.MD:
--------------------------------------------------------------------------------
1 | ## _August 21, 2024 | Week 7| Day 3_
2 |
3 | **Contents covered include:**
4 |
5 | - EDA of Kaggle mushroom dataset
6 | - Handling inconsistent values
7 | - [Class 31 Video Link](https://web.facebook.com/iCodeguru/videos/1202049024146854)
8 |
--------------------------------------------------------------------------------
/Class6/Code_Example.txt:
--------------------------------------------------------------------------------
1 | This was the code:
2 | import numpy as np
3 |
4 | # Creating a list with mixed data types, including a dictionary
5 | mixed_list = [1, 2.5, '3', {'key': 'value'}, 4.7, 5]
6 |
7 | # Creating a NumPy array from the mixed-type list; NumPy falls back to dtype=object
8 | arr = np.array(mixed_list)
9 |
10 | print(arr)  # [1 2.5 '3' {'key': 'value'} 4.7 5]
--------------------------------------------------------------------------------
/Class22/README.MD:
--------------------------------------------------------------------------------
1 | ## _August 08, 2024 | Week 5 | Day 4_
2 |
3 | **Contents covered include:**
4 |
5 | - Feature Transformation
6 | - Feature Construction
7 | - Feature Selection
8 | - Feature Extraction
9 |
10 | - [Class 22 Video Link](https://www.facebook.com/iCodeguru/videos/1018825216109320)
11 |
--------------------------------------------------------------------------------
/Class3/README.MD:
--------------------------------------------------------------------------------
1 | ## _July 10, 2024 | Week 1 | Day 3_
2 |
3 | **Contents covered include:**
4 |
5 | - Revision of previous topics to maintain the flow
6 | - Importance of Machine Learning in the Generative AI era (optional)
7 |
8 | [Class 3 Video Link](https://www.facebook.com/iCodeguru/videos/380675287961626)
9 |
--------------------------------------------------------------------------------
/Class34/README.MD:
--------------------------------------------------------------------------------
1 | ## _August 26, 2024 | Week 8| Day 1_
2 |
3 | **Contents covered include:**
4 |
5 | - Handling Outliers
6 | - Impact Analysis of Outliers
7 | - Machine Learning Pipeline for Binary Class Prediction
8 |
9 | [Class 34 Video Link](https://www.facebook.com/iCodeguru/videos/1834705580352310/)
10 |
--------------------------------------------------------------------------------
/Class7/README.MD:
--------------------------------------------------------------------------------
1 | ## _July 18, 2024 | Week 2 | Day 2_
2 |
3 | **Contents covered include:**
4 |
5 | - Typing Speed Test
6 | - Intro to Google Colab
7 | - Pandas Practical Implementation
8 | - General Questions/Answers
9 |
10 | - [Class 7 Video Link](https://www.facebook.com/iCodeguru/videos/798917849033959)
11 |
--------------------------------------------------------------------------------
/Class5/README.MD:
--------------------------------------------------------------------------------
1 | ## _July 12, 2024 | Week 1 | Day 5_
2 |
3 | **Contents covered include:**
4 |
5 | - Applications of Differential Privacy in ML
6 | - Types of PII and re-identification attacks
7 | - Anonymization techniques
8 |
9 | [Class 5 Video Link](https://www.facebook.com/iCodeguru/videos/396304176791613)
10 |
--------------------------------------------------------------------------------
/Class4/README.MD:
--------------------------------------------------------------------------------
1 | ## _July 11, 2024 | Week 1 | Day 4_
2 |
3 | **Contents covered include:**
4 |
5 | - Terminologies in ML.
6 | - ML Workflow.
7 | - Regression vs. classification.
8 | - Privacy in ML.
9 | - Differential Privacy.
10 |
11 | [Class 4 Video Link](https://www.facebook.com/iCodeguru/videos/489366656910032)
12 |
--------------------------------------------------------------------------------
/Class28/README.MD:
--------------------------------------------------------------------------------
1 | ## _August 16, 2024 | Week 6 | Day 5_
2 |
3 | **Contents covered include:**
4 |
5 | - Exploring Kaggle competition
6 | - Kaggle poisonous mushroom project overview
7 | - Nature of machine learning and generative AI jobs on Upwork
8 | - [Class 28 Video Link](https://web.facebook.com/iCodeguru/videos/1657940491631428)
9 |
--------------------------------------------------------------------------------
/Class33/README.MD:
--------------------------------------------------------------------------------
1 | ## _August 23, 2024 | Week 7| Day 5_
2 |
3 | **Contents covered include:**
4 |
5 | * Data preprocessing of mushroom dataset
6 | * Missing value, inconsistent values treatment
7 | * Handling outliers and removing them
8 |
9 | [Class 33 Video Link](https://web.facebook.com/iCodeguru/videos/1013797100283976)
10 |
--------------------------------------------------------------------------------
/Class6/README.MD:
--------------------------------------------------------------------------------
1 | ## _July 15, 2024 | Week 2 | Day 1_
2 |
3 | **Contents covered include:**
4 |
5 | - Overview of Python Programming
6 | - Discuss libraries for Data Analysis
7 | - Pandas
8 | - NumPy
9 | - General Questions/Answers
10 |
11 | [Class 6 Video Link](https://www.facebook.com/iCodeguru/videos/296444603534700)
12 |
--------------------------------------------------------------------------------
/Class12/README.MD:
--------------------------------------------------------------------------------
1 | ## _July 25, 2024 | Week 3 | Day 4_
2 |
3 | **Contents covered include:**
4 |
5 | - Advanced Pandas techniques
6 | - Multi-level indexing
7 | - Data filtering
8 | - Comparison with Excel and CSV
9 | - Github copilot
10 |
11 | * [Class 12 Video Link](https://www.facebook.com/iCodeguru/videos/544365754583575/)
12 |
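13 | A minimal sketch of multi-level indexing and filtering on the bundled `datasets/homelessness.csv`:
14 |
15 | ```python
16 | import pandas as pd
17 |
18 | df = pd.read_csv("datasets/homelessness.csv", index_col=0)
19 |
20 | # Build a two-level index on (region, state)
21 | dfm = df.set_index(["region", "state"]).sort_index()
22 |
23 | # Outer level: all Pacific states; both levels: one exact row
24 | print(dfm.loc["Pacific"])
25 | print(dfm.loc[("Pacific", "Alaska")])
26 | ```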
--------------------------------------------------------------------------------
/Class27/README.MD:
--------------------------------------------------------------------------------
1 | ## _August 15, 2024 | Week 6 | Day 4_
2 |
3 | **Contents covered include:**
4 |
5 | - Applying classification models using scikit-learn.
6 | - Using graphviz to visualize decision trees.
7 | - Saving trained models using joblib and pickle.
8 | - [Class 27 Video Link](https://www.facebook.com/iCodeguru/videos/1197321291583796)
9 |
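10 | A minimal sketch of saving and reloading a trained model with joblib (the tiny model here is only a stand-in):
11 |
12 | ```python
13 | import joblib
14 | from sklearn.tree import DecisionTreeClassifier
15 |
16 | # Any fitted scikit-learn estimator works the same way
17 | model = DecisionTreeClassifier().fit([[0], [1]], [0, 1])
18 |
19 | joblib.dump(model, "model.joblib")      # persist to disk
20 | restored = joblib.load("model.joblib")  # reload later for predictions
21 | print(restored.predict([[1]]))
22 | ```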
--------------------------------------------------------------------------------
/Class20/README.MD:
--------------------------------------------------------------------------------
1 | ## _August 06, 2024 | Week 5 | Day 2_
2 |
3 | **Contents covered include:**
4 |
5 | - Supervised machine learning
6 | - Regression and classification algorithms
7 | - Random forest and decision trees working
8 | - Resampling and Bagging
9 | - [Class 20 Video Link](https://www.facebook.com/iCodeguru/videos/1323471588452378)
10 |
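11 | A minimal sketch contrasting a single decision tree with a random forest, whose trees are each trained on a bootstrap resample of the rows (bagging); the data here is synthetic:
12 |
13 | ```python
14 | from sklearn.datasets import make_classification
15 | from sklearn.model_selection import train_test_split
16 | from sklearn.tree import DecisionTreeClassifier
17 | from sklearn.ensemble import RandomForestClassifier
18 |
19 | X, y = make_classification(n_samples=300, n_features=8, random_state=0)
20 | X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
21 |
22 | tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
23 | forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
24 |
25 | # The ensemble usually generalizes better than any single tree
26 | print(tree.score(X_te, y_te), forest.score(X_te, y_te))
27 | ```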
--------------------------------------------------------------------------------
/Class2/README.MD:
--------------------------------------------------------------------------------
1 | ## _July 9, 2024 | Week 1 | Day 2_
2 |
3 | **Contents covered include:**
4 |
5 | - Applications of Machine Learning
6 | - Understanding Parameters and Hyperparameters and their differences
7 | - Stages of machine learning in production
8 |
9 | [Class 2 Video Link](https://www.facebook.com/iCodeguru/videos/1172562033868588/)
10 |
--------------------------------------------------------------------------------
/Class23/README.MD:
--------------------------------------------------------------------------------
1 | ## _August 09, 2024 | Week 5 | Day 5_
2 |
3 | **Contents covered include:**
4 |
5 | - Recap of Feature Engineering Concepts
6 | - Analysis of Learning Approaches
7 | - Introduction to Big Data Tools
8 | - Acknowledgment and Resources
9 |
10 | - [Class 23 Video Link](https://www.facebook.com/iCodeguru/videos/544964651216277)
11 |
--------------------------------------------------------------------------------
/Class25/README.md:
--------------------------------------------------------------------------------
1 | ## _August 13, 2024 | Week 6 | Day 2_
2 |
3 | **Contents covered include:**
4 |
5 | * Classification Metrics:
6 |   * Confusion matrix
7 |   * Precision
8 |   * Recall
9 |   * F1 score
10 | * Kaggle binary mushroom classification data overview
11 |
12 | [Class 25 Video Link](https://web.facebook.com/iCodeguru/videos/1930278197423471/)
13 |
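14 | A minimal sketch of the metrics above with scikit-learn (toy labels, not the mushroom data):
15 |
16 | ```python
17 | from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
18 |
19 | y_true = [1, 0, 1, 1, 0, 1]
20 | y_pred = [1, 0, 0, 1, 1, 1]
21 |
22 | print(confusion_matrix(y_true, y_pred))  # rows: actual class, columns: predicted class
23 | print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 0.75
24 | print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 0.75
25 | print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
26 | ```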
--------------------------------------------------------------------------------
/Class11/README.MD:
--------------------------------------------------------------------------------
1 | ## _July 24, 2024 | Week 3 | Day 3_
2 |
3 | **Contents covered include:**
4 | - Applied Pandas functions on the homelessness and sales datasets
5 | - Using operators for filtering data
6 | - isin()
7 | - groupby()
8 | - unique()
9 | - value_counts()
10 | - QnA
11 |
12 | - [Class 11 Video Link](https://www.facebook.com/iCodeguru/videos/2512086508988678/)
13 |
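14 | A short sketch of the functions above on the bundled `datasets/homelessness.csv` (column names taken from that file):
15 |
16 | ```python
17 | import pandas as pd
18 |
19 | df = pd.read_csv("datasets/homelessness.csv", index_col=0)
20 |
21 | # Comparison operators and isin() for filtering rows
22 | big = df[df["individuals"] > 10000]
23 | coastal = df[df["region"].isin(["Pacific", "New England"])]
24 |
25 | # groupby() with an aggregation
26 | print(df.groupby("region")["individuals"].sum())
27 |
28 | # unique() and value_counts()
29 | print(df["region"].unique())
30 | print(df["region"].value_counts())
31 | ```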
--------------------------------------------------------------------------------
/Class29/README.MD:
--------------------------------------------------------------------------------
1 | ## _August 19, 2024 | Week 7| Day 1_
2 |
3 | **Contents covered include:**
4 |
5 | - Difference between synthetic and augmented data
6 | - Quick machine learning workflow overview
7 | - Bias and variance in machine learning
8 | - Reasons for overfitting and underfitting
9 | - [Class 29 Video Link](https://web.facebook.com/iCodeguru/videos/3763928390593592)
10 |
--------------------------------------------------------------------------------
/Class30/README.MD:
--------------------------------------------------------------------------------
1 | ## _August 20, 2024 | Week 7| Day 2_
2 |
3 | **Contents covered include:**
4 |
5 | - Difference between synthetic and augmented data
6 | - Quick machine learning workflow overview
7 | - Bias and variance in machine learning
8 | - Reasons for overfitting and underfitting
9 | - [Class 30 Video Link](https://web.facebook.com/iCodeguru/videos/4196521850634064)
10 |
--------------------------------------------------------------------------------
/Class19/README.MD:
--------------------------------------------------------------------------------
1 | ## _August 05, 2024 | Week 5 | Day 1_
2 |
3 | **Contents covered include:**
4 |
5 | - Introduction to Data Preprocessing
6 | - Understanding Standardization and Normalization
7 | - Practical Example with Python Code
8 | - Overfitting and Underfitting
9 | - Kaggle Competitions as a Learning Platform
10 |
11 | - [Class 19 Video Link](https://www.facebook.com/iCodeguru/videos/876352897725102/)
12 |
--------------------------------------------------------------------------------
/Class26/README.md:
--------------------------------------------------------------------------------
1 | ## _August 14, 2024 | Week 6 | Day 3_
2 |
3 | **Contents covered include:**
4 |
5 | * Discussion about GenAI
6 | * Pretrained models available on hugging face and required resources
7 | * Kaggle [mushroom dataset](https://www.kaggle.com/competitions/playground-series-s4e8/overview) overview and data exploration
8 | * [Class 26 Video Link](https://www.facebook.com/iCodeguru/videos/1243668773297682)
9 |
--------------------------------------------------------------------------------
/Class18/README.MD:
--------------------------------------------------------------------------------
1 | ## _August 02, 2024 | Week 4 | Day 5_
2 |
3 | **Contents covered include:**
4 |
5 | - Data Collection and Anonymization
6 | - Feature Engineering and Feature Selection
7 | - Data Cleaning and Imputation
8 |
9 | - [Google Colab Notebook Link](https://colab.research.google.com/drive/1mF2JiPXiqlc0bmZd02j8UNWjuC_LCdox?usp=sharing)
10 |
11 | - [Class 18 Video Link](https://www.facebook.com/iCodeguru/videos/782806057093677)
12 |
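13 | A minimal sketch of mean imputation for missing values (a hypothetical two-column DataFrame):
14 |
15 | ```python
16 | import numpy as np
17 | import pandas as pd
18 | from sklearn.impute import SimpleImputer
19 |
20 | df = pd.DataFrame({"age": [25.0, np.nan, 40.0],
21 |                    "income": [50.0, 60.0, np.nan]})
22 |
23 | # Replace each missing value with its column's mean
24 | print(SimpleImputer(strategy="mean").fit_transform(df))
25 | ```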
--------------------------------------------------------------------------------
/Class10/README.MD:
--------------------------------------------------------------------------------
1 | ## _July 23, 2024 | Week 3 | Day 2_
2 |
3 | **Contents covered include:**
4 |
5 | - Applied Pandas functions on the homelessness dataset
6 | - index_col
7 | - Parts of dataframe
8 | - Sorting (single columns, multiple columns, multiple column different directions)
9 | - Subsetting
10 | - Grouping and aggregating
11 | - Data manipulation
12 | - Filtering values
13 | - QnA
14 |
15 | - [Class 10 Video Link](https://www.facebook.com/iCodeguru/videos/3279815862154411)
16 |
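17 | A short sketch of sorting, subsetting, and filtering on the bundled `datasets/homelessness.csv`:
18 |
19 | ```python
20 | import pandas as pd
21 |
22 | df = pd.read_csv("datasets/homelessness.csv", index_col=0)
23 |
24 | # Sorting: single column, then multiple columns in different directions
25 | print(df.sort_values("individuals").head())
26 | print(df.sort_values(["region", "individuals"], ascending=[True, False]).head())
27 |
28 | # Subsetting columns and filtering rows
29 | print(df[df["state_pop"] > 10_000_000][["state", "individuals"]])
30 | ```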
--------------------------------------------------------------------------------
/Class24/README.MD:
--------------------------------------------------------------------------------
1 | ## _August 12, 2024 | Week 6 | Day 1_
2 |
3 | **Contents covered include:**
4 |
5 | - Introduction to Model Evaluation and Selection
6 | - Understanding Evaluation Metrics (ML Matrices)
7 | - Classification Metrics in Depth
8 | - Best Model Selection
9 | - Using the Zip Function in Machine Learning
10 |
11 | - [Article Link](https://www.analyticsvidhya.com/blog/2021/07/metrics-to-evaluate-your-classification-model-to-take-the-right-decisions/)
12 | - [Class 24 Video Link](https://www.facebook.com/iCodeguru/videos/1168843607708830)
13 |
--------------------------------------------------------------------------------
/Class17/README.MD:
--------------------------------------------------------------------------------
1 | ## _August 01, 2024 | Week 4 | Day 4_
2 |
3 | **Contents covered include:**
4 |
5 | - How Does Differential Privacy Work?
6 | - Metric for Privacy Loss
7 | - Privacy Budget
8 |
9 | - [White Board Link](https://wbd.ms/share/v2/aHR0cHM6Ly93aGl0ZWJvYXJkLm1pY3Jvc29mdC5jb20vYXBpL3YxLjAvd2hpdGVib2FyZHMvcmVkZWVtLzBjZThmYTdhMDk2MDQ0MTJiMmJkZGQ0YTViYjVjYmEwX0JCQTcxNzYyLTEyRTAtNDJFMS1CMzI0LTVCMTMxRjQyNEUzRF84MjcwZGRlMi0yMWE4LTQ2M2QtOGQ1MC05OWQ4YTRiMjk1NGY=)
10 |
11 | - [Class 17 Video Link](https://web.facebook.com/iCodeguru/videos/1007156847751102)
12 |
--------------------------------------------------------------------------------
/Class8/README.MD:
--------------------------------------------------------------------------------
1 | ## _July 19, 2024 | Week 2 | Day 3_
2 |
3 | **Contents covered include:**
4 |
5 | - Attributes of Pandas:
6 |
7 | - Importing pandas and creating a dataframe
8 | - Creating dataframe
9 | - Head and Tail
10 | - Info
11 | - Describe
12 | - Selecting Columns
13 | - Selecting Rows
14 | - Filtering Data
15 | - Adding a New Column
16 | - Dropping a Column
17 | - Merging DataFrames
18 | - Handling Missing Values
19 |
20 | - General Questions/Answers
21 |
22 | [Class 8 Video Link](https://www.facebook.com/iCodeguru/videos/506803148528787)
23 |
--------------------------------------------------------------------------------
/Class9/README.MD:
--------------------------------------------------------------------------------
1 | ## _July 22, 2024 | Week 3 | Day 1_
2 |
3 | **Contents covered include:**
4 |
5 | - Llama 3 hackathon discussion
6 | - Numpy basics
7 |
8 | 1. Creating Arrays
9 | 2. Array of Zeros and Ones
10 | 3. Range of Numbers
11 | 4. Reshaping Arrays
12 | 5. Basic Operations (Add, subtract, multiply, divide)
13 | 6. Array Statistics
14 | 7. Indexing and Slicing
15 | 8. Random Numbers
16 | 9. Identity Matrix
17 | 10. Flatten
18 | 11. Transpose
19 | 12. Concatenate
20 | 13. Unique
21 | 14. Sum along axis
22 |
23 | - [Class 9 Video Link](https://www.facebook.com/iCodeguru/videos/503766618767054)
24 |
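25 | A short sketch of a few of the operations listed above:
26 |
27 | ```python
28 | import numpy as np
29 |
30 | a = np.arange(6).reshape(2, 3)   # range of numbers, reshaped to 2x3
31 | print(a + 10)                    # basic elementwise operation
32 | print(a.mean(), a.sum(axis=0))   # array statistics; sum along an axis
33 | print(a.T.flatten())             # transpose, then flatten
34 | print(np.unique([1, 2, 2, 3]))   # unique values
35 | ```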
--------------------------------------------------------------------------------
/Class21/README.MD:
--------------------------------------------------------------------------------
1 | ## _August 07, 2024 | Week 5 | Day 3_
2 |
3 | **Contents covered include:**
4 |
5 | - Introduction to Machine Learning Workflow
6 | - Preprocessing and Exploratory Data Analysis (EDA)
7 | - Extracting Input and Output Columns
8 | - Scaling the Data
9 | - Train-Test Split
10 | - Training the Model
11 | - Evaluating the Model
12 | - Model Selection
13 | - Deploying the Model
14 | - Importance of Scikit-Learn
15 | - Introduction to Hyperparameter Tuning
16 |
17 | - [Scikit-Learn](https://scikit-learn.org/stable/)
18 |
19 | - [Class 21 Video Link](https://www.facebook.com/iCodeguru/videos/413629985029427)
20 |
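21 | A minimal end-to-end sketch of this workflow on the bundled `placement.csv` (the model choice here is illustrative):
22 |
23 | ```python
24 | import pandas as pd
25 | from sklearn.model_selection import train_test_split
26 | from sklearn.preprocessing import StandardScaler
27 | from sklearn.linear_model import LogisticRegression
28 | from sklearn.metrics import accuracy_score
29 |
30 | df = pd.read_csv("placement.csv", index_col=0)
31 |
32 | # Extract input and output columns
33 | X, y = df[["cgpa", "iq"]], df["placement"]
34 |
35 | # Train-test split, then scale (fit the scaler on training data only)
36 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
37 | scaler = StandardScaler()
38 | X_train = scaler.fit_transform(X_train)
39 | X_test = scaler.transform(X_test)
40 |
41 | # Train and evaluate the model
42 | model = LogisticRegression().fit(X_train, y_train)
43 | print(accuracy_score(y_test, model.predict(X_test)))
44 | ```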
--------------------------------------------------------------------------------
/Class1 Orientation/README.MD:
--------------------------------------------------------------------------------
1 | *July 8, 2024 | Week 1 | Day 1*
2 | ---
3 | # Agenda
4 |
5 | This first class covered the following:
6 |
7 | - Introduced new participants to the iCodeGuru platform.
8 | - Shared our working agenda, workflow, success stories, and course outline.
9 | - Guided participants on maximizing their benefits as part of the community.
10 | - Addressed all participant questions.
11 | - Launched the session with an introduction to machine learning and its distinction from traditional programming with examples.
12 |
13 |
14 |
15 | [Class 1 Video Link](https://www.facebook.com/iCodeguru/videos/1140906600544854)
16 |
--------------------------------------------------------------------------------
/Class5/Differential Privacy Applications.txt:
--------------------------------------------------------------------------------
1 |
2 | Privacy Applications in Machine Learning:
3 |
4 | 1. Federated learning
5 |
6 | - Training models on users' devices without transferring raw data to a central server (e.g., Google's Gboard keyboard)
7 |
8 | 2. Secure Multi-Party Computation
9 |
10 | - Example: multiple banks collaborate on fraud detection without sharing any individual
11 | user's data with one another, preserving user privacy
12 |
13 | 3. Homomorphic Encryption:
14 |
15 | - In healthcare, patient data is encrypted before being sent to a third party, which
16 | performs the analysis directly on the encrypted data; at the end, the data owner
17 | decrypts the results, preserving patient privacy.
18 |
19 | 4. Privacy-Preserving Data Publishing:
20 |
21 | - Publishing users' data (e.g., from social media) without revealing their identities.
--------------------------------------------------------------------------------
/Class16/README.MD:
--------------------------------------------------------------------------------
1 | ## _July 31, 2024 | Week 4 | Day 3_
2 |
3 | **Contents covered include:**
4 |
5 | - Introduction to Data Visualization
6 |
7 | - The importance of data visualization for interpreting data effectively
8 | - Making data-driven decisions with visual insights
9 |
10 | - Data Visualization using Plotly
11 |
12 | - Creating interactive and informative visualizations with Plotly
13 | - Tools and features of Plotly for enhanced visual storytelling
14 |
15 | - Bar Charts
16 |
17 | - Understanding the importance and application of bar charts in data analysis
18 | - Practical examples and use cases
19 |
20 | - Exploring Data-to-Viz Website
21 |
22 | - Visiting [Data-to-Viz](https://lnkd.in/dQ4rAq2D) for guidance on selecting appropriate graph types
23 | - Applying the right visualization for various data scenarios
24 |
25 | - [Class 16 Video Link](https://web.facebook.com/iCodeguru/videos/517459164056447)
26 |
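27 | A minimal sketch of an interactive bar chart with Plotly (hypothetical category counts):
28 |
29 | ```python
30 | import plotly.express as px
31 |
32 | fig = px.bar(x=["A", "B", "C"], y=[10, 25, 17],
33 |              labels={"x": "category", "y": "count"})
34 | fig.show()  # opens an interactive chart in the notebook/browser
35 | ```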
--------------------------------------------------------------------------------
/Class14/README.MD:
--------------------------------------------------------------------------------
1 | ## _July 29, 2024 | Week 4 | Day 1_
2 |
3 | **Contents covered include:**
4 |
5 | - One Hot Encoding
6 |
7 | - Revision of one hot encoding
8 | - Importance of converting categorical variables into numerical form
9 | - Application in machine learning algorithms
10 |
11 | - Handling Outliers
12 |
13 | - Definition and examples of outliers
14 | - Impact of outliers on data analysis and machine learning models
15 | - Methods for identifying and handling outliers
16 |
17 | - Boxplot Analysis
18 |
19 | - Introduction to boxplots
20 | - Visualizing outliers using boxplots
21 | - Key information from boxplots:
22 | - Interquartile Range (IQR)
23 | - Median
24 | - Understanding the distribution and variability of data
25 |
26 | - Summary and Q&A
27 |
28 | - Recap of key points covered in the session
29 | - Open floor for questions and clarifications
30 |
31 | - [Class 14 Video Link](https://www.facebook.com/iCodeguru/videos/364592216474021)
32 |
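33 | A minimal sketch of the IQR rule that boxplots use to flag outliers (toy data):
34 |
35 | ```python
36 | import pandas as pd
37 |
38 | s = pd.Series([11, 12, 12, 13, 12, 11, 14, 13, 95])  # 95 is an obvious outlier
39 |
40 | q1, q3 = s.quantile(0.25), s.quantile(0.75)
41 | iqr = q3 - q1  # interquartile range
42 | lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
43 |
44 | print(s[(s < lower) | (s > upper)])  # flags the value 95
45 | ```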
--------------------------------------------------------------------------------
/Class15/README.MD:
--------------------------------------------------------------------------------
1 | ## _July 30, 2024 | Week 4 | Day 2_
2 |
3 | **Contents covered include:**
4 |
5 | - Creating an Environment with Miniconda
6 |
7 | - Importance of setting up a dedicated environment for projects
8 | - Managing dependencies effectively with Miniconda
9 |
10 | - Project Documentation
11 |
12 | - Importance of detailing project documentation within the notebook
13 | - Ensuring clarity and reproducibility
14 |
15 | - Python Kernel and Dependencies
16 |
17 | - Understanding the importance of the Python kernel
18 | - Managing dependencies to ensure smooth notebook operation
19 |
20 | - Loading Notebooks on Kaggle
21 |
22 | - Steps to load and run notebooks on Kaggle.com
23 | - Benefits of sharing and collaborating on data projects on Kaggle
24 |
25 | - Data Visualization
26 |
27 | - Using Plotly and other libraries for data visualization
28 | - Emphasizing the power of visual storytelling in data analysis
29 |
30 | - [Class 15 Video Link](https://www.facebook.com/iCodeguru/videos/1153940089225901)
31 |
--------------------------------------------------------------------------------
/Class8/Code_Example.txt:
--------------------------------------------------------------------------------
1 | This was the code:
2 |
3 | # Importing pandas and creating a dataframe
4 | import pandas as pd
5 | # creating dataframe
6 | data = {
7 | 'Name' : ['Ahmad', 'Ali', 'Hamza'],
8 | 'Age' : [22, 21, 20],
9 | 'City': ['FSD', "LHR", "ISB"]
10 | }
11 |
12 | df = pd.DataFrame(data)
13 | print(df)
14 | print("______________________")
15 |
16 | # Head and Tail
17 | print(df.head(2))
18 | print(df.tail(2))
19 |
20 | # Info
21 | print(df.info())
22 |
23 |
24 | # Describe
25 | print(df.describe())
26 |
27 | # Selecting Columns
28 | print(df[['Age', 'Name']])
29 |
30 | # Selecting Rows
31 |
32 | print(df.loc[1:2])
33 | print("__________")
34 | print(df.iloc[1:3])
35 |
36 | # Filtering Data
37 | print(df[df['Age'] >= 21])
38 |
39 | # Adding a New Column
40 | df['Degree'] = ['BSCS', "BSSE", "BSIT"]
41 |
42 | print(df)
43 | print("_____________")
44 |
45 |
46 | # Dropping a Column
47 | df = df.drop("Degree", axis=1)
48 | print(df)
49 |
50 | # Merging DataFrames
51 | data2 = {
52 | 'City': ['FSD', "LHR", "ISB"],
53 | 'population': [876543, 87654, 43254]
54 | }
55 |
56 | df2 = pd.DataFrame(data2)
57 |
58 | merged_df = pd.merge(df, df2, on='City')
59 | print(merged_df)
60 |
61 |
62 | # Handling Missing Values
63 | data3 = {
64 | 'City': ['FSD', None, "ISB"],
65 | 'population': [None, 87654, 43254]
66 | }
67 |
68 | df3 = pd.DataFrame(data3)
69 | print(df3)
70 | print("__________________")
71 |
72 | df3_filled = df3.fillna("UnKnown")
73 | print(df3_filled)
74 | print("__________________")
75 |
76 |
77 | df3_dropped = df3.dropna()
78 | print(df3_dropped)
--------------------------------------------------------------------------------
/Class13/README.MD:
--------------------------------------------------------------------------------
1 | ## _July 26, 2024 | Week 3 | Day 5_
2 |
3 | **Contents covered include:**
4 |
5 | - Introduction to Data Processing Techniques
6 |
7 | - Overview of data processing importance in machine learning
8 | - Goals for today's session
9 |
10 | - Data Visualization & Cleaning
11 |
12 | - Importance of visualizing data
13 | - Tools and methods for data visualization
14 | - Bar charts
15 | - Heatmaps
16 | - Missing value matrix (`msno.matrix(df)`)
17 | - Data cleaning techniques
18 | - Handling missing values
19 | - Example: Filling missing values in 'workclass' column
20 |
21 | - Handling Categorical Features
22 |
23 | - Understanding categorical data
24 | - Textual representation
25 | - Numerical representation
26 | - Techniques for encoding categorical data
27 | - Label Encoding
28 | - One-Hot Encoding
29 | - Hashing
30 |
31 | - Example: Encoding "Date" Features
32 |
33 | - Textual representations of dates
34 | - Day of the week
35 | - Month
36 | - Season
37 | - Numerical representations of dates
38 | - Day of the week as integers
39 | - Month as integers
40 |
41 | - Identifying & Encoding Categorical Columns
42 |
43 | - Identifying categorical columns in a DataFrame
44 | - One-Hot Encoding process
45 | - Example: One-hot encoding categorical columns in a DataFrame
46 | - Handling rare categories (optional)
47 |
48 | - Importance of One-Hot Encoding
49 |
50 | - Benefits of converting categorical variables into numerical format
51 | - Avoiding ordinal relationship pitfalls with one-hot encoding
52 |
53 | - Summary and Q&A
54 |
55 | - Recap of key points covered in the session
56 | - Open floor for questions and clarifications
57 |
58 | - [Class 13 Video Link](https://www.facebook.com/iCodeguru/videos/842121497495971)
59 |
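60 | A minimal sketch of identifying categorical columns and one-hot encoding them (a hypothetical two-column DataFrame; the 'workclass' column echoes the example above):
61 |
62 | ```python
63 | import pandas as pd
64 |
65 | df = pd.DataFrame({"age": [25, 32, 40],
66 |                    "workclass": ["Private", "State-gov", "Private"]})
67 |
68 | # Identify categorical (object-dtype) columns
69 | cat_cols = df.select_dtypes(include="object").columns
70 |
71 | # One-hot encode: each category becomes its own 0/1 column,
72 | # avoiding any artificial ordinal relationship
73 | print(pd.get_dummies(df, columns=list(cat_cols)))
74 | ```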
--------------------------------------------------------------------------------
/Class21/placement.csv:
--------------------------------------------------------------------------------
1 | ,cgpa,iq,placement
2 | 0,6.8,123.0,1
3 | 1,5.9,106.0,0
4 | 2,5.3,121.0,0
5 | 3,7.4,132.0,1
6 | 4,5.8,142.0,0
7 | 5,7.1,48.0,1
8 | 6,5.7,143.0,0
9 | 7,5.0,63.0,0
10 | 8,6.1,156.0,0
11 | 9,5.1,66.0,0
12 | 10,6.0,45.0,1
13 | 11,6.9,138.0,1
14 | 12,5.4,139.0,0
15 | 13,6.4,116.0,1
16 | 14,6.1,103.0,0
17 | 15,5.1,176.0,0
18 | 16,5.2,224.0,0
19 | 17,3.3,183.0,0
20 | 18,4.0,100.0,0
21 | 19,5.2,132.0,0
22 | 20,6.6,120.0,1
23 | 21,7.1,151.0,1
24 | 22,4.9,120.0,0
25 | 23,4.7,87.0,0
26 | 24,4.7,121.0,0
27 | 25,5.0,91.0,0
28 | 26,7.0,199.0,1
29 | 27,6.0,124.0,1
30 | 28,5.2,90.0,0
31 | 29,7.0,112.0,1
32 | 30,7.6,128.0,1
33 | 31,3.9,109.0,0
34 | 32,7.0,139.0,1
35 | 33,6.0,149.0,0
36 | 34,4.8,163.0,0
37 | 35,6.8,90.0,1
38 | 36,5.7,140.0,0
39 | 37,8.1,149.0,1
40 | 38,6.5,160.0,1
41 | 39,4.6,146.0,0
42 | 40,4.9,134.0,0
43 | 41,5.4,114.0,0
44 | 42,7.6,89.0,1
45 | 43,6.8,141.0,1
46 | 44,7.5,61.0,1
47 | 45,6.0,66.0,1
48 | 46,5.3,114.0,0
49 | 47,5.2,161.0,0
50 | 48,6.6,138.0,1
51 | 49,5.4,135.0,0
52 | 50,3.5,233.0,0
53 | 51,4.8,141.0,0
54 | 52,7.0,175.0,1
55 | 53,8.3,168.0,1
56 | 54,6.4,141.0,1
57 | 55,7.8,114.0,1
58 | 56,6.1,65.0,0
59 | 57,6.5,130.0,1
60 | 58,8.0,79.0,1
61 | 59,4.8,112.0,0
62 | 60,6.9,139.0,1
63 | 61,7.3,137.0,1
64 | 62,6.0,102.0,0
65 | 63,6.3,128.0,1
66 | 64,7.0,64.0,1
67 | 65,8.1,166.0,1
68 | 66,6.9,96.0,1
69 | 67,5.0,118.0,0
70 | 68,4.0,75.0,0
71 | 69,8.5,120.0,1
72 | 70,6.3,127.0,1
73 | 71,6.1,132.0,1
74 | 72,7.3,116.0,1
75 | 73,4.9,61.0,0
76 | 74,6.7,154.0,1
77 | 75,4.8,169.0,0
78 | 76,4.9,155.0,0
79 | 77,7.3,50.0,1
80 | 78,6.1,81.0,0
81 | 79,6.5,90.0,1
82 | 80,4.9,196.0,0
83 | 81,5.4,107.0,0
84 | 82,6.5,37.0,1
85 | 83,7.5,130.0,1
86 | 84,5.7,169.0,0
87 | 85,5.8,166.0,1
88 | 86,5.1,128.0,0
89 | 87,5.7,132.0,1
90 | 88,4.4,149.0,0
91 | 89,4.9,151.0,0
92 | 90,7.3,86.0,1
93 | 91,7.5,158.0,1
94 | 92,5.2,110.0,0
95 | 93,6.8,112.0,1
96 | 94,4.7,52.0,0
97 | 95,4.3,200.0,0
98 | 96,4.4,42.0,0
99 | 97,6.7,182.0,1
100 | 98,6.3,103.0,1
101 | 99,6.2,113.0,1
102 |
--------------------------------------------------------------------------------
/Class10/datasets/homelessness.csv:
--------------------------------------------------------------------------------
1 | ,region,state,individuals,family_members,state_pop
2 | 0,East South Central,Alabama,2570.0,864.0,4887681
3 | 1,Pacific,Alaska,1434.0,582.0,735139
4 | 2,Mountain,Arizona,7259.0,2606.0,7158024
5 | 3,West South Central,Arkansas,2280.0,432.0,3009733
6 | 4,Pacific,California,109008.0,20964.0,39461588
7 | 5,Mountain,Colorado,7607.0,3250.0,5691287
8 | 6,New England,Connecticut,2280.0,1696.0,3571520
9 | 7,South Atlantic,Delaware,708.0,374.0,965479
10 | 8,South Atlantic,District of Columbia,3770.0,3134.0,701547
11 | 9,South Atlantic,Florida,21443.0,9587.0,21244317
12 | 10,South Atlantic,Georgia,6943.0,2556.0,10511131
13 | 11,Pacific,Hawaii,4131.0,2399.0,1420593
14 | 12,Mountain,Idaho,1297.0,715.0,1750536
15 | 13,East North Central,Illinois,6752.0,3891.0,12723071
16 | 14,East North Central,Indiana,3776.0,1482.0,6695497
17 | 15,West North Central,Iowa,1711.0,1038.0,3148618
18 | 16,West North Central,Kansas,1443.0,773.0,2911359
19 | 17,East South Central,Kentucky,2735.0,953.0,4461153
20 | 18,West South Central,Louisiana,2540.0,519.0,4659690
21 | 19,New England,Maine,1450.0,1066.0,1339057
22 | 20,South Atlantic,Maryland,4914.0,2230.0,6035802
23 | 21,New England,Massachusetts,6811.0,13257.0,6882635
24 | 22,East North Central,Michigan,5209.0,3142.0,9984072
25 | 23,West North Central,Minnesota,3993.0,3250.0,5606249
26 | 24,East South Central,Mississippi,1024.0,328.0,2981020
27 | 25,West North Central,Missouri,3776.0,2107.0,6121623
28 | 26,Mountain,Montana,983.0,422.0,1060665
29 | 27,West North Central,Nebraska,1745.0,676.0,1925614
30 | 28,Mountain,Nevada,7058.0,486.0,3027341
31 | 29,New England,New Hampshire,835.0,615.0,1353465
32 | 30,Mid-Atlantic,New Jersey,6048.0,3350.0,8886025
33 | 31,Mountain,New Mexico,1949.0,602.0,2092741
34 | 32,Mid-Atlantic,New York,39827.0,52070.0,19530351
35 | 33,South Atlantic,North Carolina,6451.0,2817.0,10381615
36 | 34,West North Central,North Dakota,467.0,75.0,758080
37 | 35,East North Central,Ohio,6929.0,3320.0,11676341
38 | 36,West South Central,Oklahoma,2823.0,1048.0,3940235
39 | 37,Pacific,Oregon,11139.0,3337.0,4181886
40 | 38,Mid-Atlantic,Pennsylvania,8163.0,5349.0,12800922
41 | 39,New England,Rhode Island,747.0,354.0,1058287
42 | 40,South Atlantic,South Carolina,3082.0,851.0,5084156
43 | 41,West North Central,South Dakota,836.0,323.0,878698
44 | 42,East South Central,Tennessee,6139.0,1744.0,6771631
45 | 43,West South Central,Texas,19199.0,6111.0,28628666
46 | 44,Mountain,Utah,1904.0,972.0,3153550
47 | 45,New England,Vermont,780.0,511.0,624358
48 | 46,South Atlantic,Virginia,3928.0,2047.0,8501286
49 | 47,Pacific,Washington,16424.0,5880.0,7523869
50 | 48,South Atlantic,West Virginia,1021.0,222.0,1804291
51 | 49,East North Central,Wisconsin,2740.0,2167.0,5807406
52 | 50,Mountain,Wyoming,434.0,205.0,577601
53 |
--------------------------------------------------------------------------------
/Class11/datasets/homelessness.csv:
--------------------------------------------------------------------------------
1 | ,region,state,individuals,family_members,state_pop
2 | 0,East South Central,Alabama,2570.0,864.0,4887681
3 | 1,Pacific,Alaska,1434.0,582.0,735139
4 | 2,Mountain,Arizona,7259.0,2606.0,7158024
5 | 3,West South Central,Arkansas,2280.0,432.0,3009733
6 | 4,Pacific,California,109008.0,20964.0,39461588
7 | 5,Mountain,Colorado,7607.0,3250.0,5691287
8 | 6,New England,Connecticut,2280.0,1696.0,3571520
9 | 7,South Atlantic,Delaware,708.0,374.0,965479
10 | 8,South Atlantic,District of Columbia,3770.0,3134.0,701547
11 | 9,South Atlantic,Florida,21443.0,9587.0,21244317
12 | 10,South Atlantic,Georgia,6943.0,2556.0,10511131
13 | 11,Pacific,Hawaii,4131.0,2399.0,1420593
14 | 12,Mountain,Idaho,1297.0,715.0,1750536
15 | 13,East North Central,Illinois,6752.0,3891.0,12723071
16 | 14,East North Central,Indiana,3776.0,1482.0,6695497
17 | 15,West North Central,Iowa,1711.0,1038.0,3148618
18 | 16,West North Central,Kansas,1443.0,773.0,2911359
19 | 17,East South Central,Kentucky,2735.0,953.0,4461153
20 | 18,West South Central,Louisiana,2540.0,519.0,4659690
21 | 19,New England,Maine,1450.0,1066.0,1339057
22 | 20,South Atlantic,Maryland,4914.0,2230.0,6035802
23 | 21,New England,Massachusetts,6811.0,13257.0,6882635
24 | 22,East North Central,Michigan,5209.0,3142.0,9984072
25 | 23,West North Central,Minnesota,3993.0,3250.0,5606249
26 | 24,East South Central,Mississippi,1024.0,328.0,2981020
27 | 25,West North Central,Missouri,3776.0,2107.0,6121623
28 | 26,Mountain,Montana,983.0,422.0,1060665
29 | 27,West North Central,Nebraska,1745.0,676.0,1925614
30 | 28,Mountain,Nevada,7058.0,486.0,3027341
31 | 29,New England,New Hampshire,835.0,615.0,1353465
32 | 30,Mid-Atlantic,New Jersey,6048.0,3350.0,8886025
33 | 31,Mountain,New Mexico,1949.0,602.0,2092741
34 | 32,Mid-Atlantic,New York,39827.0,52070.0,19530351
35 | 33,South Atlantic,North Carolina,6451.0,2817.0,10381615
36 | 34,West North Central,North Dakota,467.0,75.0,758080
37 | 35,East North Central,Ohio,6929.0,3320.0,11676341
38 | 36,West South Central,Oklahoma,2823.0,1048.0,3940235
39 | 37,Pacific,Oregon,11139.0,3337.0,4181886
40 | 38,Mid-Atlantic,Pennsylvania,8163.0,5349.0,12800922
41 | 39,New England,Rhode Island,747.0,354.0,1058287
42 | 40,South Atlantic,South Carolina,3082.0,851.0,5084156
43 | 41,West North Central,South Dakota,836.0,323.0,878698
44 | 42,East South Central,Tennessee,6139.0,1744.0,6771631
45 | 43,West South Central,Texas,19199.0,6111.0,28628666
46 | 44,Mountain,Utah,1904.0,972.0,3153550
47 | 45,New England,Vermont,780.0,511.0,624358
48 | 46,South Atlantic,Virginia,3928.0,2047.0,8501286
49 | 47,Pacific,Washington,16424.0,5880.0,7523869
50 | 48,South Atlantic,West Virginia,1021.0,222.0,1804291
51 | 49,East North Central,Wisconsin,2740.0,2167.0,5807406
52 | 50,Mountain,Wyoming,434.0,205.0,577601
53 |
--------------------------------------------------------------------------------
/Class12/datasets/homelessness.csv:
--------------------------------------------------------------------------------
1 | ,region,state,individuals,family_members,state_pop
2 | 0,East South Central,Alabama,2570.0,864.0,4887681
3 | 1,Pacific,Alaska,1434.0,582.0,735139
4 | 2,Mountain,Arizona,7259.0,2606.0,7158024
5 | 3,West South Central,Arkansas,2280.0,432.0,3009733
6 | 4,Pacific,California,109008.0,20964.0,39461588
7 | 5,Mountain,Colorado,7607.0,3250.0,5691287
8 | 6,New England,Connecticut,2280.0,1696.0,3571520
9 | 7,South Atlantic,Delaware,708.0,374.0,965479
10 | 8,South Atlantic,District of Columbia,3770.0,3134.0,701547
11 | 9,South Atlantic,Florida,21443.0,9587.0,21244317
12 | 10,South Atlantic,Georgia,6943.0,2556.0,10511131
13 | 11,Pacific,Hawaii,4131.0,2399.0,1420593
14 | 12,Mountain,Idaho,1297.0,715.0,1750536
15 | 13,East North Central,Illinois,6752.0,3891.0,12723071
16 | 14,East North Central,Indiana,3776.0,1482.0,6695497
17 | 15,West North Central,Iowa,1711.0,1038.0,3148618
18 | 16,West North Central,Kansas,1443.0,773.0,2911359
19 | 17,East South Central,Kentucky,2735.0,953.0,4461153
20 | 18,West South Central,Louisiana,2540.0,519.0,4659690
21 | 19,New England,Maine,1450.0,1066.0,1339057
22 | 20,South Atlantic,Maryland,4914.0,2230.0,6035802
23 | 21,New England,Massachusetts,6811.0,13257.0,6882635
24 | 22,East North Central,Michigan,5209.0,3142.0,9984072
25 | 23,West North Central,Minnesota,3993.0,3250.0,5606249
26 | 24,East South Central,Mississippi,1024.0,328.0,2981020
27 | 25,West North Central,Missouri,3776.0,2107.0,6121623
28 | 26,Mountain,Montana,983.0,422.0,1060665
29 | 27,West North Central,Nebraska,1745.0,676.0,1925614
30 | 28,Mountain,Nevada,7058.0,486.0,3027341
31 | 29,New England,New Hampshire,835.0,615.0,1353465
32 | 30,Mid-Atlantic,New Jersey,6048.0,3350.0,8886025
33 | 31,Mountain,New Mexico,1949.0,602.0,2092741
34 | 32,Mid-Atlantic,New York,39827.0,52070.0,19530351
35 | 33,South Atlantic,North Carolina,6451.0,2817.0,10381615
36 | 34,West North Central,North Dakota,467.0,75.0,758080
37 | 35,East North Central,Ohio,6929.0,3320.0,11676341
38 | 36,West South Central,Oklahoma,2823.0,1048.0,3940235
39 | 37,Pacific,Oregon,11139.0,3337.0,4181886
40 | 38,Mid-Atlantic,Pennsylvania,8163.0,5349.0,12800922
41 | 39,New England,Rhode Island,747.0,354.0,1058287
42 | 40,South Atlantic,South Carolina,3082.0,851.0,5084156
43 | 41,West North Central,South Dakota,836.0,323.0,878698
44 | 42,East South Central,Tennessee,6139.0,1744.0,6771631
45 | 43,West South Central,Texas,19199.0,6111.0,28628666
46 | 44,Mountain,Utah,1904.0,972.0,3153550
47 | 45,New England,Vermont,780.0,511.0,624358
48 | 46,South Atlantic,Virginia,3928.0,2047.0,8501286
49 | 47,Pacific,Washington,16424.0,5880.0,7523869
50 | 48,South Atlantic,West Virginia,1021.0,222.0,1804291
51 | 49,East North Central,Wisconsin,2740.0,2167.0,5807406
52 | 50,Mountain,Wyoming,434.0,205.0,577601
53 |
--------------------------------------------------------------------------------
/Class5/Intro to Privacy implications.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "\n",
12 | "\n",
13 | "
Friday, July 12, 2024\n",
14 | "Week 1 | Day 5\n"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "**Privacy** is defined as the ability to ensure flows of information that satisfy social and legal norms. \n",
22 | "\n",
23 | "### Personally Identifiable Information\n",
24 | "• **Personally Identifiable Information (PII): **Data that can identify someone, either alone or with other data.\n",
25 | "• **Sensitive PII:** Information like full names, Social Security Numbers, and medical records that could cause harm if disclosed.\n",
26 | "• **Non-Sensitive PII:** Data like gender or zip code that cannot identify someone alone but can be used with other data to do so.\n",
27 | "\n",
28 | "> General Data Protection Regulation (GDPR): A regulation protecting Pll in Europe, ensuring data is processed lawfully and used for specified purposes.\n",
29 | "\n",
30 | "### Anonymization Techniques: Methods to protect PII\n",
31 | "\n",
32 | "**• Attribute Suppression:** Removing entire columns of sensitive data.\n",
33 | "\n",
34 | "**• Record Suppression:** Removing records with unique or sensitive values.\n",
35 | "\n",
36 | "**• Generalization:** Replacing values with more general categories.\n",
37 | "\n",
38 | "**• Pseudonymization:** Replacing sensitive values with fake values.\n",
39 | "\n",
40 | "**• Data masking**\n",
41 | "\n",
42 | "```\n",
43 | "# Example of attribute suppression in pandas:\n",
44 | "clients_df.drop('name', axis=1, inplace=True)\n",
45 | "```\n",
46 | "Understanding these principles helps protect privacy and comply with regulations.\n"
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | ""
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "# Quasi Identifiers:\n",
61 | "\n",
62 | "Quasi Identifiers are attributes in a dataset that can be used to identify a person or an entity, but not uniquely. In other words, they are characteristics that can be linked to an individual, but not with absolute certainty.\n",
63 | "\n",
64 | "Examples of Quasi Identifiers include:\n",
65 | "\n",
66 | "Date of birth\n",
67 | "Zip code\n",
68 | "Gender\n",
69 | "Occupation\n",
70 | "Education level\n",
71 | "\n",
72 | "These attributes can be used in combination with other Quasi Identifiers to re-identify individuals in a dataset, which is a concern in data privacy and anonymization.\n"
73 | ]
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "metadata": {},
78 | "source": [
79 | "\n"
80 | ]
81 | },
82 | {
83 | "cell_type": "markdown",
84 | "metadata": {},
85 | "source": [
86 | "\n"
87 | ]
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "metadata": {},
92 | "source": [
93 | "**Faker** is a popular Python library used to generate fake data. It's often used for testing, data anonymization, and populating databases with sample data.\n",
94 | "\n",
95 | "With Faker, you can generate a wide range of fake data, including:\n",
96 | "\n",
97 | "- Names and addresses\n",
98 | "- Phone numbers and email addresses\n",
99 | "- Text and paragraphs\n",
100 | "- Dates and times\n",
101 | "- Credit card numbers and expiration dates\n",
102 | "- Usernames and passwords\n",
103 | "\n",
104 | "Here's an example of how you can use Faker to generate some fake data:\n",
105 | "```\n",
106 | "from faker import Faker\n",
107 | "\n",
108 | "fake = Faker()\n",
109 | "\n",
110 | "print(fake.name()) # Output: Emily Patel\n",
111 | "print(fake.address()) # Output: 123 Main St, New York, NY 10001\n",
112 | "print(fake.text()) # Output: Lorem ipsum dolor sit amet, consectetur...\n",
113 | "```\n",
114 | "\n",
115 | "Faker supports multiple languages and locales, making it a versatile tool for generating fake data that's relevant to your specific use case."
116 | ]
117 | },
118 | {
119 | "cell_type": "markdown",
120 | "metadata": {},
121 | "source": []
122 | }
123 | ],
124 | "metadata": {
125 | "language_info": {
126 | "name": "python"
127 | }
128 | },
129 | "nbformat": 4,
130 | "nbformat_minor": 2
131 | }
132 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Machine Learning from Scratch & Its Privacy Implications
2 |
3 | ## Dates: July 8 to August 27
4 |
5 | This course aims to teach the fundamentals of machine learning from scratch while also addressing the privacy implications at each step of the process. The curriculum is designed to provide a comprehensive understanding of machine learning techniques and their privacy considerations.
6 |
7 | ## Table of Contents
8 |
9 | - [Trainers](#trainers)
10 | - [Moderators](#moderators)
11 | - [Prerequisites](#prerequisites)
12 | - [Course Outline](#course-outline)
13 |
14 | ## Trainers
15 |
16 |
17 | - Ahmad Jajja
18 | - Asjad Ali
19 | - Zartashia Afzal
20 |
44 | ## Moderators
45 |
46 |
47 | - Mahnoor Malik
48 | - Muhammad Arham
49 | - Sheraz Anwar
50 | - Sikander Nawaz
51 |
80 | ## Prerequisites
81 |
82 | - There are no prerequisites to join this course. You'll learn from zero to advanced level.
83 |
84 | ## Course Outline
85 |
86 | ### Module 1: Introduction to Machine Learning
87 |
88 | - **What is Machine Learning?**
89 | - **Applications of Machine Learning**
90 | - **Machine Learning Development Life Cycle (MLDLC)**
91 | - **Importance of Machine Learning in the Generative AI Era (Optional)**
92 | - **Introduction to Differential Privacy (DP)**
93 | - **Definition and Importance**
94 | - [Class 1 Video Link](https://www.facebook.com/iCodeguru/videos/1140906600544854)
95 | - [Class 2 Video Link](https://www.facebook.com/iCodeguru/videos/1172562033868588/)
96 | - [Class 3 Video Link](https://www.facebook.com/iCodeguru/videos/380675287961626)
97 | - [Class 4 Video Link](https://www.facebook.com/iCodeguru/videos/489366656910032)
98 | - [Class 5 Video Link](https://www.facebook.com/iCodeguru/videos/396304176791613)
99 |
100 | ### Module 2: Python for Machine Learning (Optional)
101 |
102 | - **Introduction to Python Programming (if needed)**
103 | - **Libraries for Data Analysis: Pandas, NumPy**
104 | - **Introduction to Privacy Libraries in Python**
105 | - **Libraries for Implementing Differential Privacy: PySyft, PyTorch Opacus**
106 | - [Class 6 Video Link](https://www.facebook.com/iCodeguru/videos/296444603534700)
107 | - [Class 7 Video Link](https://www.facebook.com/iCodeguru/videos/798917849033959)
108 | - [Class 8 Video Link](https://www.facebook.com/iCodeguru/videos/506803148528787)
109 |
110 | ### Module 3: Data Preprocessing and Feature Engineering
111 |
112 | - **Data Analysis and Preprocessing Techniques**
113 | - **Data Cleaning: Handling Missing Data, Categorical Features, Outliers**
114 | - **Data Visualization with Seaborn and Matplotlib**
115 | - **Feature Engineering: Feature Transformation, Selection, Construction, and Extraction**
116 | - **Dimensionality Reduction with PCA (Principal Component Analysis)**
117 | - **Privacy-Preserving Data Preprocessing**
118 | - **Anonymization Techniques**
119 | - **Privacy Risks in Data Preprocessing**
120 | - [Class 9 Video Link](https://www.facebook.com/iCodeguru/videos/503766618767054)
121 | - [Class 10 Video Link](https://www.facebook.com/iCodeguru/videos/3279815862154411)
122 | - [Class 11 Video Link](https://www.facebook.com/iCodeguru/videos/2512086508988678/)
123 | - [Class 12 Video Link](https://www.facebook.com/iCodeguru/videos/544365754583575/)
124 | - [Class 13 Video Link](https://www.facebook.com/iCodeguru/videos/842121497495971)
125 | - [Class 14 Video Link](https://www.facebook.com/iCodeguru/videos/364592216474021)
126 | - [Class 15 Video Link](https://www.facebook.com/iCodeguru/videos/1153940089225901)
127 | - [Class 16 Video Link](https://web.facebook.com/iCodeguru/videos/517459164056447)
128 | - [Class 17 Video Link](https://web.facebook.com/iCodeguru/videos/1007156847751102)
129 | - [Class 18 Video Link](https://www.facebook.com/iCodeguru/videos/782806057093677)
130 | - [Class 19 Video Link](https://www.facebook.com/iCodeguru/videos/876352897725102/)
131 |
132 | ### Module 4: Machine Learning Fundamentals
133 |
134 | - **Learning Approaches: Batch vs Online, Model-based vs Instance-based**
135 | - **Types of Machine Learning: Supervised, Unsupervised, Semi-Supervised, Reinforcement Learning**
136 | - **Privacy Risks in Different Learning Approaches**
137 | - **Supervised Learning: Risks of Label Leakage**
138 | - **Unsupervised Learning: Risks in Clustering and Association**
139 |
140 | ### Module 5: Supervised Learning Algorithms
141 |
142 | - **Introduction to Supervised Learning**
143 | - **Regression vs. Classification**
144 | - **Regression Algorithms: Simple Linear Regression, Multilinear Regression, Polynomial Regression (with applications like house price prediction)**
145 | - **Classification Algorithms: Decision Trees (Decision Tree Classifier, Random Forest), K-Nearest Neighbors (KNN), Naive Bayes, Support Vector Machines (SVM)**
146 | - **Differential Privacy in Supervised Learning**
147 | - **Noise Addition in Regression Models**
148 | - **Privacy-Preserving Decision Trees**
149 | - [Class 20 Video Link](https://www.facebook.com/iCodeguru/videos/1323471588452378)
150 | - [Class 21 Video Link](https://www.facebook.com/iCodeguru/videos/413629985029427)
151 | - [Class 22 Video Link](https://www.facebook.com/iCodeguru/videos/1018825216109320)
152 | - [Class 23 Video Link](https://www.facebook.com/iCodeguru/videos/544964651216277)
153 |
154 | ### Module 6: Model Evaluation and Optimization
155 |
156 | - **Regression and Classification Metrics**
157 | - **Imbalanced Data in Machine Learning**
158 | - **Underfitting vs Overfitting**
159 | - **Ensemble Methods: Bagging, Boosting**
160 | - **Hyperparameter Tuning**
161 | - **Privacy-Preserving Model Evaluation**
162 | - **Metrics for Assessing Privacy Risks**
163 | - **Differential Privacy in Model Optimization**
164 | - [Class 24 Video Link](https://www.facebook.com/iCodeguru/videos/1168843607708830)
165 | - [Class 25 Video Link](https://web.facebook.com/iCodeguru/videos/1930278197423471/)
166 | - [Class 26 Video Link](https://www.facebook.com/iCodeguru/videos/1243668773297682)
167 | - [Class 27 Video Link](https://www.facebook.com/iCodeguru/videos/1197321291583796)
168 |
169 | ### Module 7: Model Interpretation and Deployment
170 |
171 | - **Model Interpretability and Explainable AI (XAI)**
172 | - **Model Deployment with Flask (or similar framework)**
173 | - **Privacy Concerns in Model Interpretation**
174 | - **Risks of Exposing Sensitive Information through Interpretability**
175 | - **Privacy-Preserving Model Deployment**
176 | - **Secure Multi-Party Computation for Model Serving**
177 | - [Class 28 Video Link](https://web.facebook.com/iCodeguru/videos/1657940491631428)
178 | - [Class 29 Video Link](https://web.facebook.com/iCodeguru/videos/3763928390593592)
179 | - [Class 30 Video Link](https://web.facebook.com/iCodeguru/videos/4196521850634064)
180 | - [Class 31 Video Link](https://web.facebook.com/iCodeguru/videos/1202049024146854)
181 | - [Class 33 Video Link](https://web.facebook.com/iCodeguru/videos/1013797100283976)
182 | - [Class 34 Video Link](https://www.facebook.com/iCodeguru/videos/1834705580352310/)
183 |
184 |
--------------------------------------------------------------------------------
/Class22/column-transformer.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 75,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import numpy as np\n",
10 | "import pandas as pd"
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": 76,
16 | "metadata": {},
17 | "outputs": [],
18 | "source": [
19 | "from sklearn.impute import SimpleImputer\n",
20 | "from sklearn.preprocessing import OneHotEncoder\n",
21 | "from sklearn.preprocessing import OrdinalEncoder"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 77,
27 | "metadata": {},
28 | "outputs": [],
29 | "source": [
30 | "df = pd.read_csv('covid_toy.csv')"
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 27,
36 | "metadata": {},
37 | "outputs": [
38 | {
39 | "data": {
40 | "text/html": [
41 | "\n",
42 | "\n",
55 | "
\n",
56 | " \n",
57 | " \n",
58 | " | \n",
59 | " age | \n",
60 | " gender | \n",
61 | " fever | \n",
62 | " cough | \n",
63 | " city | \n",
64 | " has_covid | \n",
65 | "
\n",
66 | " \n",
67 | " \n",
68 | " \n",
69 | " | 0 | \n",
70 | " 60 | \n",
71 | " Male | \n",
72 | " 103.0 | \n",
73 | " Mild | \n",
74 | " Kolkata | \n",
75 | " No | \n",
76 | "
\n",
77 | " \n",
78 | " | 1 | \n",
79 | " 27 | \n",
80 | " Male | \n",
81 | " 100.0 | \n",
82 | " Mild | \n",
83 | " Delhi | \n",
84 | " Yes | \n",
85 | "
\n",
86 | " \n",
87 | " | 2 | \n",
88 | " 42 | \n",
89 | " Male | \n",
90 | " 101.0 | \n",
91 | " Mild | \n",
92 | " Delhi | \n",
93 | " No | \n",
94 | "
\n",
95 | " \n",
96 | " | 3 | \n",
97 | " 31 | \n",
98 | " Female | \n",
99 | " 98.0 | \n",
100 | " Mild | \n",
101 | " Kolkata | \n",
102 | " No | \n",
103 | "
\n",
104 | " \n",
105 | " | 4 | \n",
106 | " 65 | \n",
107 | " Female | \n",
108 | " 101.0 | \n",
109 | " Mild | \n",
110 | " Mumbai | \n",
111 | " No | \n",
112 | "
\n",
113 | " \n",
114 | "
\n",
115 | "
"
116 | ],
117 | "text/plain": [
118 | " age gender fever cough city has_covid\n",
119 | "0 60 Male 103.0 Mild Kolkata No\n",
120 | "1 27 Male 100.0 Mild Delhi Yes\n",
121 | "2 42 Male 101.0 Mild Delhi No\n",
122 | "3 31 Female 98.0 Mild Kolkata No\n",
123 | "4 65 Female 101.0 Mild Mumbai No"
124 | ]
125 | },
126 | "execution_count": 27,
127 | "metadata": {},
128 | "output_type": "execute_result"
129 | }
130 | ],
131 | "source": [
132 | "df.head()"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": 80,
138 | "metadata": {},
139 | "outputs": [
140 | {
141 | "data": {
142 | "text/plain": [
143 | "age 0\n",
144 | "gender 0\n",
145 | "fever 10\n",
146 | "cough 0\n",
147 | "city 0\n",
148 | "has_covid 0\n",
149 | "dtype: int64"
150 | ]
151 | },
152 | "execution_count": 80,
153 | "metadata": {},
154 | "output_type": "execute_result"
155 | }
156 | ],
157 | "source": [
158 | "df.isnull().sum()"
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": 29,
164 | "metadata": {},
165 | "outputs": [],
166 | "source": [
167 | "from sklearn.model_selection import train_test_split\n",
168 | "X_train,X_test,y_train,y_test = train_test_split(df.drop(columns=['has_covid']),df['has_covid'],\n",
169 | " test_size=0.2)"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": 81,
175 | "metadata": {},
176 | "outputs": [
177 | {
178 | "data": {
179 | "text/html": [
180 | "\n",
181 | "\n",
194 | "
\n",
195 | " \n",
196 | " \n",
197 | " | \n",
198 | " age | \n",
199 | " gender | \n",
200 | " fever | \n",
201 | " cough | \n",
202 | " city | \n",
203 | "
\n",
204 | " \n",
205 | " \n",
206 | " \n",
207 | " | 55 | \n",
208 | " 81 | \n",
209 | " Female | \n",
210 | " 101.0 | \n",
211 | " Mild | \n",
212 | " Mumbai | \n",
213 | "
\n",
214 | " \n",
215 | " | 76 | \n",
216 | " 80 | \n",
217 | " Male | \n",
218 | " 100.0 | \n",
219 | " Mild | \n",
220 | " Bangalore | \n",
221 | "
\n",
222 | " \n",
223 | " | 22 | \n",
224 | " 71 | \n",
225 | " Female | \n",
226 | " 98.0 | \n",
227 | " Strong | \n",
228 | " Kolkata | \n",
229 | "
\n",
230 | " \n",
231 | " | 93 | \n",
232 | " 27 | \n",
233 | " Male | \n",
234 | " 100.0 | \n",
235 | " Mild | \n",
236 | " Kolkata | \n",
237 | "
\n",
238 | " \n",
239 | " | 33 | \n",
240 | " 26 | \n",
241 | " Female | \n",
242 | " 98.0 | \n",
243 | " Mild | \n",
244 | " Kolkata | \n",
245 | "
\n",
246 | " \n",
247 | " | ... | \n",
248 | " ... | \n",
249 | " ... | \n",
250 | " ... | \n",
251 | " ... | \n",
252 | " ... | \n",
253 | "
\n",
254 | " \n",
255 | " | 2 | \n",
256 | " 42 | \n",
257 | " Male | \n",
258 | " 101.0 | \n",
259 | " Mild | \n",
260 | " Delhi | \n",
261 | "
\n",
262 | " \n",
263 | " | 51 | \n",
264 | " 11 | \n",
265 | " Female | \n",
266 | " 100.0 | \n",
267 | " Strong | \n",
268 | " Kolkata | \n",
269 | "
\n",
270 | " \n",
271 | " | 5 | \n",
272 | " 84 | \n",
273 | " Female | \n",
274 | " NaN | \n",
275 | " Mild | \n",
276 | " Bangalore | \n",
277 | "
\n",
278 | " \n",
279 | " | 40 | \n",
280 | " 49 | \n",
281 | " Female | \n",
282 | " 102.0 | \n",
283 | " Mild | \n",
284 | " Delhi | \n",
285 | "
\n",
286 | " \n",
287 | " | 19 | \n",
288 | " 42 | \n",
289 | " Female | \n",
290 | " NaN | \n",
291 | " Strong | \n",
292 | " Bangalore | \n",
293 | "
\n",
294 | " \n",
295 | "
\n",
296 | "
80 rows × 5 columns
\n",
297 | "
"
298 | ],
299 | "text/plain": [
300 | " age gender fever cough city\n",
301 | "55 81 Female 101.0 Mild Mumbai\n",
302 | "76 80 Male 100.0 Mild Bangalore\n",
303 | "22 71 Female 98.0 Strong Kolkata\n",
304 | "93 27 Male 100.0 Mild Kolkata\n",
305 | "33 26 Female 98.0 Mild Kolkata\n",
306 | ".. ... ... ... ... ...\n",
307 | "2 42 Male 101.0 Mild Delhi\n",
308 | "51 11 Female 100.0 Strong Kolkata\n",
309 | "5 84 Female NaN Mild Bangalore\n",
310 | "40 49 Female 102.0 Mild Delhi\n",
311 | "19 42 Female NaN Strong Bangalore\n",
312 | "\n",
313 | "[80 rows x 5 columns]"
314 | ]
315 | },
316 | "execution_count": 81,
317 | "metadata": {},
318 | "output_type": "execute_result"
319 | }
320 | ],
321 | "source": [
322 | "X_train"
323 | ]
324 | },
325 | {
326 | "cell_type": "markdown",
327 | "metadata": {},
328 | "source": [
329 | "## 1. Without Pipeline"
330 | ]
331 | },
332 | {
333 | "cell_type": "code",
334 | "execution_count": 83,
335 | "metadata": {},
336 | "outputs": [
337 | {
338 | "data": {
339 | "text/plain": [
340 | "(80, 1)"
341 | ]
342 | },
343 | "execution_count": 83,
344 | "metadata": {},
345 | "output_type": "execute_result"
346 | }
347 | ],
348 | "source": [
349 | "# adding simple imputer to fever col\n",
350 | "si = SimpleImputer()\n",
351 | "X_train_fever = si.fit_transform(X_train[['fever']])\n",
352 | "\n",
353 | "# also the test data\n",
354 | "X_test_fever = si.fit_transform(X_test[['fever']])\n",
355 | " \n",
356 | "X_train_fever.shape"
357 | ]
358 | },
359 | {
360 | "cell_type": "code",
361 | "execution_count": 85,
362 | "metadata": {},
363 | "outputs": [
364 | {
365 | "data": {
366 | "text/plain": [
367 | "(80, 1)"
368 | ]
369 | },
370 | "execution_count": 85,
371 | "metadata": {},
372 | "output_type": "execute_result"
373 | }
374 | ],
375 | "source": [
376 | "# Ordinalencoding -> cough\n",
377 | "oe = OrdinalEncoder(categories=[['Mild','Strong']])\n",
378 | "X_train_cough = oe.fit_transform(X_train[['cough']])\n",
379 | "\n",
380 | "# also the test data\n",
381 | "X_test_cough = oe.fit_transform(X_test[['cough']])\n",
382 | "\n",
383 | "X_train_cough.shape"
384 | ]
385 | },
386 | {
387 | "cell_type": "code",
388 | "execution_count": 87,
389 | "metadata": {},
390 | "outputs": [
391 | {
392 | "data": {
393 | "text/plain": [
394 | "(80, 4)"
395 | ]
396 | },
397 | "execution_count": 87,
398 | "metadata": {},
399 | "output_type": "execute_result"
400 | }
401 | ],
402 | "source": [
403 | "# OneHotEncoding -> gender,city\n",
404 | "ohe = OneHotEncoder(drop='first',sparse=False)\n",
405 | "X_train_gender_city = ohe.fit_transform(X_train[['gender','city']])\n",
406 | "\n",
407 | "# also the test data\n",
408 | "X_test_gender_city = ohe.fit_transform(X_test[['gender','city']])\n",
409 | "\n",
410 | "X_train_gender_city.shape"
411 | ]
412 | },
413 | {
414 | "cell_type": "code",
415 | "execution_count": 89,
416 | "metadata": {},
417 | "outputs": [
418 | {
419 | "data": {
420 | "text/plain": [
421 | "(80, 1)"
422 | ]
423 | },
424 | "execution_count": 89,
425 | "metadata": {},
426 | "output_type": "execute_result"
427 | }
428 | ],
429 | "source": [
430 | "# Extracting Age\n",
431 | "X_train_age = X_train.drop(columns=['gender','fever','cough','city']).values\n",
432 | "\n",
433 | "# also the test data\n",
434 | "X_test_age = X_test.drop(columns=['gender','fever','cough','city']).values\n",
435 | "\n",
436 | "X_train_age.shape"
437 | ]
438 | },
439 | {
440 | "cell_type": "code",
441 | "execution_count": 92,
442 | "metadata": {},
443 | "outputs": [
444 | {
445 | "data": {
446 | "text/plain": [
447 | "(80, 7)"
448 | ]
449 | },
450 | "execution_count": 92,
451 | "metadata": {},
452 | "output_type": "execute_result"
453 | }
454 | ],
455 | "source": [
456 | "X_train_transformed = np.concatenate((X_train_age,X_train_fever,X_train_gender_city,X_train_cough),axis=1)\n",
457 | "# also the test data\n",
458 | "X_test_transformed = np.concatenate((X_test_age,X_test_fever,X_test_gender_city,X_test_cough),axis=1)\n",
459 | "\n",
460 | "X_train_transformed.shape"
461 | ]
462 | },
463 | {
464 | "cell_type": "markdown",
465 | "metadata": {},
466 | "source": [
467 | "## With Pipeline"
468 | ]
469 | },
470 | {
471 | "cell_type": "code",
472 | "execution_count": 32,
473 | "metadata": {},
474 | "outputs": [],
475 | "source": [
476 | "from sklearn.compose import ColumnTransformer"
477 | ]
478 | },
479 | {
480 | "cell_type": "code",
481 | "execution_count": 95,
482 | "metadata": {},
483 | "outputs": [],
484 | "source": [
485 | "transformer = ColumnTransformer(transformers=[\n",
486 | " ('tnf1',SimpleImputer(),['fever']),\n",
487 | " ('tnf2',OrdinalEncoder(categories=[['Mild','Strong']]),['cough']),\n",
488 | " ('tnf3',OneHotEncoder(sparse=False,drop='first'),['gender','city'])\n",
489 | "],remainder='passthrough')"
490 | ]
491 | },
492 | {
493 | "cell_type": "code",
494 | "execution_count": 97,
495 | "metadata": {},
496 | "outputs": [
497 | {
498 | "data": {
499 | "text/plain": [
500 | "(80, 7)"
501 | ]
502 | },
503 | "execution_count": 97,
504 | "metadata": {},
505 | "output_type": "execute_result"
506 | }
507 | ],
508 | "source": [
509 | "transformer.fit_transform(X_train).shape"
510 | ]
511 | },
512 | {
513 | "cell_type": "code",
514 | "execution_count": 99,
515 | "metadata": {},
516 | "outputs": [
517 | {
518 | "data": {
519 | "text/plain": [
520 | "(20, 7)"
521 | ]
522 | },
523 | "execution_count": 99,
524 | "metadata": {},
525 | "output_type": "execute_result"
526 | }
527 | ],
528 | "source": [
529 | "transformer.transform(X_test).shape"
530 | ]
531 |   },
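532 |   {
533 |    "cell_type": "markdown",
534 |    "metadata": {},
535 |    "source": [
536 |     "In practice the `ColumnTransformer` is usually chained with an estimator inside an sklearn `Pipeline`, so that preprocessing and modelling run together in a single `fit`/`predict`. A minimal sketch (the `DecisionTreeClassifier` here is an illustrative choice, not part of the original notebook):"
537 |    ]
538 |   },
539 |   {
540 |    "cell_type": "code",
541 |    "execution_count": null,
542 |    "metadata": {},
543 |    "outputs": [],
544 |    "source": [
545 |     "from sklearn.pipeline import Pipeline\n",
546 |     "from sklearn.tree import DecisionTreeClassifier\n",
547 |     "\n",
548 |     "# fitting the pipeline refits the column transformer and the model on X_train\n",
549 |     "pipe = Pipeline(steps=[\n",
550 |     "    ('preprocess', transformer),\n",
551 |     "    ('model', DecisionTreeClassifier())\n",
552 |     "])\n",
553 |     "pipe.fit(X_train, y_train)\n",
554 |     "pipe.score(X_test, y_test)"
555 |    ]
556 |   }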
532 | ],
533 | "metadata": {
534 | "kernelspec": {
535 | "display_name": "Python 3",
536 | "language": "python",
537 | "name": "python3"
538 | },
539 | "language_info": {
540 | "codemirror_mode": {
541 | "name": "ipython",
542 | "version": 3
543 | },
544 | "file_extension": ".py",
545 | "mimetype": "text/x-python",
546 | "name": "python",
547 | "nbconvert_exporter": "python",
548 | "pygments_lexer": "ipython3",
549 | "version": "3.8.3"
550 | }
551 | },
552 | "nbformat": 4,
553 | "nbformat_minor": 4
554 | }
555 |
--------------------------------------------------------------------------------
/Class19/standardization_normalization.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## **Normalization**\n",
8 | "\n",
9 | "### **Theory**\n",
10 | "\n",
11 | "Normalization is the process of converting a numerical feature into a standard range of values. The range of values might be either [-1, 1] or [0, 1]. For example, think that we have a data set comprising two features named \"**Age**\" and the \"**Weight**\" as shown below:"
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": null,
17 | "metadata": {},
18 | "outputs": [],
19 | "source": [
20 | "import pandas as pd"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": null,
26 | "metadata": {},
27 | "outputs": [],
28 | "source": [
29 | "X = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]\n",
30 | "y = [5, 8, 13, 17, 27, 33, 36, 40, 50, 70, 78, 80, 100, 103, 108, 109, 113, 120, 123, 130]"
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": null,
36 | "metadata": {},
37 | "outputs": [
38 | {
39 | "data": {
40 | "text/html": [
41 | "\n",
42 | "\n",
55 | "
\n",
56 | " \n",
57 | " \n",
58 | " | \n",
59 | " Age | \n",
60 | " Weight | \n",
61 | "
\n",
62 | " \n",
63 | " \n",
64 | " \n",
65 | " | 0 | \n",
66 | " 5 | \n",
67 | " 5 | \n",
68 | "
\n",
69 | " \n",
70 | " | 1 | \n",
71 | " 10 | \n",
72 | " 8 | \n",
73 | "
\n",
74 | " \n",
75 | " | 2 | \n",
76 | " 15 | \n",
77 | " 13 | \n",
78 | "
\n",
79 | " \n",
80 | " | 3 | \n",
81 | " 20 | \n",
82 | " 17 | \n",
83 | "
\n",
84 | " \n",
85 | " | 4 | \n",
86 | " 25 | \n",
87 | " 27 | \n",
88 | "
\n",
89 | " \n",
90 | " | 5 | \n",
91 | " 30 | \n",
92 | " 33 | \n",
93 | "
\n",
94 | " \n",
95 | " | 6 | \n",
96 | " 35 | \n",
97 | " 36 | \n",
98 | "
\n",
99 | " \n",
100 | " | 7 | \n",
101 | " 40 | \n",
102 | " 40 | \n",
103 | "
\n",
104 | " \n",
105 | " | 8 | \n",
106 | " 45 | \n",
107 | " 50 | \n",
108 | "
\n",
109 | " \n",
110 | " | 9 | \n",
111 | " 50 | \n",
112 | " 70 | \n",
113 | "
\n",
114 | " \n",
115 | " | 10 | \n",
116 | " 55 | \n",
117 | " 78 | \n",
118 | "
\n",
119 | " \n",
120 | " | 11 | \n",
121 | " 60 | \n",
122 | " 80 | \n",
123 | "
\n",
124 | " \n",
125 | " | 12 | \n",
126 | " 65 | \n",
127 | " 100 | \n",
128 | "
\n",
129 | " \n",
130 | " | 13 | \n",
131 | " 70 | \n",
132 | " 103 | \n",
133 | "
\n",
134 | " \n",
135 | " | 14 | \n",
136 | " 75 | \n",
137 | " 108 | \n",
138 | "
\n",
139 | " \n",
140 | " | 15 | \n",
141 | " 80 | \n",
142 | " 109 | \n",
143 | "
\n",
144 | " \n",
145 | " | 16 | \n",
146 | " 85 | \n",
147 | " 113 | \n",
148 | "
\n",
149 | " \n",
150 | " | 17 | \n",
151 | " 90 | \n",
152 | " 120 | \n",
153 | "
\n",
154 | " \n",
155 | " | 18 | \n",
156 | " 95 | \n",
157 | " 123 | \n",
158 | "
\n",
159 | " \n",
160 | " | 19 | \n",
161 | " 100 | \n",
162 | " 130 | \n",
163 | "
\n",
164 | " \n",
165 | "
\n",
166 | "
"
167 | ],
168 | "text/plain": [
169 | " Age Weight\n",
170 | "0 5 5\n",
171 | "1 10 8\n",
172 | "2 15 13\n",
173 | "3 20 17\n",
174 | "4 25 27\n",
175 | "5 30 33\n",
176 | "6 35 36\n",
177 | "7 40 40\n",
178 | "8 45 50\n",
179 | "9 50 70\n",
180 | "10 55 78\n",
181 | "11 60 80\n",
182 | "12 65 100\n",
183 | "13 70 103\n",
184 | "14 75 108\n",
185 | "15 80 109\n",
186 | "16 85 113\n",
187 | "17 90 120\n",
188 | "18 95 123\n",
189 | "19 100 130"
190 | ]
191 | },
192 | "metadata": {},
193 | "output_type": "display_data"
194 | }
195 | ],
196 | "source": [
197 | "df = pd.DataFrame(list(zip(X, y)), columns =['Age', 'Weight'])\n",
198 | "df"
199 | ]
200 | },
201 | {
202 | "cell_type": "markdown",
203 | "metadata": {},
204 | "source": [
205 | "Suppose the actual range of a feature named \"**Age**\" is **5** to **100**. We can normalize these values into a range of **[0, 1]** by subtracting **5** from every value of the \"**Age**\" column and then dividing the result by **95** (100–5). To make things clear in your brain we can write the above as a formula.\n",
206 | "\n",
207 | "\n",
208 | "\n",
209 | "where min^(j) and max^(j) are the minimum and the maximum values of the feature j in the dataset.\n",
210 | "\n",
211 | "\n",
212 | "\n",
213 | "---\n",
214 | "\n"
215 | ]
216 | },
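217 |   {
218 |    "cell_type": "markdown",
219 |    "metadata": {},
220 |    "source": [
221 |     "As a quick sanity check of the formula, here is a minimal sketch applying it by hand to the \"**Age**\" column of `df` (this cell is illustrative and not part of the original walkthrough):"
222 |    ]
223 |   },
224 |   {
225 |    "cell_type": "code",
226 |    "execution_count": null,
227 |    "metadata": {},
228 |    "outputs": [],
229 |    "source": [
230 |     "# min-max normalization by the formula: (x - min) / (max - min)\n",
231 |     "age = df['Age']\n",
232 |     "age_normalized = (age - age.min()) / (age.max() - age.min())\n",
233 |     "age_normalized.head()  # expected: 0.0, 0.052631..., 0.105263..., ..."
234 |    ]
235 |   },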
217 | {
218 | "cell_type": "markdown",
219 | "metadata": {},
220 | "source": [
221 | "## **Implementation**\n",
222 | "\n",
223 | "Now that you know the theory behind it let's now see how to put it into production. As normal there are two ways to implement this: **Traditional Old school manual method** and the other using `sklearn preprocessing` library. Today let's take the help of `sklearn` library to perform normalization. \n",
224 | "\n",
225 | "\n",
226 | "### **Using sklearn preprocessing - Normalizer**\n",
227 | "\n",
228 | "\n",
229 | "Before feeding the \"**Age**\" and the \"**Weight**\" values directly to the method we need to convert these data frames into a `numpy` array. To do this we can use the `to_numpy()` method as shown below:"
230 | ]
231 | },
232 | {
233 | "cell_type": "code",
234 | "execution_count": null,
235 | "metadata": {},
236 | "outputs": [],
237 | "source": [
238 | "# Storing the columns Age values into X and Weight as Y\n",
239 | "X = df['Age']\n",
240 | "y = df['Weight']\n",
241 | "X = X.to_numpy()\n",
242 | "y = y.to_numpy()"
243 | ]
244 | },
245 | {
246 | "cell_type": "markdown",
247 | "metadata": {},
248 | "source": [
249 | "The above step is very important because of both the `fit()` and the `transform()` method works on an array."
250 | ]
251 | },
252 | {
253 | "cell_type": "code",
254 | "execution_count": null,
255 | "metadata": {},
256 | "outputs": [
257 | {
258 | "data": {
259 | "text/plain": [
260 | "array([[0.01866633, 0.03733267, 0.055999 , 0.07466534, 0.09333167,\n",
261 | " 0.11199801, 0.13066434, 0.14933068, 0.16799701, 0.18666335,\n",
262 | " 0.20532968, 0.22399602, 0.24266235, 0.26132869, 0.27999502,\n",
263 | " 0.29866136, 0.31732769, 0.33599403, 0.35466036, 0.3733267 ]])"
264 | ]
265 | },
266 | "metadata": {},
267 | "output_type": "display_data"
268 | }
269 | ],
270 | "source": [
271 | "from sklearn.preprocessing import Normalizer\n",
272 | "normalizer = Normalizer().fit([X])\n",
273 | "normalizer.transform([X])"
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "execution_count": null,
279 | "metadata": {},
280 | "outputs": [
281 | {
282 | "data": {
283 | "text/plain": [
284 | "array([[0.01394837, 0.02231739, 0.03626577, 0.04742446, 0.07532121,\n",
285 | " 0.09205925, 0.10042828, 0.11158697, 0.13948372, 0.1952772 ,\n",
286 | " 0.2175946 , 0.22317395, 0.27896743, 0.28733646, 0.30128483,\n",
287 | " 0.3040745 , 0.3152332 , 0.33476092, 0.34312994, 0.36265766]])"
288 | ]
289 | },
290 | "metadata": {},
291 | "output_type": "display_data"
292 | }
293 | ],
294 | "source": [
295 | "normalizer = Normalizer().fit([y])\n",
296 | "normalizer.transform([y])"
297 | ]
298 | },
299 | {
300 | "cell_type": "markdown",
301 | "metadata": {},
302 | "source": [
303 | "As seen above both the arrays have the values in the range **[0, 1]**. More details about the library can be found below:\n",
304 | "\n",
305 | "[Pre-processing data](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-normalization)\n",
306 | "\n",
307 | "\n",
308 | "\n",
309 | "---\n",
310 | "\n"
311 | ]
312 | },
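313 |   {
314 |    "cell_type": "markdown",
315 |    "metadata": {},
316 |    "source": [
317 |     "For the min-max formula itself, scikit-learn provides `MinMaxScaler`. A minimal sketch (illustrative, not part of the original walkthrough) that rescales a feature column into **[0, 1]**:"
318 |    ]
319 |   },
320 |   {
321 |    "cell_type": "code",
322 |    "execution_count": null,
323 |    "metadata": {},
324 |    "outputs": [],
325 |    "source": [
326 |     "from sklearn.preprocessing import MinMaxScaler\n",
327 |     "\n",
328 |     "# MinMaxScaler expects shape (n_samples, n_features), so reshape the\n",
329 |     "# 1-D array into a single column before scaling\n",
330 |     "mms = MinMaxScaler()\n",
331 |     "mms.fit_transform(X.reshape(-1, 1))[:5]  # expected: 0.0, 0.0526..., ..."
332 |    ]
333 |   },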
313 | {
314 | "cell_type": "markdown",
315 | "metadata": {},
316 | "source": [
317 | "## **When should we actually normalize the data?**\n",
318 | "\n",
319 | "Although normalization is not mandatory or a requirement (must-do thing). There are two ways it can help you which is\n",
320 | "\n",
321 | "\n",
322 | "\n",
323 | "1. Normalizing the data will **increase the speed of learning**. It will increase the speed both in building (training) and testing the data. Give it a try!!\n",
324 | "\n",
325 | "2. It will avoid **numeric overflow**. What is really means is that normalization will ensure that our inputs are roughly in a small relatively small range. This will avoid problems because computers usually have problems dealing with very small or very large numbers.\n",
326 | "\n",
327 | "\n",
328 | "\n",
329 | "---\n",
330 | "\n",
331 | "\n",
332 | "\n"
333 | ]
334 | },
335 | {
336 | "cell_type": "markdown",
337 | "metadata": {},
338 | "source": [
339 | "## **Standardization**\n",
340 | "\n",
341 | "### **Theory**\n",
342 | "\n",
343 | "Standardization or **z-score normalization** or **min-max scaling** is a technique of rescaling the values of a dataset such that they have the properties of a standard normal distribution with **μ** = 0 (mean - average values of the feature) and **σ** = 1 (standard deviation from the mean). This can be written as:\n",
344 | "\n",
345 | "\n",
346 | "\n",
347 | "## **When to standardize:**\n",
348 | "\n",
349 | "1️⃣ Linear distances Model in linear space \n",
350 | "\n",
351 | "Examples:\n",
352 | "- k-Nearest Neighbors (kNN)\n",
353 | "- Linear regression\n",
354 | "- K-Means Clustering\n",
355 | "\n",
356 | "2️⃣ Dataset features have high variance\n",
357 | "\n",
358 | "3️⃣ Different scales: Features are on different scales\n",
359 | "Example: Predicting house prices using no. bedrooms & last sale price. \n",
360 | "\n",
361 | "\n",
362 | "\n",
363 | "## **Implementation**\n",
364 | "\n",
365 | "Now there are plenty of ways to implement standardization, just as normalization, we can use `sklearn` library and use `StandardScalar` method as shown below:\n",
366 | "\n"
367 | ]
368 | },
369 | {
370 | "cell_type": "code",
371 | "execution_count": null,
372 | "metadata": {},
373 | "outputs": [
374 | {
375 | "data": {
376 | "text/plain": [
377 | "array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,\n",
378 | " 0., 0., 0., 0.]])"
379 | ]
380 | },
381 | "metadata": {},
382 | "output_type": "display_data"
383 | }
384 | ],
385 | "source": [
386 | "from sklearn.preprocessing import StandardScaler\n",
387 | "sc = StandardScaler()\n",
388 | "sc.fit_transform([X])\n",
389 | "sc.transform([X])\n",
390 | "sc.fit_transform([y])\n",
391 | "sc.transform([y])"
392 | ]
393 | },
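394 |   {
395 |    "cell_type": "markdown",
396 |    "metadata": {},
397 |    "source": [
398 |     "To standardize each feature properly, reshape it into a column vector of shape (n_samples, 1) before passing it to the scaler. A minimal corrected sketch (illustrative, not part of the original walkthrough); note that `StandardScaler` uses the population standard deviation (ddof=0), so the values differ slightly from the pandas `std()` (ddof=1) results further below:"
399 |    ]
400 |   },
401 |   {
402 |    "cell_type": "code",
403 |    "execution_count": null,
404 |    "metadata": {},
405 |    "outputs": [],
406 |    "source": [
407 |     "# one column per feature: shape (20, 1)\n",
408 |     "X_std = sc.fit_transform(X.reshape(-1, 1))\n",
409 |     "y_std = sc.fit_transform(y.reshape(-1, 1))\n",
410 |     "X_std[:3], y_std[:3]"
411 |    ]
412 |   },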
394 | {
395 | "cell_type": "markdown",
396 | "metadata": {},
397 | "source": [
398 | "You can read more about the library from below:\n",
399 | "\n",
400 | "[Pre-processing data](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling)\n",
401 | "\n",
402 | "\n",
403 | "\n",
404 | "---\n",
405 | "\n"
406 | ]
407 | },
408 | {
409 | "cell_type": "markdown",
410 | "metadata": {},
411 | "source": [
412 | "## **Z-Score Normalization**\n",
413 | "\n",
414 | "Similarly, we can use the pandas `mean` and `std` to do the needful"
415 | ]
416 | },
417 | {
418 | "cell_type": "code",
419 | "execution_count": null,
420 | "metadata": {},
421 | "outputs": [
422 | {
423 | "data": {
424 | "text/html": [
425 | "\n",
426 | "\n",
439 | "
\n",
440 | " \n",
441 | " \n",
442 | " | \n",
443 | " Age | \n",
444 | " Weight | \n",
445 | "
\n",
446 | " \n",
447 | " \n",
448 | " \n",
449 | " | 0 | \n",
450 | " -1.605793 | \n",
451 | " -1.458724 | \n",
452 | "
\n",
453 | " \n",
454 | " | 1 | \n",
455 | " -1.436762 | \n",
456 | " -1.389426 | \n",
457 | "
\n",
458 | " \n",
459 | " | 2 | \n",
460 | " -1.267731 | \n",
461 | " -1.273929 | \n",
462 | "
\n",
463 | " \n",
464 | " | 3 | \n",
465 | " -1.098701 | \n",
466 | " -1.181531 | \n",
467 | "
\n",
468 | " \n",
469 | " | 4 | \n",
470 | " -0.929670 | \n",
471 | " -0.950538 | \n",
472 | "
\n",
473 | " \n",
474 | " | 5 | \n",
475 | " -0.760639 | \n",
476 | " -0.811942 | \n",
477 | "
\n",
478 | " \n",
479 | " | 6 | \n",
480 | " -0.591608 | \n",
481 | " -0.742644 | \n",
482 | "
\n",
483 | " \n",
484 | " | 7 | \n",
485 | " -0.422577 | \n",
486 | " -0.650247 | \n",
487 | "
\n",
488 | " \n",
489 | " | 8 | \n",
490 | " -0.253546 | \n",
491 | " -0.419253 | \n",
492 | "
\n",
493 | " \n",
494 | " | 9 | \n",
495 | " -0.084515 | \n",
496 | " 0.042734 | \n",
497 | "
\n",
498 | " \n",
499 | " | 10 | \n",
500 | " 0.084515 | \n",
501 | " 0.227529 | \n",
502 | "
\n",
503 | " \n",
504 | " | 11 | \n",
505 | " 0.253546 | \n",
506 | " 0.273727 | \n",
507 | "
\n",
508 | " \n",
509 | " | 12 | \n",
510 | " 0.422577 | \n",
511 | " 0.735714 | \n",
512 | "
\n",
513 | " \n",
514 | " | 13 | \n",
515 | " 0.591608 | \n",
516 | " 0.805012 | \n",
517 | "
\n",
518 | " \n",
519 | " | 14 | \n",
520 | " 0.760639 | \n",
521 | " 0.920509 | \n",
522 | "
\n",
523 | " \n",
524 | " | 15 | \n",
525 | " 0.929670 | \n",
526 | " 0.943608 | \n",
527 | "
\n",
528 | " \n",
529 | " | 16 | \n",
530 | " 1.098701 | \n",
531 | " 1.036006 | \n",
532 | "
\n",
533 | " \n",
534 | " | 17 | \n",
535 | " 1.267731 | \n",
536 | " 1.197701 | \n",
537 | "
\n",
538 | " \n",
539 | " | 18 | \n",
540 | " 1.436762 | \n",
541 | " 1.266999 | \n",
542 | "
\n",
543 | " \n",
544 | " | 19 | \n",
545 | " 1.605793 | \n",
546 | " 1.428694 | \n",
547 | "
\n",
548 | " \n",
549 | "
\n",
550 | "
"
551 | ],
552 | "text/plain": [
553 | " Age Weight\n",
554 | "0 -1.605793 -1.458724\n",
555 | "1 -1.436762 -1.389426\n",
556 | "2 -1.267731 -1.273929\n",
557 | "3 -1.098701 -1.181531\n",
558 | "4 -0.929670 -0.950538\n",
559 | "5 -0.760639 -0.811942\n",
560 | "6 -0.591608 -0.742644\n",
561 | "7 -0.422577 -0.650247\n",
562 | "8 -0.253546 -0.419253\n",
563 | "9 -0.084515 0.042734\n",
564 | "10 0.084515 0.227529\n",
565 | "11 0.253546 0.273727\n",
566 | "12 0.422577 0.735714\n",
567 | "13 0.591608 0.805012\n",
568 | "14 0.760639 0.920509\n",
569 | "15 0.929670 0.943608\n",
570 | "16 1.098701 1.036006\n",
571 | "17 1.267731 1.197701\n",
572 | "18 1.436762 1.266999\n",
573 | "19 1.605793 1.428694"
574 | ]
575 | },
576 | "metadata": {},
577 | "output_type": "display_data"
578 | }
579 | ],
580 | "source": [
581 | "# Calculating the mean and standard deviation\n",
582 | "df = (df - df.mean())/df.std()\n",
583 | "df"
584 | ]
585 | },
586 | {
587 | "cell_type": "markdown",
588 | "metadata": {},
589 | "source": [
590 | "\n",
591 | "\n",
592 | "---\n",
593 | "\n"
594 | ]
595 | },
596 | {
597 | "cell_type": "markdown",
598 | "metadata": {},
599 | "source": [
600 | "## **Min-Max scaling**\n",
601 | "\n",
602 | "\n",
603 | "Here we can use pandas `min` and `max` to do the needful\n",
604 | "\n"
605 | ]
606 | },
607 | {
608 | "cell_type": "code",
609 | "execution_count": null,
610 | "metadata": {},
611 | "outputs": [
612 | {
613 | "data": {
614 | "text/html": [
615 | "\n",
616 | "\n",
629 | "
\n",
630 | " \n",
631 | " \n",
632 | " | \n",
633 | " Age | \n",
634 | " Weight | \n",
635 | "
\n",
636 | " \n",
637 | " \n",
638 | " \n",
639 | " | 0 | \n",
640 | " 0.000000 | \n",
641 | " 0.000 | \n",
642 | "
\n",
643 | " \n",
644 | " | 1 | \n",
645 | " 0.052632 | \n",
646 | " 0.024 | \n",
647 | "
\n",
648 | " \n",
649 | " | 2 | \n",
650 | " 0.105263 | \n",
651 | " 0.064 | \n",
652 | "
\n",
653 | " \n",
654 | " | 3 | \n",
655 | " 0.157895 | \n",
656 | " 0.096 | \n",
657 | "
\n",
658 | " \n",
659 | " | 4 | \n",
660 | " 0.210526 | \n",
661 | " 0.176 | \n",
662 | "
\n",
663 | " \n",
664 | " | 5 | \n",
665 | " 0.263158 | \n",
666 | " 0.224 | \n",
667 | "
\n",
668 | " \n",
669 | " | 6 | \n",
670 | " 0.315789 | \n",
671 | " 0.248 | \n",
672 | "
\n",
673 | " \n",
674 | " | 7 | \n",
675 | " 0.368421 | \n",
676 | " 0.280 | \n",
677 | "
\n",
678 | " \n",
679 | " | 8 | \n",
680 | " 0.421053 | \n",
681 | " 0.360 | \n",
682 | "
\n",
683 | " \n",
684 | " | 9 | \n",
685 | " 0.473684 | \n",
686 | " 0.520 | \n",
687 | "
\n",
688 | " \n",
689 | " | 10 | \n",
690 | " 0.526316 | \n",
691 | " 0.584 | \n",
692 | "
\n",
693 | " \n",
694 | " | 11 | \n",
695 | " 0.578947 | \n",
696 | " 0.600 | \n",
697 | "
\n",
698 | " \n",
699 | " | 12 | \n",
700 | " 0.631579 | \n",
701 | " 0.760 | \n",
702 | "
\n",
703 | " \n",
704 | " | 13 | \n",
705 | " 0.684211 | \n",
706 | " 0.784 | \n",
707 | "
\n",
708 | " \n",
709 | " | 14 | \n",
710 | " 0.736842 | \n",
711 | " 0.824 | \n",
712 | "
\n",
713 | " \n",
714 | " | 15 | \n",
715 | " 0.789474 | \n",
716 | " 0.832 | \n",
717 | "
\n",
718 | " \n",
719 | " | 16 | \n",
720 | " 0.842105 | \n",
721 | " 0.864 | \n",
722 | "
\n",
723 | " \n",
724 | " | 17 | \n",
725 | " 0.894737 | \n",
726 | " 0.920 | \n",
727 | "
\n",
728 | " \n",
729 | " | 18 | \n",
730 | " 0.947368 | \n",
731 | " 0.944 | \n",
732 | "
\n",
733 | " \n",
734 | " | 19 | \n",
735 | " 1.000000 | \n",
736 | " 1.000 | \n",
737 | "
\n",
738 | " \n",
739 | "
\n",
740 | "
"
741 | ],
742 | "text/plain": [
743 | " Age Weight\n",
744 | "0 0.000000 0.000\n",
745 | "1 0.052632 0.024\n",
746 | "2 0.105263 0.064\n",
747 | "3 0.157895 0.096\n",
748 | "4 0.210526 0.176\n",
749 | "5 0.263158 0.224\n",
750 | "6 0.315789 0.248\n",
751 | "7 0.368421 0.280\n",
752 | "8 0.421053 0.360\n",
753 | "9 0.473684 0.520\n",
754 | "10 0.526316 0.584\n",
755 | "11 0.578947 0.600\n",
756 | "12 0.631579 0.760\n",
757 | "13 0.684211 0.784\n",
758 | "14 0.736842 0.824\n",
759 | "15 0.789474 0.832\n",
760 | "16 0.842105 0.864\n",
761 | "17 0.894737 0.920\n",
762 | "18 0.947368 0.944\n",
763 | "19 1.000000 1.000"
764 | ]
765 | },
766 | "metadata": {},
767 | "output_type": "display_data"
768 | }
769 | ],
770 | "source": [
771 | "# Calculating the minimum and the maximum \n",
772 | "df = (df-df.min())/(df.max()-df.min())\n",
773 | "df"
774 | ]
775 | },
776 | {
777 | "cell_type": "markdown",
778 | "metadata": {},
779 | "source": [
780 | "Usually, the **Z-score normalization** is preferred because min-max scaling is prone for **overfitting**.\n",
781 | "\n",
782 | "\n"
783 | ]
784 | },
785 | {
786 | "cell_type": "markdown",
787 | "metadata": {},
788 | "source": [
789 | "\n",
790 | "\n",
791 | "---\n",
792 | "**References**\n",
793 | "- [The Hundred-Page Machine Learning Book by Andriy Burkov](http://themlbook.com/)\" (Chapter 5) \n",
794 | "\n",
795 | "- Datacamp.com\n"
796 | ]
797 | },
798 | {
799 | "cell_type": "markdown",
800 | "metadata": {},
801 | "source": []
802 | }
803 | ],
804 | "metadata": {
805 | "language_info": {
806 | "name": "python"
807 | }
808 | },
809 | "nbformat": 4,
810 | "nbformat_minor": 2
811 | }
812 |
--------------------------------------------------------------------------------
/Class22/house_price_prediction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "id": "tsEW-cvKhYKa"
7 | },
8 | "source": [
9 | "## **Import Libraries**"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": null,
15 | "metadata": {
16 | "id": "Ei32NUlu1Vz-"
17 | },
18 | "outputs": [],
19 | "source": [
20 | "import pandas as pd\n",
21 | "import numpy as np\n",
22 | "import matplotlib.pyplot as plt\n",
23 | "%matplotlib inline\n",
24 | "import sklearn\n",
25 | "import seaborn as sns\n",
26 | "import warnings\n",
27 | "warnings.filterwarnings('ignore')\n",
28 | "plt.rcParams[\"figure.figsize\"] = [10,8]"
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": null,
34 | "metadata": {
35 | "id": "ubQCL-d6iaoa"
36 | },
37 | "outputs": [],
38 | "source": [
39 | "import warnings\n",
40 | "warnings.simplefilter(action = 'ignore', category = FutureWarning)"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {
46 | "id": "PdwNo1tDkX5Y"
47 | },
48 | "source": [
49 | "# **Load the dataset**"
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": null,
55 | "metadata": {
56 | "id": "_F1V3ZpukWhw"
57 | },
58 | "outputs": [],
59 | "source": [
60 | "df = pd.read_csv('../Datasets/USA_Housing.csv')"
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "execution_count": null,
66 | "metadata": {
67 | "colab": {
68 | "base_uri": "https://localhost:8080/",
69 | "height": 320
70 | },
71 | "id": "klIECXfVkpjO",
72 | "outputId": "2cfe50a2-d6ed-434d-f0ae-7a861fb1b66e"
73 | },
74 | "outputs": [],
75 | "source": [
76 | "df.head()"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": null,
82 | "metadata": {
83 | "colab": {
84 | "base_uri": "https://localhost:8080/"
85 | },
86 | "id": "EiJukW6-k0tQ",
87 | "outputId": "8d2bbf5e-dcff-4c49-f662-69dff504478e"
88 | },
89 | "outputs": [],
90 | "source": [
91 | "df.shape"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": null,
97 | "metadata": {
98 | "colab": {
99 | "base_uri": "https://localhost:8080/",
100 | "height": 300
101 | },
102 | "id": "8FsIGn7kv_el",
103 | "outputId": "be07eff8-b0fd-438d-f9f3-458bb2948aac"
104 | },
105 | "outputs": [],
106 | "source": [
107 | "df.describe()"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": null,
113 | "metadata": {
114 | "colab": {
115 | "base_uri": "https://localhost:8080/"
116 | },
117 | "id": "ljQTZZp_k-cn",
118 | "outputId": "8ce8d16d-cbf3-4b52-ab69-eb8c839616ad"
119 | },
120 | "outputs": [],
121 | "source": [
122 | "df.nunique()"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": null,
128 | "metadata": {
129 | "colab": {
130 | "base_uri": "https://localhost:8080/"
131 | },
132 | "id": "6YR1GDfJlFd1",
133 | "outputId": "94390eb0-c643-4442-ac88-851ac722bb26"
134 | },
135 | "outputs": [],
136 | "source": [
137 | "df.isnull().sum()"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": null,
143 | "metadata": {
144 | "colab": {
145 | "base_uri": "https://localhost:8080/"
146 | },
147 | "id": "CIgPPl2VlK2P",
148 | "outputId": "d7891866-14b1-4a27-f586-891d8deb4c07"
149 | },
150 | "outputs": [],
151 | "source": [
152 | "df.info()"
153 | ]
154 | },
155 | {
156 | "cell_type": "markdown",
157 | "metadata": {
158 | "id": "t8sUDI3N3wi3"
159 | },
160 | "source": [
161 | "## **1. Perform EDA on the dataset which should include**"
162 | ]
163 | },
164 | {
165 | "cell_type": "markdown",
166 | "metadata": {
167 | "id": "kYYFF1KImtAc"
168 | },
169 | "source": [
170 | "### **a. Visualization** and explore the data using seaborn\n",
171 | "#### **i.** Add your findings about the data under each graph in the colab notebook"
172 | ]
173 | },
174 | {
175 | "cell_type": "code",
176 | "execution_count": null,
177 | "metadata": {
178 | "colab": {
179 | "base_uri": "https://localhost:8080/",
180 | "height": 718
181 | },
182 | "id": "p8VloUz7nHGa",
183 | "outputId": "0a63f226-1e46-43ce-9431-a0de1d33eda9"
184 | },
185 | "outputs": [],
186 | "source": [
187 | "sns.histplot(data=df['Area Population'], kde=False)\n",
188 | "plt.title(\"Histogram of Area Population\")\n",
189 | "plt.show()"
190 | ]
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "metadata": {
195 | "id": "k7saEr454BaV"
196 | },
197 | "source": [
198 | "- This histogram should the relation of population with the number of houses. It shows that as the number of houses increase in a particular area its population also increases."
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "metadata": {
205 | "colab": {
206 | "base_uri": "https://localhost:8080/",
207 | "height": 681
208 | },
209 | "id": "p25VE_2GhypD",
210 | "outputId": "14bd2608-b902-468f-c343-9dac47f656b5"
211 | },
212 | "outputs": [],
213 | "source": [
214 | "sns.boxenplot(data=df['Area Population'])\n",
215 | "plt.show()"
216 | ]
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": null,
221 | "metadata": {
222 | "colab": {
223 | "base_uri": "https://localhost:8080/",
224 | "height": 718
225 | },
226 | "id": "ydPPZ4SGwZBe",
227 | "outputId": "615a9ed4-5d2f-4ba6-8872-ea44e35a625d"
228 | },
229 | "outputs": [],
230 | "source": [
231 | "sns.histplot(data=df['Price'], kde=True)\n",
232 | "plt.title(\"Histogram of Price\")\n",
233 | "plt.show()"
234 | ]
235 | },
236 | {
237 | "cell_type": "markdown",
238 | "metadata": {
239 | "id": "QGCyhwQ_4fFJ"
240 | },
241 | "source": [
242 | "- This histogram shows the relationship of number of houses with price. It clearly shows that if the number of houses increases in a particular area then its price also increases."
243 | ]
244 | },
245 | {
246 | "cell_type": "code",
247 | "execution_count": null,
248 | "metadata": {
249 | "colab": {
250 | "base_uri": "https://localhost:8080/",
251 | "height": 727
252 | },
253 | "id": "PdpzfJJowz2g",
254 | "outputId": "b79c7c10-973a-4f4c-a339-3fd03f6c6d1a"
255 | },
256 | "outputs": [],
257 | "source": [
258 | "sns.set_theme(style=\"darkgrid\")\n",
259 | "sns.regplot(data=df, x='Avg. Area Number of Rooms', y='Price')\n",
260 | "plt.title(\"Regplot of Rooms and Price\")\n",
261 | "plt.show()"
262 | ]
263 | },
264 | {
265 | "cell_type": "markdown",
266 | "metadata": {
267 | "id": "opTZEdz58dq0"
268 | },
269 | "source": [
270 | "- This regplot shows that there is negative relationship between Avg. Area number of Rooms and Price. It means, that if in a particular area there are more number of rooms then its price is low or vice versa."
271 | ]
272 | },
273 | {
274 | "cell_type": "code",
275 | "execution_count": null,
276 | "metadata": {
277 | "colab": {
278 | "base_uri": "https://localhost:8080/",
279 | "height": 703
280 | },
281 | "id": "oW5qDmla8815",
282 | "outputId": "2e49f3f9-4379-4ccd-9b8d-ea9586f5562e"
283 | },
284 | "outputs": [],
285 | "source": [
286 | "sns.set_theme(style=\"whitegrid\")\n",
287 | "# Sample size you want for visualization (e.g., 1000 data points)\n",
288 | "sample_size = 2000\n",
289 | "# Randomly sample data from the DataFrame\n",
290 | "sampled_data = df.sample(n=sample_size, random_state=42)\n",
291 | "# Now create the box plot using the sampled data\n",
292 | "plt.boxplot(sampled_data['Area Population'])\n",
293 | "plt.title('Box Plot')\n",
294 | "plt.show()"
295 | ]
296 | },
297 | {
298 | "cell_type": "markdown",
299 | "metadata": {
300 | "id": "WL9Njft9NcU4"
301 | },
302 | "source": [
303 | "- In this box plot, i have used sample values which means that I have not used all the data of the **Area Population** columnn due to larger number of rows. It is showing outliers, it means that some areas have much larger of population and some areas have much lesser population and these are outliers in our **Area population** column."
304 | ]
305 | },
306 | {
307 | "cell_type": "code",
308 | "execution_count": null,
309 | "metadata": {
310 | "colab": {
311 | "base_uri": "https://localhost:8080/",
312 | "height": 727
313 | },
314 | "id": "igqZmXQJCN8p",
315 | "outputId": "63ca4b6d-46c7-48f2-bbd9-27abe43abee3"
316 | },
317 | "outputs": [],
318 | "source": [
319 | "# Sample size you want for visualization (e.g., 1000 data points)\n",
320 | "sample_size = 2000\n",
321 | "\n",
322 | "# Assuming 'x_column' and 'y_column' are the column names for your x and y values in the DataFrame 'df'\n",
323 | "# Randomly sample data from the DataFrame\n",
324 | "sampled_data = df.sample(n=sample_size, random_state=42)\n",
325 | "\n",
326 | "# Use the sampled data for x and y values\n",
327 | "x_values = sampled_data['Avg. Area Number of Rooms']\n",
328 | "y_values = sampled_data['Area Population']\n",
329 | "\n",
330 | "# Plot the line with sampled x and y values\n",
331 | "plt.scatter(x_values, y_values)\n",
332 | "plt.title('Line Plot with Sampled Data')\n",
333 | "\n",
334 | "plt.xlabel(\"Avg. Area Number of Rooms\")\n",
335 | "plt.ylabel(\"Area Population\")\n",
336 | "# Show the plot\n",
337 | "plt.show()"
338 | ]
339 | },
340 | {
341 | "cell_type": "code",
342 | "execution_count": null,
343 | "metadata": {
344 | "colab": {
345 | "base_uri": "https://localhost:8080/",
346 | "height": 703
347 | },
348 | "id": "nS-tq6PVG0ZY",
349 | "outputId": "12d36cf8-9b3b-4340-c19d-75b597304da4"
350 | },
351 | "outputs": [],
352 | "source": [
353 | "sns.violinplot(data=df['Avg. Area House Age'])\n",
354 | "plt.title(\"Violin plot of Avg. Area House Age\")\n",
355 | "plt.show()"
356 | ]
357 | },
358 | {
359 | "cell_type": "markdown",
360 | "metadata": {
361 | "id": "M-iubeutaz0O"
362 | },
363 | "source": [
364 | "- The above violin plot shows the distribution for Average Area house age, it means that there are more older houses."
365 | ]
366 | },
367 | {
368 | "cell_type": "code",
369 | "execution_count": null,
370 | "metadata": {
371 | "colab": {
372 | "base_uri": "https://localhost:8080/",
373 | "height": 727
374 | },
375 | "id": "O7OXXdnnHxWz",
376 | "outputId": "02e25011-0c3e-4e85-c02f-c904042f47e7"
377 | },
378 | "outputs": [],
379 | "source": [
380 | "sns.lineplot(data = df, x='Avg. Area House Age', y='Price')\n",
381 | "plt.title(\"Line plot of Average Area House Age and Price\")\n",
382 | "plt.show()"
383 | ]
384 | },
385 | {
386 | "cell_type": "markdown",
387 | "metadata": {
388 | "id": "aIDUIlyfIKWv"
389 | },
390 | "source": [
391 | "- Due to sheer number of rows, it is bit difficult to understand but if we analyze closely then it is showing that as the Average area house age increases then price also increases. In simple words, more old houses have more price as compared to others."
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": null,
397 | "metadata": {
398 | "colab": {
399 | "base_uri": "https://localhost:8080/",
400 | "height": 727
401 | },
402 | "id": "Cfr9qru9I1bF",
403 | "outputId": "c4a0c5db-b520-4f01-f2f5-238ac7fa68d2"
404 | },
405 | "outputs": [],
406 | "source": [
407 | "sns.lineplot(data = df, x='Avg. Area Number of Rooms', y='Price')\n",
408 | "plt.title(\"Line plot of Average Area Number of Rooms and Price\")\n",
409 | "plt.show()"
410 | ]
411 | },
412 | {
413 | "cell_type": "markdown",
414 | "metadata": {
415 | "id": "mdpYY911JX5c"
416 | },
417 | "source": [
418 | "- Same is the case with Average Area Number of Rooms. As number of rooms increases, price also increases."
419 | ]
420 | },
421 | {
422 | "cell_type": "code",
423 | "execution_count": null,
424 | "metadata": {
425 | "colab": {
426 | "base_uri": "https://localhost:8080/",
427 | "height": 524
428 | },
429 | "id": "obeHublpO2lk",
430 | "outputId": "9b7fd217-3216-42a0-fc96-6376d7446869"
431 | },
432 | "outputs": [],
433 | "source": [
434 | "plt.figure(figsize=(10,8))\n",
435 | "sns.lmplot(data=df, x='Avg. Area Income', y='Price', aspect=1.5)\n",
436 | "plt.title(\"lmplot of Price and Income\")\n",
437 | "plt.show()"
438 | ]
439 | },
440 | {
441 | "cell_type": "markdown",
442 | "metadata": {
443 | "id": "eErRhcmXUkzD"
444 | },
445 | "source": [
446 | "- It shows that there is a positive relationship between Price and Income."
447 | ]
448 | },
449 | {
450 | "cell_type": "code",
451 | "execution_count": null,
452 | "metadata": {
453 | "colab": {
454 | "base_uri": "https://localhost:8080/",
455 | "height": 930
456 | },
457 | "id": "KgyusPRkVCRm",
458 | "outputId": "b3b9d058-7f84-4263-e9d9-8b1958dd1717"
459 | },
460 | "outputs": [],
461 | "source": [
462 | "sns.heatmap(df.corr(), annot=True, cmap=\"viridis\", cbar=True, fmt='.2f')\n",
463 | "plt.title(\"Heatmap\")\n",
464 | "plt.show()"
465 | ]
466 | },
467 | {
468 | "cell_type": "markdown",
469 | "metadata": {
470 | "id": "fupSHTF7WvB_"
471 | },
472 | "source": [
473 | "- Heatmap shows that there is not as much strong correlation between most of the features, and mostly the features are independent to each other."
474 | ]
475 | },
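476 |   {
477 |    "cell_type": "markdown",
478 |    "metadata": {},
479 |    "source": [
480 |     "To read the heatmap numerically, we can sort each feature's correlation with the target. A minimal sketch (illustrative, not part of the original assignment):"
481 |    ]
482 |   },
483 |   {
484 |    "cell_type": "code",
485 |    "execution_count": null,
486 |    "metadata": {},
487 |    "outputs": [],
488 |    "source": [
489 |     "# correlation of every feature with Price, strongest first\n",
490 |     "df.corr()['Price'].sort_values(ascending=False)"
491 |    ]
492 |   },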
476 | {
477 | "cell_type": "code",
478 | "execution_count": null,
479 | "metadata": {
480 | "colab": {
481 | "base_uri": "https://localhost:8080/",
482 | "height": 1000
483 | },
484 | "id": "9F2x6-wqZyGq",
485 | "outputId": "1f00845c-5aa9-4934-b45d-5fc467f98a41"
486 | },
487 | "outputs": [],
488 | "source": [
489 | "sns.pairplot(data=df)\n",
490 | "plt.title(\"Pair plot\")\n",
491 | "plt.show()"
492 | ]
493 | },
494 | {
495 | "cell_type": "markdown",
496 | "metadata": {
497 | "id": "YY4cYVX-aR0C"
498 | },
499 | "source": [
500 | "- It shows the relationship of each feature in the dataset. Here, we can see the correlation at a single place for all features."
501 | ]
502 | },
503 | {
504 | "cell_type": "markdown",
505 | "metadata": {
506 | "id": "O6UWaxzVmMp5"
507 | },
508 | "source": [
509 | "### **b. Identify the data patterns** if exist for single/multiple variables\n",
510 | "#### **i.** Write your findings under the plots or code that identify the pattern"
511 | ]
512 | },
513 | {
514 | "cell_type": "code",
515 | "execution_count": null,
516 | "metadata": {
517 | "id": "HkPUZeI1igyi"
518 | },
519 | "outputs": [],
520 | "source": [
521 | "sns.histplot(data=df['Area Population'], kde=False)\n",
522 | "plt.title(\"Histogram of Area Population\")\n",
523 | "plt.show()"
524 | ]
525 | },
526 | {
527 | "cell_type": "markdown",
528 | "metadata": {
529 | "id": "BFMAnMyjjESs"
530 | },
531 | "source": [
532 | "- This histogram should the relation of population with the number of houses. It shows that as the number of houses increase in a particular area its population also increases."
533 | ]
534 | },
535 | {
536 | "cell_type": "code",
537 | "execution_count": null,
538 | "metadata": {
539 | "id": "nIgFsJIfjGnW"
540 | },
541 | "outputs": [],
542 | "source": [
543 | "sns.set_theme(style=\"darkgrid\")\n",
544 | "sns.regplot(data=df, x='Avg. Area Number of Rooms', y='Price')\n",
545 | "plt.title(\"Regplot of Rooms and Price\")\n",
546 | "plt.show()"
547 | ]
548 | },
549 | {
550 | "cell_type": "markdown",
551 | "metadata": {
552 | "id": "SfaqEpBhi64T"
553 | },
554 | "source": [
555 | "- This regplot shows that there is negative relationship between Avg. Area number of Rooms and Price. It means, that if in a particular area there are more number of rooms then its price is low or vice versa."
556 | ]
557 | },
558 | {
559 | "cell_type": "code",
560 | "execution_count": null,
561 | "metadata": {
562 | "id": "WBZO0houj9g3"
563 | },
564 | "outputs": [],
565 | "source": [
566 | "plt.figure(figsize=(10,8))\n",
567 | "sns.lmplot(data=df, x='Avg. Area Income', y='Price', aspect=1.5)\n",
568 | "plt.title(\"lmplot of Price and Income\")\n",
569 | "plt.show()"
570 | ]
571 | },
572 | {
573 | "cell_type": "markdown",
574 | "metadata": {
575 | "id": "-Tjsg8XGj-iR"
576 | },
577 | "source": [
578 | "- It shows that there is a positive relationship between Price and Income."
579 | ]
580 | },
581 | {
582 | "cell_type": "markdown",
583 | "metadata": {
584 | "id": "2vGtDG5JlQl4"
585 | },
586 | "source": [
587 | "### **c. Clean the dataset,** remove the missing values\n",
588 | "#### i. Explain your approach in the colab notebook cell"
589 | ]
590 | },
591 | {
592 | "cell_type": "code",
593 | "execution_count": null,
594 | "metadata": {
595 | "colab": {
596 | "base_uri": "https://localhost:8080/",
597 | "height": 930
598 | },
599 | "id": "_SBFHQ8Blv-h",
600 | "outputId": "709e7baf-ca0f-42d2-ce14-1ee531e9988a"
601 | },
602 | "outputs": [],
603 | "source": [
604 | "# heatmap to see the missing values in the dataset\n",
605 | "sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap=\"viridis\")\n",
606 | "plt.title(\"Missing Data\")\n",
607 | "plt.show()"
608 | ]
609 | },
610 | {
611 | "cell_type": "code",
612 | "execution_count": null,
613 | "metadata": {
614 | "colab": {
615 | "base_uri": "https://localhost:8080/"
616 | },
617 | "id": "l1MGR2Hvnvqv",
618 | "outputId": "81b96554-52f8-4407-a935-bf75a635ecbd"
619 | },
620 | "outputs": [],
621 | "source": [
622 | "df.isnull().sum()"
623 | ]
624 | },
625 | {
626 | "cell_type": "code",
627 | "execution_count": null,
628 | "metadata": {
629 | "id": "m30x1pw40zxQ"
630 | },
631 | "outputs": [],
632 | "source": [
633 | "df.drop('Address', axis = 1, inplace = True)"
634 | ]
635 | },
636 | {
637 | "cell_type": "code",
638 | "execution_count": null,
639 | "metadata": {
640 | "colab": {
641 | "base_uri": "https://localhost:8080/",
642 | "height": 206
643 | },
644 | "id": "a6OLd_hToQj9",
645 | "outputId": "ad164a9c-6a9f-40d6-9ee5-cd4738fe128e"
646 | },
647 | "outputs": [],
648 | "source": [
649 | "df.dropna(inplace = True)\n",
650 | "df.head()"
651 | ]
652 | },
653 | {
654 | "cell_type": "markdown",
655 | "metadata": {
656 | "id": "1fuBZNJTn1Oh"
657 | },
658 | "source": [
659 | "- **My Approach:**\\\n",
660 | "I have used for heatmap and .isnull(), but there are no missing values in the dataset. Still, I have used .dropna() function for precaution to drop any missing values from dataset."
661 | ]
662 | },
663 | {
664 | "cell_type": "markdown",
665 | "metadata": {
666 | "id": "KDtA6x4Ton9w"
667 | },
668 | "source": [
669 | "### **d. Select the target variable** and clearly mention reason for selecting it"
670 | ]
671 | },
672 | {
673 | "cell_type": "markdown",
674 | "metadata": {
675 | "id": "6C8HYtVzs8jF"
676 | },
677 | "source": [
678 | "**Target variable:**\\\n",
679 | "Price\\\n",
680 | "**Reason:**\\\n",
681 | "I am selecting **price** as my target variable, because the algorithms we are going to use are regressors and they are used for predicting numerical values. So, I think **price** is a better choice as target variable."
682 | ]
683 | },
684 | {
685 | "cell_type": "code",
686 | "execution_count": null,
687 | "metadata": {
688 | "id": "PFVU40sXzmwZ"
689 | },
690 | "outputs": [],
691 | "source": [
692 | "x = df.drop('Price', axis = 1)\n",
693 | "y = df['Price']"
694 | ]
695 | },
696 | {
697 | "cell_type": "markdown",
698 | "metadata": {
699 | "id": "exNahBtyvivw"
700 | },
701 | "source": [
702 | "### **e. Transform the Dataset**\n",
703 | "#### i. Transform the whole dataset (Features, Target Variable)"
704 | ]
705 | },
706 | {
707 | "cell_type": "code",
708 | "execution_count": null,
709 | "metadata": {
710 | "id": "0d2PpAiFpqeN"
711 | },
712 | "outputs": [],
713 | "source": [
714 | "from sklearn import preprocessing"
715 | ]
716 | },
717 | {
718 | "cell_type": "code",
719 | "execution_count": null,
720 | "metadata": {
721 | "id": "VIdzQqr90JPC"
722 | },
723 | "outputs": [],
724 | "source": [
725 | "# Transforming features\n",
726 | "pre_process_x = preprocessing.StandardScaler().fit(x)\n",
727 | "x_transform = pre_process_x.fit_transform(x)"
728 | ]
729 | },
730 | {
731 | "cell_type": "code",
732 | "execution_count": null,
733 | "metadata": {
734 | "id": "ccCOLrbyukwZ"
735 | },
736 | "outputs": [],
737 | "source": [
738 | "# Transforming target variable\n",
739 | "y_array = y.to_numpy()\n",
740 | "y_reshaped_column = y_array.reshape(-1, 1)\n",
741 | "pre_process_y = preprocessing.StandardScaler().fit(y_reshaped_column)\n",
742 | "y_transform = pre_process_y.fit_transform(y_reshaped_column)"
743 | ]
744 | },
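745 |   {
746 |    "cell_type": "markdown",
747 |    "metadata": {},
748 |    "source": [
749 |     "Because the target is now in standardized units, any predictions made against `y_transform` can be mapped back to dollars with the fitted scaler's `inverse_transform`. A minimal sketch (illustrative; `scaled_predictions` is a hypothetical stand-in for model output on the scaled target):"
750 |    ]
751 |   },
752 |   {
753 |    "cell_type": "code",
754 |    "execution_count": null,
755 |    "metadata": {},
756 |    "outputs": [],
757 |    "source": [
758 |     "# map standardized values back to the original price scale\n",
759 |     "scaled_predictions = y_transform[:5]  # hypothetical model output\n",
760 |     "pre_process_y.inverse_transform(scaled_predictions)"
761 |    ]
762 |   },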
745 | {
746 | "cell_type": "markdown",
747 | "metadata": {
748 | "id": "BGx7uKBPzOim"
749 | },
750 | "source": [
751 | "### **f. Split the Dataset** into train and test set"
752 | ]
753 | },
754 | {
755 | "cell_type": "code",
756 | "execution_count": null,
757 | "metadata": {
758 | "id": "hIbATgsozu6p"
759 | },
760 | "outputs": [],
761 | "source": [
762 | "from sklearn.model_selection import train_test_split\n",
763 | "x_train, x_test, y_train, y_test = train_test_split(x_transform, y_transform, test_size = .20, random_state=101)"
764 | ]
765 | },
766 | {
767 | "cell_type": "markdown",
768 | "metadata": {
769 | "id": "Jk2ipUda17i0"
770 | },
771 | "source": [
772 | "## **2. Use the Scikit Learn Library to fit the Regression Models**"
773 | ]
774 | },
775 | {
776 | "cell_type": "markdown",
777 | "metadata": {
778 | "id": "G8f5zMk02vMK"
779 | },
780 | "source": [
781 | "### **a.** Use the different regression models\n",
782 | "#### **i.** Linear regression\n",
783 | "#### **ii.** Decision Tree Regressor\n",
784 | "#### **iii.** Random forest Regressor\n",
785 | "#### **iv.** Gradient boosting Regressor"
786 | ]
787 | },
788 | {
789 | "cell_type": "markdown",
790 | "metadata": {
791 | "id": "CdSi-KO53quh"
792 | },
793 | "source": [
794 | "## **i. Linear Regression**"
795 | ]
796 | },
797 | {
798 | "cell_type": "code",
799 | "execution_count": null,
800 | "metadata": {
801 | "colab": {
802 | "base_uri": "https://localhost:8080/",
803 | "height": 75
804 | },
805 | "id": "m6AwdZAJ3ZIC",
806 | "outputId": "b815e5cf-efed-4433-ac37-a6a2ce3ae3db"
807 | },
808 | "outputs": [],
809 | "source": [
810 | "# Import model\n",
811 | "from sklearn.linear_model import LinearRegression\n",
812 | "\n",
813 | "# Creating instance of the model\n",
814 | "lin_reg = LinearRegression()\n",
815 | "\n",
816 | "# Pass training data to model\n",
817 | "lin_reg.fit(x_train, y_train)"
818 | ]
819 | },
820 | {
821 | "cell_type": "code",
822 | "execution_count": null,
823 | "metadata": {
824 | "id": "pVlce1oF4iPE"
825 | },
826 | "outputs": [],
827 | "source": [
828 | "# Predict\n",
829 | "y_pred_lreg = lin_reg.predict(x_test)"
830 | ]
831 | },
832 | {
833 | "cell_type": "code",
834 | "execution_count": null,
835 | "metadata": {
836 | "colab": {
837 | "base_uri": "https://localhost:8080/",
838 | "height": 681
839 | },
840 | "id": "votzTgnF7s1k",
841 | "outputId": "000d6111-0806-4c89-9425-7b35893d860c"
842 | },
843 | "outputs": [],
844 | "source": [
845 | "# Convert y_test and y_pred to 1-dimensional arrays using .flatten()\n",
846 | "y_test_1d = y_test.flatten()\n",
847 | "y_pred_1d = y_pred_lreg.flatten()\n",
848 | "\n",
849 | "# Plot the scatter plot and the ideal line\n",
850 | "sns.scatterplot(x=y_test_1d, y=y_pred_1d, color='blue', label='Actual Data points')\n",
851 | "plt.plot([min(y_test_1d), max(y_test_1d)], [min(y_test_1d), max(y_test_1d)], color='red', label='Ideal Line')\n",
852 | "plt.legend()\n",
853 | "plt.show()"
854 | ]
855 | },
856 | {
857 | "cell_type": "code",
858 | "execution_count": null,
859 | "metadata": {
860 | "id": "F_w_Ixbn_KtT"
861 | },
862 | "outputs": [],
863 | "source": [
864 | "# Combine actual and predicted values side by side\n",
865 | "results = np.column_stack((y_test, y_pred_lreg))\n",
866 | "\n",
867 | "# Printing the results\n",
868 | "print(\"Actual Values | Predicted Values\")\n",
869 | "print(\"-----------------------------\")\n",
870 | "for actual, predicted in results:\n",
871 | " print(f\"{actual:14.2f} | {predicted:12.2f}\")"
872 | ]
873 | },
874 | {
875 | "cell_type": "code",
876 | "execution_count": null,
877 | "metadata": {
878 | "colab": {
879 | "base_uri": "https://localhost:8080/"
880 | },
881 | "id": "9xZjyPTZCScL",
882 | "outputId": "87ecd165-92b7-4191-bcd6-79f2e7258eaa"
883 | },
884 | "outputs": [],
885 | "source": [
886 | "# Score It\n",
887 | "from sklearn.metrics import mean_squared_error\n",
888 | "\n",
889 | "print('Linear Regression Model')\n",
890 | "# Results\n",
891 | "print('--'*30)\n",
892 | "# mean_squared_error(y_test, y_pred)\n",
893 | "mse_lreg = mean_squared_error(y_test, y_pred_lreg)\n",
894 | "rmse_lreg = np.sqrt(mse_lreg)\n",
895 | "\n",
896 | "# Print evaluation metrics\n",
897 | "print(\"Mean Squared Error:\", mse_lreg)\n",
898 | "print(\"Root Mean Squared Error:\", rmse_lreg)"
899 | ]
900 | },
901 | {
902 | "cell_type": "markdown",
903 | "metadata": {
904 | "id": "u3w_81mA6hzQ"
905 | },
906 | "source": [
907 | "## **ii. Decision tree Regressor**"
908 | ]
909 | },
910 | {
911 | "cell_type": "code",
912 | "execution_count": null,
913 | "metadata": {
914 | "id": "jZGweSEu6rWu"
915 | },
916 | "outputs": [],
917 | "source": [
918 | "# Import model\n",
919 | "from sklearn.tree import DecisionTreeRegressor\n",
920 | "\n",
921 | "# Creating instance of the model\n",
922 | "Dtr = DecisionTreeRegressor()\n",
923 | "\n",
924 | "# Pass training data to model\n",
925 | "Dtr.fit(x_train, y_train)\n",
926 | "\n",
927 | "y_pred_dtr = Dtr.predict(x_test)"
928 | ]
929 | },
930 | {
931 | "cell_type": "code",
932 | "execution_count": null,
933 | "metadata": {
934 | "colab": {
935 | "base_uri": "https://localhost:8080/"
936 | },
937 | "id": "mGKtDhFhMJlQ",
938 | "outputId": "64031e17-3e4d-4b8d-d7d8-7a91beb95012"
939 | },
940 | "outputs": [],
941 | "source": [
942 | "print('Decision Tree Regressor')\n",
943 | "# Results\n",
944 | "print('--'*30)\n",
945 | "# mean_squared_error(y_test, y_pred)\n",
946 | "mse_dtr = mean_squared_error(y_test, y_pred_dtr)\n",
947 | "rmse_dtr = np.sqrt(mse_dtr)\n",
948 | "\n",
949 | "# Print evaluation metrics\n",
950 | "print(\"Mean Squared Error:\", mse_dtr)\n",
951 | "print(\"Root Mean Squared Error:\", rmse_dtr)"
952 | ]
953 | },
954 | {
955 | "cell_type": "markdown",
956 | "metadata": {
957 | "id": "jALiAwzRHBaH"
958 | },
959 | "source": [
960 | "## **iii. Random forest Regressor**"
961 | ]
962 | },
963 | {
964 | "cell_type": "code",
965 | "execution_count": null,
966 | "metadata": {
967 | "id": "C5J10mswHIf1"
968 | },
969 | "outputs": [],
970 | "source": [
971 | "# Import model\n",
972 | "from sklearn.ensemble import RandomForestRegressor\n",
973 | "\n",
974 | "# Creating instance of the model\n",
975 | "Rfr = RandomForestRegressor()\n",
976 | "\n",
977 | "# Pass training data to model\n",
978 | "Rfr.fit(x_train, y_train)\n",
979 | "\n",
980 | "y_pred_rfr = Rfr.predict(x_test)"
981 | ]
982 | },
983 | {
984 | "cell_type": "code",
985 | "execution_count": null,
986 | "metadata": {
987 | "colab": {
988 | "base_uri": "https://localhost:8080/"
989 | },
990 | "id": "YSRkOUQJJ37R",
991 | "outputId": "ae0eedaf-de3b-43d3-a82d-f1acd9de394a"
992 | },
993 | "outputs": [],
994 | "source": [
995 | "print('Random Tree Regressor')\n",
996 | "# Results\n",
997 | "print('--'*30)\n",
998 | "# mean_squared_error(y_test, y_pred_rtr)\n",
999 | "mse_rfr = mean_squared_error(y_test, y_pred_rfr)\n",
1000 | "rmse_rfr = np.sqrt(mse_rfr)\n",
1001 | "\n",
1002 | "# Print evaluation metrics\n",
1003 | "print(\"Mean Squared Error:\", mse_rfr)\n",
1004 | "print(\"Root Mean Squared Error:\", rmse_rfr)"
1005 | ]
1006 | },
1007 | {
1008 | "cell_type": "markdown",
1009 | "metadata": {
1010 | "id": "JDWmV9ftKLFz"
1011 | },
1012 | "source": [
1013 | "## **iv. Gradient boosting Regressor**"
1014 | ]
1015 | },
1016 | {
1017 | "cell_type": "code",
1018 | "execution_count": null,
1019 | "metadata": {
1020 | "id": "aWvoRsi_KSix"
1021 | },
1022 | "outputs": [],
1023 | "source": [
1024 | "# Import model\n",
1025 | "from sklearn.ensemble import GradientBoostingRegressor\n",
1026 | "\n",
1027 | "# Creating instance of the model\n",
1028 | "Gbr = GradientBoostingRegressor()\n",
1029 | "\n",
1030 | "# Pass training data to model\n",
1031 | "Gbr.fit(x_train, y_train)\n",
1032 | "\n",
1033 | "y_pred_gbr = Gbr.predict(x_test)"
1034 | ]
1035 | },
1036 | {
1037 | "cell_type": "code",
1038 | "execution_count": null,
1039 | "metadata": {
1040 | "colab": {
1041 | "base_uri": "https://localhost:8080/"
1042 | },
1043 | "id": "4x-dD83UKw1E",
1044 | "outputId": "54958e63-bb83-47da-9684-890110b6f731"
1045 | },
1046 | "outputs": [],
1047 | "source": [
1048 | "print('Gradient Boosting Regressor')\n",
1049 | "# Results\n",
1050 | "print('--'*30)\n",
1051 | "# mean_squared_error(y_test, y_pred_rtr)\n",
1052 | "mse_gbr = mean_squared_error(y_test, y_pred_gbr)\n",
1053 | "rmse_gbr = np.sqrt(mse_gbr)\n",
1054 | "\n",
1055 | "# Print evaluation metrics\n",
1056 | "print(\"Mean Squared Error:\", mse_gbr)\n",
1057 | "print(\"Root Mean Squared Error:\", rmse_gbr)"
1058 | ]
1059 | },
1060 | {
1061 | "cell_type": "markdown",
1062 | "metadata": {
1063 | "id": "xOvUVyBaLw0e"
1064 | },
1065 | "source": [
1066 | "### **b.** You have to report the **MSE** result with the following combinations\n",
1067 | "#### **i.** Without feature scaling\n",
1068 | "#### **ii.** With only feature scaling (without target variable)\n",
1069 | "#### **iii.** With feature and target variable scaling"
1070 | ]
1071 | },
1072 | {
1073 | "cell_type": "markdown",
1074 | "metadata": {
1075 | "id": "zEOS8PSTXWjz"
1076 | },
1077 | "source": [
1078 | "## **i. Without feature scaling**"
1079 | ]
1080 | },
1081 | {
1082 | "cell_type": "code",
1083 | "execution_count": null,
1084 | "metadata": {
1085 | "colab": {
1086 | "base_uri": "https://localhost:8080/"
1087 | },
1088 | "id": "6iHJCPGbYYqb",
1089 | "outputId": "0fd51a95-fba1-47ac-fdb1-a94025e407b5"
1090 | },
1091 | "outputs": [],
1092 | "source": [
1093 | "from sklearn.linear_model import LinearRegression\n",
1094 | "from sklearn.tree import DecisionTreeRegressor\n",
1095 | "from sklearn.ensemble import RandomForestRegressor\n",
1096 | "from sklearn.ensemble import GradientBoostingRegressor\n",
1097 | "from sklearn.metrics import mean_squared_error\n",
1098 | "\n",
1099 | "from sklearn.model_selection import train_test_split\n",
1100 | "x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = .20, random_state = 101)\n",
1101 | "\n",
1102 | "models = [LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor(), GradientBoostingRegressor()]\n",
1103 | "model_names = ('Linear Regression', 'Decision Tree Regressor', 'Random Forest Regressor', 'Gradient Boosting Regressor')\n",
1104 | "\n",
1105 | "models_score = []\n",
1106 | "for model, model_name in zip(models, model_names):\n",
1107 | " model.fit(x_train, y_train)\n",
1108 | " y_pred = model.predict(x_test)\n",
1109 | " mse = mean_squared_error(y_test, y_pred)\n",
1110 | " models_score.append([model_name, mse])\n",
1111 | "\n",
1112 | "sorted_models = sorted(models_score, key=lambda x: x[1], reverse=True)\n",
1113 | "for model in sorted_models:\n",
1114 | " print(f'Model: {model[0]}, Mean Squared Error (MSE): {model[1]:.2f}')"
1115 | ]
1116 | },
1117 | {
1118 | "cell_type": "markdown",
1119 | "metadata": {
1120 | "id": "RbgFsohIcU7x"
1121 | },
1122 | "source": [
1123 | "## **ii. With only feature scaling (without target variable)**"
1124 | ]
1125 | },
1126 | {
1127 | "cell_type": "code",
1128 | "execution_count": null,
1129 | "metadata": {
1130 | "colab": {
1131 | "base_uri": "https://localhost:8080/"
1132 | },
1133 | "id": "wmYyUjZMc3Jp",
1134 | "outputId": "f349a35c-72d6-4ce3-db2a-1bf8f528c0e8"
1135 | },
1136 | "outputs": [],
1137 | "source": [
1138 | "from sklearn.linear_model import LinearRegression\n",
1139 | "from sklearn.tree import DecisionTreeRegressor\n",
1140 | "from sklearn.ensemble import RandomForestRegressor\n",
1141 | "from sklearn.ensemble import GradientBoostingRegressor\n",
1142 | "from sklearn.metrics import mean_squared_error\n",
1143 | "\n",
1144 | "from sklearn.model_selection import train_test_split\n",
1145 | "\n",
1146 | "# Assuming the features have been transformed and are stored in 'x_transform'\n",
1147 | "# If you use different transformation techniques, update the description accordingly.\n",
1148 | "x_train, x_test, y_train, y_test = train_test_split(x_transform, y, test_size=0.20, random_state=101)\n",
1149 | "\n",
1150 | "models = [LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor(), GradientBoostingRegressor()]\n",
1151 | "model_names = ('Linear Regression', 'Decision Tree Regressor', 'Random Forest Regressor', 'Gradient Boosting Regressor')\n",
1152 | "\n",
1153 | "models_score = []\n",
1154 | "for model, model_name in zip(models, model_names):\n",
1155 | " model.fit(x_train, y_train)\n",
1156 | " y_pred = model.predict(x_test)\n",
1157 | " mse = mean_squared_error(y_test, y_pred)\n",
1158 | " models_score.append([model_name, mse])\n",
1159 | "\n",
1160 | "sorted_models = sorted(models_score, key=lambda x: x[1], reverse=True)\n",
1161 | "for model in sorted_models:\n",
1162 | " print(f'Model: {model[0]}, Mean Squared Error (MSE): {model[1]:.2f}')\n",
1163 | "\n",
1164 | "print(\"Note: Features have been transformed using StandardScaler.\")"
1165 | ]
1166 | },
1167 | {
1168 | "cell_type": "markdown",
1169 | "metadata": {
1170 | "id": "OeGvnBwwegQD"
1171 | },
1172 | "source": [
1173 | "## **iii. With feature and target variable scaling**"
1174 | ]
1175 | },
1176 | {
1177 | "cell_type": "code",
1178 | "execution_count": null,
1179 | "metadata": {
1180 | "colab": {
1181 | "base_uri": "https://localhost:8080/"
1182 | },
1183 | "id": "R0LF7EaZeo_u",
1184 | "outputId": "819a9bd9-b6f7-4b44-f20d-44675c780294"
1185 | },
1186 | "outputs": [],
1187 | "source": [
1188 | "from sklearn.linear_model import LinearRegression\n",
1189 | "from sklearn.tree import DecisionTreeRegressor\n",
1190 | "from sklearn.ensemble import RandomForestRegressor\n",
1191 | "from sklearn.ensemble import GradientBoostingRegressor\n",
1192 | "from sklearn.metrics import mean_squared_error\n",
1193 | "\n",
1194 | "from sklearn.model_selection import train_test_split\n",
1195 | "\n",
1196 | "# Assuming the features have been transformed and are stored in 'x_transform'\n",
1197 | "# If you use different transformation techniques, update the description accordingly.\n",
1198 | "x_train, x_test, y_train, y_test = train_test_split(x_transform, y_transform, test_size=0.20, random_state=101)\n",
1199 | "\n",
1200 | "models = [LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor(), GradientBoostingRegressor()]\n",
1201 | "model_names = ('Linear Regression', 'Decision Tree Regressor', 'Random Forest Regressor', 'Gradient Boosting Regressor')\n",
1202 | "\n",
1203 | "models_score = []\n",
1204 | "for model, model_name in zip(models, model_names):\n",
1205 | " model.fit(x_train, y_train)\n",
1206 | " y_pred = model.predict(x_test)\n",
1207 | " mse = mean_squared_error(y_test, y_pred)\n",
1208 | " models_score.append([model_name, mse])\n",
1209 | "\n",
1210 | "sorted_models = sorted(models_score, key=lambda x: x[1], reverse=True)\n",
1211 | "for model in sorted_models:\n",
1212 | " print(f'Model: {model[0]}, Mean Squared Error (MSE): {model[1]:.2f}')\n",
1213 | "\n",
1214 | "print(\"Note: Features and target variable both have been transformed using StandardScaler.\")"
1215 | ]
1216 | },
1217 | {
1218 | "cell_type": "markdown",
1219 | "metadata": {
1220 | "id": "pBnjNedcb5R9"
1221 | },
1222 | "source": [
1223 | "## **Comparison of MSE:**\n",
1224 | "### **Without feature scaling:**\n",
1225 | "- Decision Tree Regressor (MSE): 32320110401.78\n",
1226 | "- Random Forest Regressor (MSE): 15118290670.43\n",
1227 | "- Gradient Boosting Regressor (MSE): 12408033260.39\n",
1228 | "- Linear Regression (MSE): 10100187858.86\n",
1229 | "\n",
1230 | "### **With only feature scaling:**\n",
1231 | "- Decision Tree Regressor (MSE): 32523773080.99\n",
1232 | "- Random Forest Regressor (MSE): 15173002261.80\n",
1233 | "- Gradient Boosting Regressor (MSE): 12400765586.94\n",
1234 | "- Linear Regression (MSE): 10100187858.87\n",
1235 | "\n",
1236 | "### **With feature and target variable scaling:**\n",
1237 | "- Decision Tree Regressor (MSE): 0.25\n",
1238 | "- Random Forest Regressor (MSE): 0.12\n",
1239 | "- Gradient Boosting Regressor (MSE): 0.10\n",
1240 | "- Linear Regression (MSE): 0.08"
1241 | ]
1242 | },
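  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**A note on comparing these MSEs.** MSE is scale-dependent: in experiment iii the target itself was standardized, so its MSE values (around 0.1) are in standardized units and are not directly comparable to the values of roughly 1e10 from experiments i and ii. Below is a minimal sketch (assuming `x` and `y` are the feature matrix and target defined earlier in this notebook) of how `x_transform` and `y_transform` could have been produced with `StandardScaler`, and how a scaled-target MSE maps back to the original scale."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Assumed setup: how x_transform / y_transform could have been created\n",
    "scaler_x = StandardScaler()\n",
    "x_transform = scaler_x.fit_transform(x)\n",
    "\n",
    "scaler_y = StandardScaler()\n",
    "y_transform = scaler_y.fit_transform(np.array(y).reshape(-1, 1)).ravel()\n",
    "\n",
    "# If y_scaled = (y - mean) / sigma, then MSE_original = MSE_scaled * sigma**2,\n",
    "# so a scaled MSE can be put back on the original scale for a fair comparison:\n",
    "# mse_original = mse_scaled * scaler_y.scale_[0] ** 2"
   ]
  },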
1243 | {
1244 | "cell_type": "markdown",
1245 | "metadata": {
1246 | "id": "CjZHuiTJOZBL"
1247 | },
1248 | "source": [
1249 | "### **c.** Display the ranking of different models according to their **MSE** values"
1250 | ]
1251 | },
1252 | {
1253 | "cell_type": "code",
1254 | "execution_count": null,
1255 | "metadata": {
1256 | "colab": {
1257 | "base_uri": "https://localhost:8080/"
1258 | },
1259 | "id": "QVlgj33COqDG",
1260 | "outputId": "90685330-81ab-4c34-970c-4082fd49559a"
1261 | },
1262 | "outputs": [],
1263 | "source": [
1264 | "model_scores = {\n",
1265 | " \"Linear Regression\": 0.08101725519794249,\n",
1266 | " \"Descison Tree Regressor\": 0.25137569765775214,\n",
1267 | " \"Random Forest Regressor\": 0.12042240672361741,\n",
1268 | " \"Gradient Boosting Regressor\": 0.09946292746379987\n",
1269 | "}\n",
1270 | "\n",
1271 | "# Sort the model scores in ascending order based on their values (lower values first)\n",
1272 | "sorted_scores = sorted(model_scores.items(), key=lambda x: x[1])\n",
1273 | "\n",
1274 | "# Display the ranking of the models\n",
1275 | "print(\"Model Rankings according to their MSE values:\")\n",
1276 | "for rank, (model_name, score) in enumerate(sorted_scores, start=1):\n",
1277 | " print(f\"{rank}. {model_name}: {score}\")"
1278 | ]
1279 | }
1280 | ],
1281 | "metadata": {
1282 | "colab": {
1283 | "provenance": []
1284 | },
1285 | "kernelspec": {
1286 | "display_name": "Python 3",
1287 | "name": "python3"
1288 | },
1289 | "language_info": {
1290 | "name": "python",
1291 | "version": "3.11.5"
1292 | }
1293 | },
1294 | "nbformat": 4,
1295 | "nbformat_minor": 0
1296 | }
1297 |
--------------------------------------------------------------------------------
/Class15/google_playstore_apps.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# **EDA Report: Google Play Store Apps Dataset**\n",
8 | "- ## **Author:** Asjad Ali\n",
9 | "- ### **Email:** aliasjid009@gmail.com\n",
10 | "- ### **Date:** 13/08/2023\n",
11 | "\n",
12 | "> In this Exploratory Data Analysis (EDA) report, we will examine and summarize the main characteristics of the Google Play Store Apps dataset. The dataset contains details about various applications available on the Play Store and is sourced from Kaggle.\n",
13 | "\n",
14 | "## **Dataset Overview**\n",
15 | "\n",
16 | "- **Dataset Name:** Google PlayStore Apps\n",
17 | "- **Dataset Size:** 210MB\n",
18 | "- **Number of Apps:** 10,0000+\n",
19 | "- **Data Collection Date** June 2021\n",
20 | "- **Data Collection Method** Python script(Scrapy)\n",
21 | "\n",
22 | "## **Objective**\n",
23 | "\n",
24 | "> The main objective of this project is to gain insights into customer demands and provide valuable information to developers, helping them popularize their products on the Google Play Store.\n",
25 | "\n",
26 | "## **Data Analysis**\n",
27 | "\n",
28 | "1. **Data Understanding:**\n",
29 | "\n",
30 | " > We will start by exploring the structure and contents of the dataset.\n",
31 | " > We will examine the variables, their types, and the overall data distribution.\n",
32 | "\n",
33 | "1. **Data Quality Check:**\n",
34 | "\n",
35 | " > We will identify and handle any missing values, outliers, or inconsistencies in the data.\n",
36 | " > We will assess the quality and reliability of the dataset.\n",
37 | "\n",
38 | "1. **Exploring Patterns:**\n",
39 | "\n",
40 | " > We will analyze the data to uncover patterns, trends, and correlations among variables.\n",
41 | " > We will generate visualizations and summary statistics to identify interesting insights.\n",
42 | "\n",
43 | "1. **Variable Relationships:**\n",
44 | "\n",
45 | " > We will investigate the relationships between variables.\n",
46 | " > We will measure the strength and direction of correlations and assess the impact of one variable on another.\n",
47 | "\n",
48 | "1. **Feature Selection:**\n",
49 | "\n",
50 | " > Based on our analysis, we will determine which features are most informative and relevant for predicting app popularity.\n",
51 | " > We will perform feature selection or dimensionality reduction techniques.\n",
52 | "\n",
53 | "1. **Outlier Detection:**\n",
54 | "\n",
55 | " > We will identify any outliers or anomalies in the dataset.\n",
56 | " > We will examine extreme or unexpected observations that may require further investigation.\n",
57 | "\n",
58 | "1. **Data Visualization:**\n",
59 | "\n",
60 | " > We will create visual representations such as plots, charts, or graphs to communicate our findings effectively.\n",
61 | "\n",
62 | "## **Conclusion**\n",
63 | "\n",
64 | "> Through this EDA report, we aim to gain insights, discover patterns, and uncover relationships within the Google Play Store Apps dataset. The analysis will provide valuable information to developers, enabling them to understand customer demands and popularize their applications on the Play Store.\n",
65 | "\n",
66 | "For detailed access to the dataset, please follow this [link 🔗](https://www.kaggle.com/datasets/gauthamp10/google-playstore-apps).\n",
67 | "\n",
68 | "Note: The analysis and findings presented in this report are based on the available dataset and the EDA techniques applied."
69 | ]
70 | },
71 | {
72 | "cell_type": "markdown",
73 | "metadata": {},
74 | "source": [
75 | "## **Import the libraries**"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": null,
81 | "metadata": {},
82 | "outputs": [],
83 | "source": [
84 | "import pandas as pd\n",
85 | "import numpy as np\n",
86 | "import matplotlib.pyplot as plt\n",
87 | "%matplotlib inline\n",
88 | "import seaborn as sns\n",
89 | "import warnings\n",
90 | "warnings.filterwarnings('ignore')\n",
91 | "warnings.simplefilter(action = 'ignore', category = FutureWarning)"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "## **Data Preprocessing**\n",
99 | "- Load the csv file with pandas\n",
100 | "- Creating Dataframe of the csv file and understanding the data present in the dataset\n",
101 | "- Dealing with the missing/null values"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": null,
107 | "metadata": {},
108 | "outputs": [],
109 | "source": [
110 | "df = pd.read_csv(\"../Datasets/googleplaystore.csv\")"
111 | ]
112 | },
113 | {
114 | "cell_type": "markdown",
115 | "metadata": {},
116 | "source": [
117 | "- ### **Data Composition**"
118 | ]
119 | },
120 | {
121 | "cell_type": "markdown",
122 | "metadata": {},
123 | "source": [
124 | "**View the first 5 rows of the data**"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": null,
130 | "metadata": {},
131 | "outputs": [],
132 | "source": [
133 | "df.head()"
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 | "- From here we have come to know that with which type of data we are going to deal with."
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {},
146 | "source": [
147 | "**Let's see the columns in the dataset**"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": null,
153 | "metadata": {},
154 | "outputs": [],
155 | "source": [
156 | "df.columns"
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {},
162 | "source": [
163 | "| Column Name | Description |\n",
164 | "|---------------------|-------------------------------------------------------------------------------|\n",
165 | "| **App Name** | The name or title of the application available on the Google Play Store. |\n",
166 | "| **App Id** | The unique identifier assigned to each application. |\n",
167 | "| **Category** | The category or genre to which the application belongs. |\n",
168 | "| **Rating** | The average user rating or feedback score received by the application. |\n",
169 | "| **Rating Count** | The total number of user ratings received by the application. |\n",
170 | "| **Installs** | The estimated number of times the application has been installed. |\n",
171 | "| **Minimum Installs**| The minimum number of installations required for the application to be listed on the Play Store. |\n",
172 | "| **Maximum Installs**| The maximum number of installations recorded for the application. |\n",
173 | "| **Free** | Indicates whether the application is available for free or if it has a price. |\n",
174 | "| **Price** | The price of the application, if it is not available for free. |\n",
175 | "| **Currency** | The currency in which the price is listed. |\n",
176 | "| **Size** | The size of the application in terms of storage space. |\n",
177 | "| **Minimum Android** | The minimum Android version required to run the application. |\n",
178 | "| **Developer Id** | The unique identifier assigned to the application developer. |\n",
179 | "| **Developer Website** | The website associated with the application developer. |\n",
180 | "| **Developer Email** | The email address of the application developer. |\n",
181 | "| **Released** | The date when the application was initially released. |\n",
182 | "| **Last Updated** | The date when the application was last updated. |\n",
183 | "| **Content Rating** | The age-based rating or content suitability of the application. |\n",
184 | "| **Privacy Policy** | The link to the privacy policy associated with the application. |\n",
185 | "| **Ad Supported** | Indicates whether the application contains advertisements. |\n",
186 | "| **In App Purchases** | Indicates whether the application offers in-app purchases. |\n",
187 | "| **Editors Choice** | Indicates whether the application has been selected as an editor's choice on the Play Store. |\n",
188 | "| **Scraped Time** | The date and time when the data was scraped or collected from the Play Store. |\n"
189 | ]
190 | },
191 | {
192 | "cell_type": "markdown",
193 | "metadata": {},
194 | "source": [
195 | "**Important things to know**"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": null,
201 | "metadata": {},
202 | "outputs": [],
203 | "source": [
204 | "pd.set_option('display.max_columns', None)\n",
205 | "pd.set_option('display.max_rows', None)"
206 | ]
207 | },
208 | {
209 | "cell_type": "markdown",
210 | "metadata": {},
211 | "source": [
212 | "**Shape or Dimensions of the data**"
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": null,
218 | "metadata": {},
219 | "outputs": [],
220 | "source": [
221 | "df.shape"
222 | ]
223 | },
224 | {
225 | "cell_type": "markdown",
226 | "metadata": {},
227 | "source": [
228 | "- It shows that there are 2312944 rows and 24 columns in the dataset, which means that Google playstore has more than 2.3 million applications."
229 | ]
230 | },
231 | {
232 | "cell_type": "markdown",
233 | "metadata": {},
234 | "source": [
235 | "**Let's get some more information about the data**"
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": null,
241 | "metadata": {},
242 | "outputs": [],
243 | "source": [
244 | "df.info()"
245 | ]
246 | },
247 | {
248 | "cell_type": "markdown",
249 | "metadata": {},
250 | "source": [
251 | "- It shows the number of enteries in the data and data types of the columns. Like:\n",
252 | " - 4 bool type columns\n",
253 | " - 4 float type columns\n",
254 | " - 1 integer type column\n",
255 | " - 15 object type columns -> we will see its further details onwards\n",
256 | "- Data set is using 361.8+ MB memory of the system"
257 | ]
258 | },
259 | {
260 | "cell_type": "markdown",
261 | "metadata": {},
262 | "source": [
263 | "## **Descriptive Statistics**"
264 | ]
265 | },
266 | {
267 | "cell_type": "code",
268 | "execution_count": null,
269 | "metadata": {},
270 | "outputs": [],
271 | "source": [
272 | "df.describe()"
273 | ]
274 | },
275 | {
276 | "cell_type": "markdown",
277 | "metadata": {},
278 | "source": [
279 | "- Describe function shows the summary of the data and summary is only of numeric variables. It shows, there are only these 5 columns ***Rating, Rating Count, Minumum Installs, Maximum Installs, Price*** in the whole data that are numeric\n",
280 | "- Count of ***Rating*** (2.290016e+06) and ***Rating Count*** (2.290016e+06) is less than other columns, which shows that these 2 column contain missing values\n",
281 | "- Maximum number of ***Minimum installs*** recorded for any application is 10 billion\n",
282 | "- Maximum number of ***Maximum installs*** recorded for any application is 12+ billion\n",
283 | "- Maximum ***Rating*** for any application is 5\n",
284 | "- Maximum ***Rating Count*** for any application is 138.5576 million\n",
285 | "- Maximum ***Price*** for any application is 400. But at this moment, we can say, which currecy it is."
286 | ]
287 | },
288 | {
289 | "cell_type": "markdown",
290 | "metadata": {},
291 | "source": [
292 | "**To see the entire column, we can use pandas' set_option() function**"
293 | ]
294 | },
295 | {
296 | "cell_type": "code",
297 | "execution_count": null,
298 | "metadata": {},
299 | "outputs": [],
300 | "source": [
301 | "pd.set_option('display.max_columns', None)"
302 | ]
303 | },
304 | {
305 | "cell_type": "markdown",
306 | "metadata": {},
307 | "source": [
308 | "## **Missing Values**"
309 | ]
310 | },
311 | {
312 | "cell_type": "code",
313 | "execution_count": null,
314 | "metadata": {},
315 | "outputs": [],
316 | "source": [
317 | "missing_values = df.isnull().sum().sort_values(ascending=False)\n",
318 | "print(missing_values)"
319 | ]
320 | },
321 | {
322 | "cell_type": "markdown",
323 | "metadata": {},
324 | "source": [
325 | "- It means that ***Developer Websites*** contain highest number of missing values that is 760835\n",
326 | "- It also shows that most of the App Developers do not provide its ***Website Address***, ***Privacy Policy*** of the app that is very important and also ***Released date*** of the application.\n",
327 | "- Missing values of the ***Minimum Android*** shows that, they don't tell which version of android is suitable for application.\n",
328 | "- Missing vakues of ***Size*** shows that, they don,t provide details about application size."
329 | ]
330 | },
331 | {
332 | "cell_type": "markdown",
333 | "metadata": {},
334 | "source": [
335 | "### **Visualizing Missing/Null Values**"
336 | ]
337 | },
338 | {
339 | "cell_type": "code",
340 | "execution_count": null,
341 | "metadata": {},
342 | "outputs": [],
343 | "source": [
344 | "import matplotlib\n",
345 | "matplotlib.rcParams['figure.figsize'] = (20,10)\n",
346 | "sns.heatmap(df.isnull(), xticklabels=False, cbar=False, cmap='viridis')\n",
347 | "plt.title('Number of Missing Values')"
348 | ]
349 | },
350 | {
351 | "cell_type": "markdown",
352 | "metadata": {},
353 | "source": [
354 | "- It is easy to visualize missing values. The yellow lines are indicating the missing vakues in each column."
355 | ]
356 | },
357 | {
358 | "cell_type": "markdown",
359 | "metadata": {},
360 | "source": [
361 | "**Percentage of missing values**"
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "execution_count": null,
367 | "metadata": {},
368 | "outputs": [],
369 | "source": [
370 | "missing_percentage = df.isnull().sum().sort_values(ascending=False)/len(df)*100\n",
371 | "print(missing_percentage)"
372 | ]
373 | },
374 | {
375 | "cell_type": "markdown",
376 | "metadata": {},
377 | "source": [
378 | "- Here, we can see the percentage of missing data in the columns."
379 | ]
380 | },
381 | {
382 | "cell_type": "markdown",
383 | "metadata": {},
384 | "source": [
385 | "**Visualizing Percenage of Missing Values**"
386 | ]
387 | },
388 | {
389 | "cell_type": "code",
390 | "execution_count": null,
391 | "metadata": {},
392 | "outputs": [],
393 | "source": [
394 | "missing_percentage = missing_percentage[missing_percentage != 0]\n",
395 | "import matplotlib\n",
396 | "matplotlib.rcParams['figure.figsize'] = (20, 8)\n",
397 | "sns.barplot(x=missing_percentage, y=missing_percentage.index)\n",
398 | "plt.xticks(rotation=90)\n",
399 | "plt.title('Percentage of Missing Values')"
400 | ]
401 | },
402 | {
403 | "cell_type": "markdown",
404 | "metadata": {},
405 | "source": [
406 | "- Here, it is more easy to visualize the percentage of missing values in the data. \n",
407 | "- The column haveing highest percentage of missing values are:\n",
408 | " - ***Developer Website***\n",
409 | " - ***Privay Policy***\n",
410 | "- If we want, we can drop these columns from the dataset or we can drop rows that contain missing values because we cannot impute missing values in these columns because they are very specific/unique to applications\n",
411 | "- We can impute or drop small null values for the columns like,\n",
412 | " - Size\n",
413 | " - Installs\n",
414 | " - Currency\n",
415 | " - Minimum Installs\n",
416 | " - App Name\n",
417 | " - Developer Id\n",
418 | " - Developer Email\n",
419 | "- We can impute null values for the following columns, becuase they are important features:\n",
420 | " - Released\n",
421 | " - Rating\n",
422 | " - Minimum Android\n",
423 | " - Rating Count\n",
424 | "- Imputation of missing values depends upon the end purpose of data and we normally oerform imputation when we have small amount of data"
425 | ]
426 | },
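  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One plausible imputation sketch for those columns (assumed choices, not the only reasonable ones; ***Released*** is handled separately later in the notebook):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hedged sketch, shown without modifying df: median for Rating, 0 for\n",
    "# Rating Count (no ratings recorded), mode for Minimum Android\n",
    "rating_imputed = df['Rating'].fillna(df['Rating'].median())\n",
    "rating_count_imputed = df['Rating Count'].fillna(0)\n",
    "min_android_imputed = df['Minimum Android'].fillna(df['Minimum Android'].mode()[0])"
   ]
  },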
427 | {
428 | "cell_type": "markdown",
429 | "metadata": {},
430 | "source": [
431 | "**Droppping Missing Values from the Data**"
432 | ]
433 | },
434 | {
435 | "cell_type": "code",
436 | "execution_count": null,
437 | "metadata": {},
438 | "outputs": [],
439 | "source": [
440 | "df.dropna(subset=['Size', 'Installs', 'Currency', 'Developer Id', 'Developer Email', 'App Name', 'Minimum Installs'], inplace=True)"
441 | ]
442 | },
443 | {
444 | "cell_type": "markdown",
445 | "metadata": {},
446 | "source": [
447 | "- Here, we have dropped rows that contain missing values from the dataset to clean our data."
448 | ]
449 | },
450 | {
451 | "cell_type": "code",
452 | "execution_count": null,
453 | "metadata": {},
454 | "outputs": [],
455 | "source": [
456 | "df.isnull().sum().sort_values(ascending=False)"
457 | ]
458 | },
459 | {
460 | "cell_type": "markdown",
461 | "metadata": {},
462 | "source": [
463 | "- Here, we can see that missing values are removed from most of the columns and only those columns are left behind where we want to impute missing values."
464 | ]
465 | },
466 | {
467 | "cell_type": "markdown",
468 | "metadata": {},
469 | "source": [
470 | "## **Data Cleaning**"
471 | ]
472 | },
473 | {
474 | "cell_type": "markdown",
475 | "metadata": {},
476 | "source": [
477 | "**Let's check the duplicated values in the `App Name` column and we see duplicated data in columns that are unique**"
478 | ]
479 | },
480 | {
481 | "cell_type": "code",
482 | "execution_count": null,
483 | "metadata": {},
484 | "outputs": [],
485 | "source": [
486 | "df['App Name'].duplicated().any()"
487 | ]
488 | },
489 | {
490 | "cell_type": "markdown",
491 | "metadata": {},
492 | "source": [
493 | "- Here, `True` shows that there are duplicated values in ***App Name*** column means that there are more than one Apps with same name."
494 | ]
495 | },
496 | {
497 | "cell_type": "markdown",
498 | "metadata": {},
499 | "source": [
500 | "**Now, let's see which rows are duplicated**"
501 | ]
502 | },
503 | {
504 | "cell_type": "code",
505 | "execution_count": null,
506 | "metadata": {},
507 | "outputs": [],
508 | "source": [
509 | "df['App Name'].value_counts()"
510 | ]
511 | },
512 | {
513 | "cell_type": "markdown",
514 | "metadata": {},
515 | "source": [
516 | "- Here we can see that there are many duplicated rows of Apps in this column."
517 | ]
518 | },
519 | {
520 | "cell_type": "markdown",
521 | "metadata": {},
522 | "source": [
523 | "**Before removing the duplicated values, let's check are they actually duplicated or not**"
524 | ]
525 | },
526 | {
527 | "cell_type": "code",
528 | "execution_count": null,
529 | "metadata": {},
530 | "outputs": [],
531 | "source": [
532 | "df[df['App Name'] == 'Tic Tac Toe']"
533 | ]
534 | },
535 | {
536 | "cell_type": "markdown",
537 | "metadata": {},
538 | "source": [
539 | "- Here we can see that they are not actually duplicated because name is same but other features like App Id, Category, Installs, Rating etc are different, that shows they are not actually duplicated."
540 | ]
541 | },
542 | {
543 | "cell_type": "markdown",
544 | "metadata": {},
545 | "source": [
546 | "**On the base of App Id we can check if there are duplicated rows in data or not**"
547 | ]
548 | },
549 | {
550 | "cell_type": "code",
551 | "execution_count": null,
552 | "metadata": {},
553 | "outputs": [],
554 | "source": [
555 | "df['App Id'].value_counts()"
556 | ]
557 | },
558 | {
559 | "cell_type": "markdown",
560 | "metadata": {},
561 | "source": [
562 | "- From the above output, the value counts for each ***App Id*** is 1. So, we have concluded that there are Apps with same name but they are different based on App Id's. So, no duplicated rows in the data."
563 | ]
564 | },
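  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rather than scanning the `value_counts()` output by eye, we can assert uniqueness directly; a one-line check:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# True confirms that every App Id occurs exactly once\n",
    "df['App Id'].is_unique"
   ]
  },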
565 | {
566 | "cell_type": "markdown",
567 | "metadata": {},
568 | "source": [
569 | "### **Explore Different Variables**"
570 | ]
571 | },
572 | {
573 | "cell_type": "markdown",
574 | "metadata": {},
575 | "source": [
576 | "1. Install"
577 | ]
578 | },
579 | {
580 | "cell_type": "code",
581 | "execution_count": null,
582 | "metadata": {},
583 | "outputs": [],
584 | "source": [
585 | "df['Installs'].unique()"
586 | ]
587 | },
588 | {
589 | "cell_type": "markdown",
590 | "metadata": {},
591 | "source": [
592 | "- It shows the number of installations of different appps.\n",
593 | "- Highest number of installations recorded for any app is 1 billion+\n",
594 | "- Lowest number of installations are 0+\n",
595 | "- Here `+` means that insallation of any app may be in process during scraping so it will not be counted."
596 | ]
597 | },
598 | {
599 | "cell_type": "markdown",
600 | "metadata": {},
601 | "source": [
602 | "**Convert `Installs` from `object` to `int` datatype**\n",
603 | "> As we discussed before, object dtypes will be dealed later on. So here we are dealing with it and also with commmas(,) and plus(+)"
604 | ]
605 | },
606 | {
607 | "cell_type": "code",
608 | "execution_count": null,
609 | "metadata": {},
610 | "outputs": [],
611 | "source": [
612 | "df['Installs'] = df['Installs'].str.split('+').str[0]\n",
613 | "df['Installs'].replace(',', '', regex=True, inplace=True)\n",
614 | "df['Installs'] = df['Installs'].astype(np.int64)"
615 | ]
616 | },
617 | {
618 | "cell_type": "code",
619 | "execution_count": null,
620 | "metadata": {},
621 | "outputs": [],
622 | "source": [
623 | "df['Installs'].unique()"
624 | ]
625 | },
626 | {
627 | "cell_type": "markdown",
628 | "metadata": {},
629 | "source": [
630 | "- As we can see from the output, comma(,) and plus(+) have removed from the install values and we have converted its type from object to int."
631 | ]
632 | },
633 | {
634 | "cell_type": "markdown",
635 | "metadata": {},
636 | "source": [
637 | "2. Currency"
638 | ]
639 | },
640 | {
641 | "cell_type": "code",
642 | "execution_count": null,
643 | "metadata": {},
644 | "outputs": [],
645 | "source": [
646 | "df['Currency'].unique()"
647 | ]
648 | },
649 | {
650 | "cell_type": "markdown",
651 | "metadata": {},
652 | "source": [
653 | "- Here, we can see from the output that which currencies are acceptable in Google playstore.\n",
654 | "- List of currencies acceptable in Google playstore:\n",
655 | " - **USD:** United States Dollar\n",
656 | " - **XXX:** This is often used as a placeholder or code for transactions involving no specific currency.\n",
657 | " - **CAD:** Canadian Dollar\n",
658 | " - **EUR:** Euro (used by many countries in the European Union)\n",
659 | " - **INR:** Indian Rupee\n",
660 | " - **VND:** Vietnamese Dong\n",
661 | " - **GBP:** British Pound Sterling\n",
662 | " - **BRL:** Brazilian Real\n",
663 | " - **KRW:** South Korean Won\n",
664 | " - **TRY:** Turkish Lira\n",
665 | " - **RUB:** Russian Ruble\n",
666 | " - **SGD:** Singapore Dollar\n",
667 | " - **AUD:** Australian Dollar\n",
668 | " - **PKR:** Pakistani Rupee\n",
669 | " - **ZAR:** South African Rand"
670 | ]
671 | },
672 | {
673 | "cell_type": "markdown",
674 | "metadata": {},
675 | "source": [
676 | "3. Size"
677 | ]
678 | },
679 | {
680 | "cell_type": "code",
681 | "execution_count": null,
682 | "metadata": {},
683 | "outputs": [],
684 | "source": [
685 | "df['Size'].unique()"
686 | ]
687 | },
688 | {
689 | "cell_type": "markdown",
690 | "metadata": {},
691 | "source": [
692 | "- From the above output, we can see that the size of the App can be in GB, MB and KB."
693 | ]
694 | },
695 | {
696 | "cell_type": "markdown",
697 | "metadata": {},
698 | "source": [
699 | "**Let's convert App size in MB**"
700 | ]
701 | },
702 | {
703 | "cell_type": "code",
704 | "execution_count": null,
705 | "metadata": {},
706 | "outputs": [],
707 | "source": [
708 | "df['Size'] = df['Size'].apply(lambda x: str(x).replace('M', '') if 'M' in str(x) else x)"
709 | ]
710 | },
711 | {
712 | "cell_type": "markdown",
713 | "metadata": {},
714 | "source": [
715 | "- Here, we have firstly remove `M` from ***Size*** column' each value. For example, if it was firstly 10M, now its only 10."
716 | ]
717 | },
718 | {
719 | "cell_type": "code",
720 | "execution_count": null,
721 | "metadata": {},
722 | "outputs": [],
723 | "source": [
724 | "df['Size'] = df['Size'].apply(lambda x: float(str(x).replace('k', ''))/1024 if 'k' in str(x) else x)"
725 | ]
726 | },
727 | {
728 | "cell_type": "markdown",
729 | "metadata": {},
730 | "source": [
731 | "- Here we have mismatched value with the data. We have got 1,018 which is causing error. So we have to remove this value from data or convert this comma(,) into dot(.) comsidering it an incorrect value in the **Size** column"
732 | ]
733 | },
734 | {
735 | "cell_type": "code",
736 | "execution_count": null,
737 | "metadata": {},
738 | "outputs": [],
739 | "source": [
740 | "df['Size'] = df['Size'].apply(lambda x: str(x).replace(',', '.') if ',' in str(x) else x)"
741 | ]
742 | },
743 | {
744 | "cell_type": "markdown",
745 | "metadata": {},
746 | "source": [
747 | "- After this, we have to run the above cell again. We are basically converting *kbs* into *MB*"
748 | ]
749 | },
750 | {
751 | "cell_type": "markdown",
752 | "metadata": {},
753 | "source": [
754 | "**Again convert `Size` to float**"
755 | ]
756 | },
757 | {
758 | "cell_type": "markdown",
759 | "metadata": {},
760 | "source": [
761 | "- Here we are again encountering a problem that ***App Size*** varies with device. So, you may drop them or replace it with 0. Here, I am assuming it as 0."
762 | ]
763 | },
764 | {
765 | "cell_type": "code",
766 | "execution_count": null,
767 | "metadata": {},
768 | "outputs": [],
769 | "source": [
770 | "df['Size'] = df['Size'].apply(lambda x:str(x).replace('Varies with device', '0') if 'Varies with device' in str(x) else x)"
771 | ]
772 | },
773 | {
774 | "cell_type": "markdown",
775 | "metadata": {},
776 | "source": [
777 | "- Here we have replace `Varies with device` columns with `0`"
778 | ]
779 | },
780 | {
781 | "cell_type": "markdown",
782 | "metadata": {},
783 | "source": [
784 | "**Now let's convert `Size` to float**"
785 | ]
786 | },
787 | {
788 | "cell_type": "code",
789 | "execution_count": null,
790 | "metadata": {},
791 | "outputs": [],
792 | "source": [
793 | "# df['Size'] = df['Size'].apply(lambda x:float(x))"
794 | ]
795 | },
796 | {
797 | "cell_type": "markdown",
798 | "metadata": {},
799 | "source": [
800 | "- Here we are again encountering an issue that there are values in GBs like 1.5G"
801 | ]
802 | },
803 | {
804 | "cell_type": "markdown",
805 | "metadata": {},
806 | "source": [
807 | "**Convert GBs to MBs**"
808 | ]
809 | },
810 | {
811 | "cell_type": "code",
812 | "execution_count": null,
813 | "metadata": {},
814 | "outputs": [],
815 | "source": [
816 | "df['Size'] = df['Size'].apply(lambda x: float(str(x).replace('G', ''))*1024 if 'G' in str(x) else x)"
817 | ]
818 | },
819 | {
820 | "cell_type": "markdown",
821 | "metadata": {},
822 | "source": [
823 | "- Now we have converted the data in GBs to MBs."
824 | ]
825 | },
826 | {
827 | "cell_type": "markdown",
828 | "metadata": {},
829 | "source": [
830 | "**Now convert `Size` to float**"
831 | ]
832 | },
833 | {
834 | "cell_type": "code",
835 | "execution_count": null,
836 | "metadata": {},
837 | "outputs": [],
838 | "source": [
839 | "df['Size'] = df['Size'].apply(lambda x:float(x))"
840 | ]
841 | },
842 | {
843 | "cell_type": "code",
844 | "execution_count": null,
845 | "metadata": {},
846 | "outputs": [],
847 | "source": [
848 | "df.dtypes['Size']"
849 | ]
850 | },
851 | {
852 | "cell_type": "code",
853 | "execution_count": null,
854 | "metadata": {},
855 | "outputs": [],
856 | "source": [
857 | "print(max(df['Size']))"
858 | ]
859 | },
860 | {
861 | "cell_type": "markdown",
862 | "metadata": {},
863 | "source": [
864 | "- Conlusion from ***Size*** column\n",
865 | " - We convert GBs, MBs and KBs to MBs to get a clear overview.\n",
866 | " - The largest size of any app in MBs is 1536, other than that applications that Varies with device."
867 | ]
868 | },
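  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the whole cleanup above can be expressed as a single function; a sketch that assumes the raw string values (`10M`, `512k`, `1.5G`, `1,018`, `Varies with device`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def size_to_mb(raw):\n",
    "    # mirrors the steps above: fix stray commas, treat 'Varies with\n",
    "    # device' as 0, and convert k/M/G suffixes to megabytes\n",
    "    s = str(raw).replace(',', '.')\n",
    "    if s == 'Varies with device':\n",
    "        return 0.0\n",
    "    if s.endswith('k'):\n",
    "        return float(s[:-1]) / 1024\n",
    "    if s.endswith('M'):\n",
    "        return float(s[:-1])\n",
    "    if s.endswith('G'):\n",
    "        return float(s[:-1]) * 1024\n",
    "    return float(s)\n",
    "\n",
    "# e.g. df['Size'] = df['Size'].apply(size_to_mb) on the raw column"
   ]
  },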
869 | {
870 | "cell_type": "markdown",
871 | "metadata": {},
872 | "source": [
873 | "4. Minumum Android"
874 | ]
875 | },
876 | {
877 | "cell_type": "code",
878 | "execution_count": null,
879 | "metadata": {},
880 | "outputs": [],
881 | "source": [
882 | "df['Minimum Android'].unique()"
883 | ]
884 | },
885 | {
886 | "cell_type": "markdown",
887 | "metadata": {},
888 | "source": [
889 | "- It shows different ***Minimum Android*** version required to run an playstore application. As the minimum version let's say is 4.1 then there is no purpose to write `and up` with, that's why we can remove it."
890 | ]
891 | },
892 | {
893 | "cell_type": "markdown",
894 | "metadata": {},
895 | "source": [
896 | "**Removing ` and up` from Minimum Andriod Values**"
897 | ]
898 | },
899 | {
900 | "cell_type": "code",
901 | "execution_count": null,
902 | "metadata": {},
903 | "outputs": [],
904 | "source": [
905 | "df['Minimum Android'] = df['Minimum Android'].str.replace(' and up', '')"
906 | ]
907 | },
908 | {
909 | "cell_type": "code",
910 | "execution_count": null,
911 | "metadata": {},
912 | "outputs": [],
913 | "source": [
914 | "df['Minimum Android'].unique()"
915 | ]
916 | },
917 | {
918 | "cell_type": "markdown",
919 | "metadata": {},
920 | "source": [
921 | "- Also there is `W` written with some values. It means, Android version that is specifically designed for wearable devices. For example, if there is value `4.4W` is a wearable API level that was released prior to Android 5.0 (Lollipop) as an update of the API to include the Android Wear APIs. This version of Android is exclusive to smartwatches and other wearable devices. It is important to note that \"4.4W\" is not the same as the regular Android 4.4 version, which is designed for smartphones and tablets.\n",
922 | "- It means there are different values in it, not only versions but also wearables"
923 | ]
924 | },
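  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small sketch of one assumed normalization (the helper `min_android_version` is hypothetical) that extracts a numeric major.minor version from these mixed values, mapping entries such as `4.4W` to 4.4 and non-numeric entries to NaN:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "def min_android_version(value):\n",
    "    # keep only the leading major.minor number; '4.4W' -> 4.4,\n",
    "    # 'Varies with device' and NaN -> NaN\n",
    "    match = re.match(r'\\d+(\\.\\d+)?', str(value))\n",
    "    return float(match.group(0)) if match else np.nan\n",
    "\n",
    "df['Minimum Android'].apply(min_android_version).unique()"
   ]
  },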
925 | {
926 | "cell_type": "markdown",
927 | "metadata": {},
928 | "source": [
929 | "5. Content Rating"
930 | ]
931 | },
932 | {
933 | "cell_type": "code",
934 | "execution_count": null,
935 | "metadata": {},
936 | "outputs": [],
937 | "source": [
938 | "df['Content Rating'].unique()"
939 | ]
940 | },
941 | {
942 | "cell_type": "markdown",
943 | "metadata": {},
944 | "source": [
945 | "- It shows the that for which age group, specific apps are designed according to its content. \n",
946 | "- Here:\n",
947 | " - Everyone -> Content is generally suitable for all ages. May contain minimal cartoon, fantasy or mild violence and/or infrequent use of mild language.\n",
948 | " - Teen -> Content is generally suitable for ages 13 and up. May contain violence, suggestive themes, crude humor, minimal blood, simulated gambling, and/or infrequent use of strong language.\n",
949 | " - Mature 17+ -> Content is generally suitable for ages 17 and up. May contain intense violence, blood and gore, sexual content, and/or strong language.\n",
950 | " - Everyone 10+ -> Content is generally suitable for ages 10 and up. May contain more cartoon, fantasy or mild violence, mild language and/or minimal suggestive themes.\n",
951 | " - Adults only 18+ -> Content is suitable only for adults. May include graphic depictions of sex and/or violence.\n",
952 | " - Unrated -> Indicates possible exposure to unfiltered/uncensored user-generated content, including user-to-user communications and media sharing via online platforms."
953 | ]
954 | },
955 | {
956 | "cell_type": "markdown",
957 | "metadata": {},
958 | "source": [
959 | "6. Released"
960 | ]
961 | },
962 | {
963 | "cell_type": "code",
964 | "execution_count": null,
965 | "metadata": {},
966 | "outputs": [],
967 | "source": [
968 | "df['Released'].unique()"
969 | ]
970 | },
971 | {
972 | "cell_type": "markdown",
973 | "metadata": {},
974 | "source": [
975 | "- From here, we can see that how long ago an app was released and in which year most number of apps were released."
976 | ]
977 | },
978 | {
979 | "cell_type": "markdown",
980 | "metadata": {},
981 | "source": [
982 | "**Imputing null values in `Released`**"
983 | ]
984 | },
985 | {
986 | "cell_type": "code",
987 | "execution_count": null,
988 | "metadata": {},
989 | "outputs": [],
990 | "source": [
991 | "# imoute null values here and check the year with most number of apps released"
992 | ]
993 | },
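  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of one assumed approach: parse the dates, fill missing ones with the most common release date, and count releases per year."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Shown on a copy so df itself is left unchanged\n",
    "released = pd.to_datetime(df['Released'], errors='coerce')\n",
    "released = released.fillna(released.mode()[0])\n",
    "print(released.dt.year.value_counts().head())  # year with most releases first"
   ]
  },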
994 | {
995 | "cell_type": "markdown",
996 | "metadata": {},
997 | "source": [
998 | "7. Last Updated"
999 | ]
1000 | },
1001 | {
1002 | "cell_type": "code",
1003 | "execution_count": null,
1004 | "metadata": {},
1005 | "outputs": [],
1006 | "source": [
1007 | "df['Last Updated']"
1008 | ]
1009 | },
1010 | {
1011 | "cell_type": "markdown",
1012 | "metadata": {},
1013 | "source": [
1014 | "- It shows the last time an app was updated. Here we can calculate when a specific app was released and afterwards it was updated or not."
1015 | ]
1016 | },
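  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small sketch (assumed approach) that compares ***Released*** with ***Last Updated*** to flag apps that were never updated after release:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "released = pd.to_datetime(df['Released'], errors='coerce')\n",
    "updated = pd.to_datetime(df['Last Updated'], errors='coerce')\n",
    "never_updated = updated <= released  # True when no update after release\n",
    "print(never_updated.value_counts())"
   ]
  },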
1017 | {
1018 | "cell_type": "markdown",
1019 | "metadata": {},
1020 | "source": [
1021 | "8. Privacy Policy"
1022 | ]
1023 | },
1024 | {
1025 | "cell_type": "code",
1026 | "execution_count": null,
1027 | "metadata": {},
1028 | "outputs": [],
1029 | "source": [
1030 | "df['Privacy Policy']"
1031 | ]
1032 | },
1033 | {
1034 | "cell_type": "markdown",
1035 | "metadata": {},
1036 | "source": [
1037 | "- From here we can read the privacy policy of applications and it is very important to give because it tells us which permissions an application access while installing in our device."
1038 | ]
1039 | },
1040 | {
1041 | "cell_type": "markdown",
1042 | "metadata": {},
1043 | "source": [
1044 | "9. Scraped Time"
1045 | ]
1046 | },
1047 | {
1048 | "cell_type": "code",
1049 | "execution_count": null,
1050 | "metadata": {},
1051 | "outputs": [],
1052 | "source": [
1053 | "df['Scraped Time']"
1054 | ]
1055 | },
1056 | {
1057 | "cell_type": "markdown",
1058 | "metadata": {},
1059 | "source": [
1060 | "- From here we can see when the data for a particular app was scraped and we can also calculate how much time it take and also when most data was scraped and it also tells us that we do not have any further updates available if made after the data scraped."
1061 | ]
1062 | },
1063 | {
1064 | "cell_type": "markdown",
1065 | "metadata": {},
1066 | "source": [
1067 | "10. Free"
1068 | ]
1069 | },
1070 | {
1071 | "cell_type": "code",
1072 | "execution_count": null,
1073 | "metadata": {},
1074 | "outputs": [],
1075 | "source": [
1076 | "df['Free']"
1077 | ]
1078 | },
1079 | {
1080 | "cell_type": "markdown",
1081 | "metadata": {},
1082 | "source": [
1083 | "- This column shows which apps are free and which apps are paid with True and False respectively. We can convert it to paid and free from True and False for easy understanding."
1084 | ]
1085 | },
1086 | {
1087 | "cell_type": "markdown",
1088 | "metadata": {},
1089 | "source": [
1090 | "**Paid and Free Apps**"
1091 | ]
1092 | },
1093 | {
1094 | "cell_type": "code",
1095 | "execution_count": null,
1096 | "metadata": {},
1097 | "outputs": [],
1098 | "source": [
1099 | "df['Type'] = np.where(df['Free']==True, 'Free', 'Paid')\n",
1100 | "df.drop(['Free'], axis=1, inplace=True)"
1101 | ]
1102 | },
1103 | {
1104 | "cell_type": "code",
1105 | "execution_count": null,
1106 | "metadata": {},
1107 | "outputs": [],
1108 | "source": [
1109 | "df['Type']"
1110 | ]
1111 | },
1112 | {
1113 | "cell_type": "code",
1114 | "execution_count": null,
1115 | "metadata": {},
1116 | "outputs": [],
1117 | "source": [
1118 | "num_free_apps = len(df[df['Type'] == 'Free'])\n",
1119 | "print(num_free_apps)"
1120 | ]
1121 | },
1122 | {
1123 | "cell_type": "markdown",
1124 | "metadata": {},
1125 | "source": [
1126 | "- From here we can easily see the Free/Paid Apps rather than True/False.\n",
1127 | "- Number of free apps on Google playstore greater than number of paid apps that is 2267616\n",
1128 | "- We can also check whether Free apps have more installations or Paid apps."
1129 | ]
1130 | },
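  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick sketch to answer that question, comparing total and average installs by app type:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(df.groupby('Type')['Installs'].sum())   # total installs: free vs paid\n",
    "print(df.groupby('Type')['Installs'].mean())  # average installs per app"
   ]
  },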
1131 | {
1132 | "cell_type": "markdown",
1133 | "metadata": {},
1134 | "source": [
1135 | "**Dealing with Content Rating**"
1136 | ]
1137 | },
1138 | {
1139 | "cell_type": "code",
1140 | "execution_count": null,
1141 | "metadata": {},
1142 | "outputs": [],
1143 | "source": [
1144 | "df['Content Rating'].value_counts()"
1145 | ]
1146 | },
1147 | {
1148 | "cell_type": "markdown",
1149 | "metadata": {},
1150 | "source": [
1151 | "- From here we can see that the apps that the most number of apps on Google playstore are for eveyone and least number of apps are for Adults only 18+\n",
1152 | "- From here, We can calculate which type of apps have most number of installs\n",
1153 | "- We can also use `ANOVA` to see the difference between apps type on playstore but we mostly appy it when there is very little difference between values and we do it according to the requirements of our company or stakeholder.\n",
1154 | "- Here we can make main categories to represent data for better understanding, like\n",
1155 | " - Adults only 18+ -> Adults\n",
1156 | " - Everyone 10+ -> Teen\n",
1157 | " - Unrated -> Everyone\n",
1158 | " - Mature 17+ -> Adults"
1159 | ]
1160 | },
1161 | {
1162 | "cell_type": "code",
1163 | "execution_count": null,
1164 | "metadata": {},
1165 | "outputs": [],
1166 | "source": [
1167 | "df['Content Rating'] = df['Content Rating'].replace('Unrated', 'Everyone')\n",
1168 | "df['Content Rating'] = df['Content Rating'].replace('Adults only 18+', 'Adults')\n",
1169 | "df['Content Rating'] = df['Content Rating'].replace('Mature 17+', 'Adults')\n",
1170 | "df['Content Rating'] = df['Content Rating'].replace('Everyone 10+', 'Teen')"
1171 | ]
1172 | },
1173 | {
1174 | "cell_type": "code",
1175 | "execution_count": null,
1176 | "metadata": {},
1177 | "outputs": [],
1178 | "source": [
1179 | "df['Content Rating'].unique()"
1180 | ]
1181 | },
1182 | {
1183 | "cell_type": "markdown",
1184 | "metadata": {},
1185 | "source": [
1186 | "11. Rating"
1187 | ]
1188 | },
1189 | {
1190 | "cell_type": "code",
1191 | "execution_count": null,
1192 | "metadata": {},
1193 | "outputs": [],
1194 | "source": [
1195 | "df['Rating'].unique()"
1196 | ]
1197 | },
1198 | {
1199 | "cell_type": "markdown",
1200 | "metadata": {},
1201 | "source": [
1202 | "- It shows different ratings to different apps given by users.\n",
1203 | "- Maximum rating is 5\n",
1204 | "- Here we can also calculate the number of apps with maximum rating that is 5."
1205 | ]
1206 | },
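  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A one-line sketch that counts the apps holding the maximum rating:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print((df['Rating'] == 5.0).sum())  # number of apps rated exactly 5.0"
   ]
  },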
1207 | {
1208 | "cell_type": "markdown",
1209 | "metadata": {},
1210 | "source": [
1211 | "12. Rating Count"
1212 | ]
1213 | },
1214 | {
1215 | "cell_type": "code",
1216 | "execution_count": null,
1217 | "metadata": {},
1218 | "outputs": [],
1219 | "source": [
1220 | "df['Rating Count']"
1221 | ]
1222 | },
1223 | {
1224 | "cell_type": "code",
1225 | "execution_count": null,
1226 | "metadata": {},
1227 | "outputs": [],
1228 | "source": [
1229 | "max_rating_count = max(df['Rating Count'])\n",
1230 | "max_rating_count"
1231 | ]
1232 | },
1233 | {
1234 | "cell_type": "code",
1235 | "execution_count": null,
1236 | "metadata": {},
1237 | "outputs": [],
1238 | "source": [
1239 | "max_rating_rows = df[df['Rating Count'] == max_rating_count]\n",
1240 | "\n",
1241 | "# Extract App Name and App Id from the filtered rows\n",
1242 | "app_name = max_rating_rows['App Name'].iloc[0]\n",
1243 | "app_id = max_rating_rows['App Id'].iloc[0]\n",
1244 | "print(app_name)\n",
1245 | "print(app_id)"
1246 | ]
1247 | },
1248 | {
1249 | "cell_type": "markdown",
1250 | "metadata": {},
1251 | "source": [
1252 | "- It represents the number of people who give rating to an app\n",
1253 | "- It shows the maximum number of rating count to an app are 138557570.0\n",
1254 | "- It shows that the maximum Rating count App is `Whatsapp` with App Id `com.whatsapp`\n",
1255 | "- We can also divide it into different categories for better understanding"
1256 | ]
1257 | },
1258 | {
1259 | "cell_type": "markdown",
1260 | "metadata": {},
1261 | "source": [
1262 | "**Rating Count Categories**"
1263 | ]
1264 | },
1265 | {
1266 | "cell_type": "code",
1267 | "execution_count": null,
1268 | "metadata": {},
1269 | "outputs": [],
1270 | "source": [
1271 | "df['Rating Type'] = 'NoRatingProvided'\n",
1272 | "df.loc[(df['Rating Count']>0)&(df['Rating Count']<=10000.0), 'Rating Type'] = 'Less than 10k'\n",
1273 | "df.loc[(df['Rating Count']>10000.0)&(df['Rating Count']<=500000.0), 'Rating Type'] = 'Between 10k and 500k'\n",
1274 | "df.loc[(df['Rating Count']>500000.0)&(df['Rating Count']<=138557570.0), 'Rating Type'] = 'More than 500k'\n",
1275 | "df['Rating Type'].value_counts()"
1276 | ]
1277 | },
1278 | {
1279 | "cell_type": "markdown",
1280 | "metadata": {},
1281 | "source": [
1282 | "- Here we are again tide up the data and we have converted ***Rating Count*** into different categories and sorted it according to which apps counts with Rating Count:\n",
1283 | " - Less than 10k -> 1192855\n",
1284 | " - NoRatingProvided -> 1082645\n",
1285 | " - Between 10k and 500k -> 35779\n",
1286 | " - More than 500k -> 1665"
1287 | ]
1288 | },
1289 | {
1290 | "cell_type": "markdown",
1291 | "metadata": {},
1292 | "source": [
1293 | "## **Important Questions related Data**"
1294 | ]
1295 | },
1296 | {
1297 | "cell_type": "markdown",
1298 | "metadata": {},
1299 | "source": [
1300 | "### **1. What are the top 10 categories of Apps on Google Playstore?**"
1301 | ]
1302 | },
1303 | {
1304 | "cell_type": "code",
1305 | "execution_count": null,
1306 | "metadata": {},
1307 | "outputs": [],
1308 | "source": [
1309 | "df['Category'].unique()"
1310 | ]
1311 | },
1312 | {
1313 | "cell_type": "code",
1314 | "execution_count": null,
1315 | "metadata": {},
1316 | "outputs": [],
1317 | "source": [
1318 | "top_category = df.Category.value_counts().reset_index().rename(columns={'Category':'Category', 'index':'Category'})\n",
1319 | "top_category"
1320 | ]
1321 | },
1322 | {
1323 | "cell_type": "code",
1324 | "execution_count": null,
1325 | "metadata": {},
1326 | "outputs": [],
1327 | "source": [
1328 | "top_category_installs = pd.merge(top_category, Category_installs, on='Category')\n",
1329 | "top_10_category_installs = top_category_installs.head(10).sort_values(by=['Installs'], ascending=False)\n",
1330 | "plt.figure(figsize=(16, 8))\n",
1331 | "plt.xticks(rotation=60)\n",
1332 | "plt.title('Top 10 Categories of Apps')\n",
1333 | "sns.barplot(x='Category', y='count', data=top_10_category_installs)\n",
1334 | "\n",
1335 | "plt.show()"
1336 | ]
1337 | },
1338 | {
1339 | "cell_type": "markdown",
1340 | "metadata": {},
1341 | "source": [
1342 | "- From the above dataframe, you can see that the top 10 categories are:\n",
1343 | " 1. Education -> 241090\n",
1344 | " 2. Music & Audio -> 154906\n",
1345 | " 3. Tools -> 143988\n",
1346 | " 4. Business -> 143771\n",
1347 | " 5. Entertainment -> 138276\n",
1348 | " 6. Lifestyle -> 118331\n",
1349 | " 7. Books & Reference -> 116728\n",
1350 | " 8. Personalization -> 89210\n",
1351 | " 9. Health & Fitness -> 83510\n",
1352 | " 10. Productivity -> 79698\n",
1353 | "- It also tells us that there are less number of `tools` apps available on Google playstore and this category has the largest number of innstallations. So there is opportunity for businesses to invest in this category."
1354 | ]
1355 | },
1356 | {
1357 | "cell_type": "markdown",
1358 | "metadata": {},
1359 | "source": [
1360 | "### **2. Which are the categories that are getting installed the most in top 10 categories?**"
1361 | ]
1362 | },
1363 | {
1364 | "cell_type": "code",
1365 | "execution_count": null,
1366 | "metadata": {},
1367 | "outputs": [],
1368 | "source": [
1369 | "Category_installs = df.groupby(['Category'])[['Installs']].sum()\n",
1370 | "print(Category_installs)"
1371 | ]
1372 | },
1373 | {
1374 | "cell_type": "code",
1375 | "execution_count": null,
1376 | "metadata": {},
1377 | "outputs": [],
1378 | "source": [
1379 | "top_category_installs = pd.merge(top_category, Category_installs, on='Category')\n",
1380 | "top_category_installs.head()"
1381 | ]
1382 | },
1383 | {
1384 | "cell_type": "markdown",
1385 | "metadata": {},
1386 | "source": [
1387 | "- Top Category installs:\n",
1388 | " 1. Tools -> 71440271217\n",
1389 | " 2. Entertainment -> 17108396833\n",
1390 | " 3. Music & Audio -> 14239401798\n",
1391 | " 4. Education -> 5983815847\n",
1392 | " 5. Business -> 5236661902"
1393 | ]
1394 | },
1395 | {
1396 | "cell_type": "markdown",
1397 | "metadata": {},
1398 | "source": [
1399 | "#### **Top 10 most Installed Categories**"
1400 | ]
1401 | },
1402 | {
1403 | "cell_type": "code",
1404 | "execution_count": null,
1405 | "metadata": {},
1406 | "outputs": [],
1407 | "source": [
1408 | "top_10_category_installs = top_category_installs.head(10).sort_values(by=['Installs'], ascending=False)\n",
1409 | "import matplotlib\n",
1410 | "matplotlib.rcParams['figure.figsize'] = (20,8)\n",
1411 | "plt.title(\"Top 10 most Installed Categories\")\n",
1412 | "sns.barplot(x=top_10_category_installs.Category, y=top_10_category_installs.Installs)"
1413 | ]
1414 | },
1415 | {
1416 | "cell_type": "markdown",
1417 | "metadata": {},
1418 | "source": [
1419 | "According to our analysis, these are the top 10 most installed categories:\n",
1420 | "1. Tools\n",
1421 | "2. Productivity\n",
1422 | "3. Entertainment\n",
1423 | "4. Music & Audio\n",
1424 | "5. Personalization\n",
1425 | "6. Lifestyle\n",
1426 | "7. Education\n",
1427 | "8. Business\n",
1428 | "9. Books & Reference\n",
1429 | "10. Health & Fitness"
1430 | ]
1431 | },
1432 | {
1433 | "cell_type": "code",
1434 | "execution_count": null,
1435 | "metadata": {},
1436 | "outputs": [],
1437 | "source": [
1438 | "plt.figure(figsize=(8,6))\n",
1439 | "data = df.groupby('Category')['Maximum Installs'].max().sort_values(ascending=True)\n",
1440 | "data = data.head(10)\n",
1441 | "labels = data.keys()\n",
1442 | "plt.pie(data, labels=labels, autopct='%1.1f%%')\n",
1443 | "plt.title('Top 10 Categories with Maximum Installs')\n",
1444 | "plt.show()"
1445 | ]
1446 | },
1447 | {
1448 | "cell_type": "markdown",
1449 | "metadata": {},
1450 | "source": [
1451 | "- From here we can see the maximum installed categories of applications"
1452 | ]
1453 | },
1454 | {
1455 | "cell_type": "markdown",
1456 | "metadata": {},
1457 | "source": [
1458 | "### **3. Which is the highest rated category?**"
1459 | ]
1460 | },
1461 | {
1462 | "cell_type": "code",
1463 | "execution_count": null,
1464 | "metadata": {},
1465 | "outputs": [],
1466 | "source": [
1467 | "# Filter the dataframe\n",
1468 | "filtered_df = df[df['Rating'] == 5.0]\n",
1469 | "\n",
1470 | "# Group the resulting dataframe by the Category column and calculate the mean rating for each category\n",
1471 | "grouped_df = filtered_df.groupby('Category')['Rating'].mean().reset_index()\n",
1472 | "\n",
1473 | "# Sort the resulting dataframe by the mean rating in descending order\n",
1474 | "sorted_df = grouped_df.sort_values('Rating', ascending=False)\n",
1475 | "\n",
1476 | "# Select the first row of the resulting dataframe, which will be the highest rated category\n",
1477 | "highest_rated_category = sorted_df.iloc[0]['Category']\n",
1478 | "\n",
1479 | "# Print the result\n",
1480 | "print(\"The highest rated category is:\", highest_rated_category)"
1481 | ]
1482 | },
1483 | {
1484 | "cell_type": "markdown",
1485 | "metadata": {},
1486 | "source": [
1487 | "- It shows the highest rated category that has a rating of 5.0 is ***Action***"
1488 | ]
1489 | },
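  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that after filtering to `Rating == 5.0`, the mean rating of every category in the filtered frame is exactly 5.0, so the sort above is effectively a tie-break. A sketch of an alternative reading that averages over all apps per category:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# mean rating per category across all apps, highest first\n",
    "print(df.groupby('Category')['Rating'].mean().sort_values(ascending=False).head())"
   ]
  },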
1490 | {
1491 | "cell_type": "code",
1492 | "execution_count": null,
1493 | "metadata": {},
1494 | "outputs": [],
1495 | "source": [
1496 | "plt.figure(figsize=(14,7))\n",
1497 | "plt.title(\"HIghest Rated Category\")\n",
1498 | "sns.barplot(x='Category', y='Rating', data=df)\n",
1499 | "plt.xticks(rotation=90)\n",
1500 | "plt.show()"
1501 | ]
1502 | },
1503 | {
1504 | "cell_type": "markdown",
1505 | "metadata": {},
1506 | "source": [
1507 | "- It gives you more clear picture about the Rating of different categories."
1508 | ]
1509 | },
1510 | {
1511 | "cell_type": "markdown",
1512 | "metadata": {},
1513 | "source": [
1514 | "### **4. Which Category has the highest Paid and Free apps?**"
1515 | ]
1516 | },
1517 | {
1518 | "cell_type": "code",
1519 | "execution_count": null,
1520 | "metadata": {},
1521 | "outputs": [],
1522 | "source": [
1523 | "# Filter the dataframe to include only the rows where the Type column is \"Free\" or \"Paid\"\n",
1524 | "filtered_df = df[df['Type'].isin(['Free', 'Paid'])]\n",
1525 | "\n",
1526 | "# Group the resulting dataframe by the Category column and count the number of rows for each category where the Type column is \"Free\" or \"Paid\"\n",
1527 | "grouped_df = filtered_df.groupby(['Category', 'Type']).size().reset_index(name='Count')\n",
1528 | "\n",
1529 | "# Pivot the resulting dataframe to have the categories as rows and the types as columns\n",
1530 | "pivoted_df = grouped_df.pivot(index='Category', columns='Type', values='Count').reset_index()\n",
1531 | "\n",
1532 | "# Sort the resulting dataframe by the count of free apps in descending order\n",
1533 | "sorted_free_df = pivoted_df.sort_values('Free', ascending=False)\n",
1534 | "\n",
1535 | "# Sort the resulting dataframe by the count of paid apps in descending order\n",
1536 | "sorted_paid_df = pivoted_df.sort_values('Paid', ascending=False)\n",
1537 | "\n",
1538 | "# Select the first row of the resulting dataframe for the category with the most free apps\n",
1539 | "most_free_category = sorted_free_df.iloc[0]['Category']\n",
1540 | "\n",
1541 | "# Select the first row of the resulting dataframe for the category with the highest paid apps\n",
1542 | "highest_paid_category = sorted_paid_df.iloc[0]['Category']\n",
1543 | "\n",
1544 | "# Print the results\n",
1545 | "print(\"The category with the most free apps is:\", most_free_category)\n",
1546 | "print(\"The category with the highest paid apps is:\", highest_paid_category)"
1547 | ]
1548 | },
1549 | {
1550 | "cell_type": "markdown",
1551 | "metadata": {},
1552 | "source": [
1553 | "- It shows that ***Education*** is the category with the most free and paid apps."
1554 | ]
1555 | },
1556 | {
1557 | "cell_type": "code",
1558 | "execution_count": null,
1559 | "metadata": {},
1560 | "outputs": [],
1561 | "source": [
1562 | "# Filter the dataframe to include only the rows where the Type column is \"Free\" or \"Paid\"\n",
1563 | "filtered_df = df[df['Type'].isin(['Free', 'Paid'])]\n",
1564 | "\n",
1565 | "# Group the resulting dataframe by the Category column and count the number of rows for each category where the Type column is \"Free\" or \"Paid\"\n",
1566 | "grouped_df = filtered_df.groupby(['Category', 'Type']).size().reset_index(name='Count')\n",
1567 | "\n",
1568 | "# Pivot the resulting dataframe to have the categories as rows and the types as columns\n",
1569 | "pivoted_df = grouped_df.pivot(index='Category', columns='Type', values='Count').reset_index()\n",
1570 | "\n",
1571 | "# Sort the resulting dataframe by the count of free apps in descending order\n",
1572 | "sorted_free_df = pivoted_df.sort_values('Free', ascending=False)\n",
1573 | "\n",
1574 | "# Sort the resulting dataframe by the count of paid apps in descending order\n",
1575 | "sorted_paid_df = pivoted_df.sort_values('Paid', ascending=False)\n",
1576 | "\n",
1577 | "# Print the resulting dataframes\n",
1578 | "print(\"Category-wise count of free apps:\")\n",
1579 | "print(sorted_free_df)\n",
1580 | "\n",
1581 | "print(\"\\nCategory-wise count of paid apps:\")\n",
1582 | "print(sorted_paid_df)"
1583 | ]
1584 | },
1585 | {
1586 | "cell_type": "markdown",
1587 | "metadata": {},
1588 | "source": [
1589 | "- From here we can see the exact count of paid and free apps in each category "
1590 | ]
1591 | },
1592 | {
1593 | "cell_type": "code",
1594 | "execution_count": null,
1595 | "metadata": {},
1596 | "outputs": [],
1597 | "source": [
1598 | "# Create a cross-tabulation of the \"Category\" and \"Type\" columns\n",
1599 | "ct = pd.crosstab(df['Category'], df['Type'])\n",
1600 | "\n",
1601 | "# Stack the resulting dataframe\n",
1602 | "stacked_df = ct.stack().reset_index()\n",
1603 | "\n",
1604 | "# title of the plot\n",
1605 | "plt.title('Free vs Paid Apps in All Categories')\n",
1606 | "\n",
1607 | "plt.xticks(rotation=90)\n",
1608 | "# Create a stacked bar plot of the resulting dataframe with Seaborn\n",
1609 | "sns.set_style(\"whitegrid\")\n",
1610 | "sns.barplot(x=stacked_df['Category'], y=stacked_df[0], hue=stacked_df['Type'], palette=\"rocket\")\n",
1611 | "\n",
1612 | "# Add labels and title\n",
1613 | "plt.xlabel('Category')\n",
1614 | "plt.ylabel('Count')\n",
1615 | "plt.title('Free vs Paid Apps in All Categories')\n",
1616 | "\n",
1617 | "# Show the plot\n",
1618 | "plt.show()"
1619 | ]
1620 | },
1621 | {
1622 | "cell_type": "markdown",
1623 | "metadata": {},
1624 | "source": [
1625 | "### **5. How does the size of the application impacts the installation?**"
1626 | ]
1627 | },
1628 | {
1629 | "cell_type": "code",
1630 | "execution_count": null,
1631 | "metadata": {},
1632 | "outputs": [],
1633 | "source": [
1634 | "plt.figure(figsize=(18,9))\n",
1635 | "plt.xticks(rotation=60, fontsize=9)\n",
1636 | "plt.title(\"Impact of Application size on Installations\")\n",
1637 | "sns.scatterplot(x='Size', y='Installs', hue='Type', data=df)"
1638 | ]
1639 | },
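1640 | {
1641 | "cell_type": "markdown",
1642 | "metadata": {},
1643 | "source": [
1644 | "To put a number on the pattern in the scatterplot, a minimal sketch (assuming the ***Size*** column is already numeric at this point, as its later appearance in the correlation heatmap suggests) is to compute the correlation between size and installs directly:"
1645 | ]
1646 | },
1647 | {
1648 | "cell_type": "code",
1649 | "execution_count": null,
1650 | "metadata": {},
1651 | "outputs": [],
1652 | "source": [
1653 | "# Sketch: quantify the scatterplot above with a single correlation value\n",
1654 | "# (assumes 'Size' has already been converted to a numeric type)\n",
1655 | "print(\"Correlation between Size and Installs:\", df['Size'].corr(df['Installs']))"
1656 | ]
1657 | },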
1640 | {
1641 | "cell_type": "markdown",
1642 | "metadata": {},
1643 | "source": [
1644 | "### **6. What is the impact of Content Rating on Maximum Installations?**"
1645 | ]
1646 | },
1647 | {
1648 | "cell_type": "code",
1649 | "execution_count": null,
1650 | "metadata": {},
1651 | "outputs": [],
1652 | "source": [
1653 | "plt.figure(figsize=(12,6))\n",
1654 | "sns.scatterplot(data=df, x='Maximum Installs', y='Rating Count', hue='Content Rating')\n",
1655 | "plt.title('Content Rating and Maximum Installations')"
1656 | ]
1657 | },
1658 | {
1659 | "cell_type": "markdown",
1660 | "metadata": {},
1661 | "source": [
1662 | "### **7. How many apps are available in each category?**"
1663 | ]
1664 | },
1665 | {
1666 | "cell_type": "code",
1667 | "execution_count": null,
1668 | "metadata": {},
1669 | "outputs": [],
1670 | "source": [
1671 | "plt.figure(figsize=(16, 12))\n",
1672 | "plt.xticks(rotation=90)\n",
1673 | "plt.title('Number of Apps in each category')\n",
1674 | "sns.barplot(y='Category', x='count', data=top_category_installs)\n",
1675 | "\n",
1676 | "plt.show()"
1677 | ]
1678 | },
1679 | {
1680 | "cell_type": "markdown",
1681 | "metadata": {},
1682 | "source": [
1683 | "- From here we can easily visualize the number of apps in each category.\n",
1684 | "- Top 5 categories with most number of apps\n",
1685 | " 1. Education\n",
1686 | " 2. Music & Audio\n",
1687 | " 3. Tools\n",
1688 | " 4. Business\n",
1689 | " 5. Entertainment"
1690 | ]
1691 | },
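1692 | {
1693 | "cell_type": "markdown",
1694 | "metadata": {},
1695 | "source": [
1696 | "As a quick cross-check of this ranking, a one-line sketch on the same dataframe is to count apps per category directly:"
1697 | ]
1698 | },
1699 | {
1700 | "cell_type": "code",
1701 | "execution_count": null,
1702 | "metadata": {},
1703 | "outputs": [],
1704 | "source": [
1705 | "# Sketch: count apps per category straight from the raw data and\n",
1706 | "# compare the top 5 with the bar chart above\n",
1707 | "df['Category'].value_counts().head(5)"
1708 | ]
1709 | },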
1692 | {
1693 | "cell_type": "markdown",
1694 | "metadata": {},
1695 | "source": [
1696 | "### **8. How many apps have a rating above a certain threshold (e.g., 4.0)?**"
1697 | ]
1698 | },
1699 | {
1700 | "cell_type": "code",
1701 | "execution_count": null,
1702 | "metadata": {},
1703 | "outputs": [],
1704 | "source": [
1705 | "# Filter the dataframe\n",
1706 | "filtered_df = df[df['Rating'] > 4.0]\n",
1707 | "\n",
1708 | "# Count the number of rows\n",
1709 | "total_apps_above_4_rating = len(filtered_df)\n",
1710 | "\n",
1711 | "# Print the result\n",
1712 | "print(\"Total number of apps with a rating above 4.0:\", total_apps_above_4_rating)"
1713 | ]
1714 | },
1715 | {
1716 | "cell_type": "markdown",
1717 | "metadata": {},
1718 | "source": [
1719 | "- It shows that there are 750285 Apps with rating above 4.0"
1720 | ]
1721 | },
1722 | {
1723 | "cell_type": "code",
1724 | "execution_count": null,
1725 | "metadata": {},
1726 | "outputs": [],
1727 | "source": [
1728 | "plt.figure(figsize=(12,6))\n",
1729 | "sns.kdeplot(df.Rating, color='Blue', shade=True)\n",
1730 | "plt.xlabel('Rating')\n",
1731 | "plt.ylabel('Frequency')\n",
1732 | "plt.title('Distribution of Rating')"
1733 | ]
1734 | },
1735 | {
1736 | "cell_type": "markdown",
1737 | "metadata": {},
1738 | "source": [
1739 | "- It shows that most apps have zero rating, means that most of the time app have not been rated and mostly rating is between 4 to 5. We can visualize it more clearly with the help of histplot."
1740 | ]
1741 | },
1742 | {
1743 | "cell_type": "code",
1744 | "execution_count": null,
1745 | "metadata": {},
1746 | "outputs": [],
1747 | "source": [
1748 | "plt.figure(figsize=(12,6))\n",
1749 | "sns.histplot(df.Rating, kde=True ,bins=5)\n",
1750 | "plt.title('Distribution of Rating')"
1751 | ]
1752 | },
1753 | {
1754 | "cell_type": "markdown",
1755 | "metadata": {},
1756 | "source": [
1757 | "- Here we can see more clearly that people mostly don't give ratings but if they do it is most of the time between 4 and 5"
1758 | ]
1759 | },
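1760 | {
1761 | "cell_type": "markdown",
1762 | "metadata": {},
1763 | "source": [
1764 | "We can back both observations with numbers. A minimal sketch (assuming, as noted above, that a ***Rating*** of 0 means the app has not been rated) computes the share of unrated apps and the share of rated apps falling between 4 and 5:"
1765 | ]
1766 | },
1767 | {
1768 | "cell_type": "code",
1769 | "execution_count": null,
1770 | "metadata": {},
1771 | "outputs": [],
1772 | "source": [
1773 | "# Sketch: quantify the two claims from the plots above\n",
1774 | "# (assumes a Rating of 0 means the app has not been rated)\n",
1775 | "unrated_share = (df['Rating'] == 0).mean()\n",
1776 | "rated = df.loc[df['Rating'] > 0, 'Rating']\n",
1777 | "share_4_to_5 = rated.between(4, 5).mean()\n",
1778 | "print(f\"Share of apps with no rating: {unrated_share:.1%}\")\n",
1779 | "print(f\"Share of rated apps between 4 and 5: {share_4_to_5:.1%}\")"
1780 | ]
1781 | },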
1760 | {
1761 | "cell_type": "markdown",
1762 | "metadata": {},
1763 | "source": [
1764 | "### **9. What are the top 5 Free Apps based on highest ratings and installs?**"
1765 | ]
1766 | },
1767 | {
1768 | "cell_type": "code",
1769 | "execution_count": null,
1770 | "metadata": {},
1771 | "outputs": [],
1772 | "source": [
1773 | "free_apps = df[(df.Type=='Free')&(df.Installs>=5000000)]\n",
1774 | "free_apps = free_apps.groupby('App Name')['Rating'].max().sort_values(ascending=False)\n",
1775 | "free_apps.head(5)"
1776 | ]
1777 | },
1788 | {
1789 | "cell_type": "code",
1790 | "execution_count": null,
1791 | "metadata": {},
1792 | "outputs": [],
1793 | "source": [
1794 | "plt.figure(figsize=(18,7))\n",
1795 | "plt.title(\"Top 5 Free Rated Apps\")\n",
1796 | "sns.lineplot(x=free_apps.values, y=free_apps.index, color = 'Red')"
1797 | ]
1798 | },
1799 | {
1800 | "cell_type": "markdown",
1801 | "metadata": {},
1802 | "source": [
1803 | "### **10. What are the top 5 Paid Apps based on highest ratings and installs?**"
1804 | ]
1805 | },
1806 | {
1807 | "cell_type": "code",
1808 | "execution_count": null,
1809 | "metadata": {},
1810 | "outputs": [],
1811 | "source": [
1812 | "paid_apps = df[(df.Type=='Paid')&(df.Installs>=5000000)]\n",
1813 | "paid_apps = paid_apps.groupby('App Name')['Rating'].max().sort_values(ascending=False)\n",
1814 | "paid_apps.head(5)"
1815 | ]
1816 | },
1817 | {
1818 | "cell_type": "code",
1819 | "execution_count": null,
1820 | "metadata": {},
1821 | "outputs": [],
1822 | "source": [
1823 | "plt.figure(figsize=(18,7))\n",
1824 | "plt.title(\"Top 5 Paid Rated Apps\")\n",
1825 | "sns.lineplot(x=paid_apps.values, y=paid_apps.index, color = 'Blue')"
1826 | ]
1827 | },
1828 | {
1829 | "cell_type": "code",
1830 | "execution_count": null,
1831 | "metadata": {},
1832 | "outputs": [],
1833 | "source": [
1834 | "df.head()"
1835 | ]
1836 | },
1837 | {
1838 | "cell_type": "markdown",
1839 | "metadata": {},
1840 | "source": [
1841 | "### **Heat map to see the correlation between different features**"
1842 | ]
1843 | },
1844 | {
1845 | "cell_type": "markdown",
1846 | "metadata": {},
1847 | "source": [
1848 | "In order to see the correlation, we have to drop categoricla columns from the Dataframe"
1849 | ]
1850 | },
1851 | {
1852 | "cell_type": "code",
1853 | "execution_count": null,
1854 | "metadata": {},
1855 | "outputs": [],
1856 | "source": [
1857 | "df.drop(columns=['Currency', 'Developer Id', 'Developer Email', 'Last Updated', 'Scraped Time', 'Category', 'App Id', 'Content Rating', 'Rating Type', 'App Name', 'Minimum Android', 'Released', 'Privacy Policy', 'Developer Website', 'Type'], inplace=True)"
1858 | ]
1859 | },
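1860 | {
1861 | "cell_type": "markdown",
1862 | "metadata": {},
1863 | "source": [
1864 | "A less error-prone alternative (shown here as a sketch, since the columns above have already been dropped) is to keep only the numeric and boolean columns with `select_dtypes` instead of listing every categorical column by hand:"
1865 | ]
1866 | },
1867 | {
1868 | "cell_type": "code",
1869 | "execution_count": null,
1870 | "metadata": {},
1871 | "outputs": [],
1872 | "source": [
1873 | "# Alternative sketch: select numeric (and boolean) columns automatically\n",
1874 | "# rather than naming every categorical column to drop by hand\n",
1875 | "numeric_df = df.select_dtypes(include=['number', 'bool'])\n",
1876 | "numeric_df.head()"
1877 | ]
1878 | },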
1860 | {
1861 | "cell_type": "code",
1862 | "execution_count": null,
1863 | "metadata": {},
1864 | "outputs": [],
1865 | "source": [
1866 | "df.dropna(subset=['Rating', 'Rating Count'], inplace=True)"
1867 | ]
1868 | },
1869 | {
1870 | "cell_type": "code",
1871 | "execution_count": null,
1872 | "metadata": {},
1873 | "outputs": [],
1874 | "source": [
1875 | "df.corr()"
1876 | ]
1877 | },
1878 | {
1879 | "cell_type": "code",
1880 | "execution_count": null,
1881 | "metadata": {},
1882 | "outputs": [],
1883 | "source": [
1884 | "plt.figure(figsize=(20,10))\n",
1885 | "plt.title(\"Heatmap\")\n",
1886 | "sns.heatmap(df.corr(), cbar=True, yticklabels=True, annot=True, cmap='viridis')\n",
1887 | "plt.show()"
1888 | ]
1889 | },
1890 | {
1891 | "cell_type": "markdown",
1892 | "metadata": {},
1893 | "source": [
1894 | "- It gives you the complete overview of all the important features' correlation\n",
1895 | " - There is slightly postive correlation between ***Installs*** and ***Rating Count***, means that if Rating Count increases Install will also increase.\n",
1896 | " - There is negative correlation between between ***Price*** and ***Installs***, means that if price increae installs will decrease.\n",
1897 | " - There is negative correlation between ***Size*** and ***Installs***, means that if size increase installs will decrease.\n",
1898 | " - Factors like ***Ad Support*** and ***In App Purchases*** are correlated to ***Rating***, means that if app provides customer support and subscription plans then we can engage more customers\n",
1899 | " - ***Editors Choice*** is also correlated to ***Rating Count***"
1900 | ]
1901 | },
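1902 | {
1903 | "cell_type": "markdown",
1904 | "metadata": {},
1905 | "source": [
1906 | "To read these relationships off programmatically rather than from the heatmap, a short sketch is to rank every feature by its correlation with ***Installs***:"
1907 | ]
1908 | },
1909 | {
1910 | "cell_type": "code",
1911 | "execution_count": null,
1912 | "metadata": {},
1913 | "outputs": [],
1914 | "source": [
1915 | "# Sketch: rank all features by their correlation with Installs,\n",
1916 | "# making the positive and negative relationships above explicit\n",
1917 | "df.corr()['Installs'].drop('Installs').sort_values(ascending=False)"
1918 | ]
1919 | },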
1902 | {
1903 | "cell_type": "markdown",
1904 | "metadata": {},
1905 | "source": [
1906 | "## **Conclusion:**\n",
1907 | "- Most people do not give rating but the peole who do, tend to give 4+ rating the most.\n",
1908 | "- Most of the installations are are done by the teen and the most of them are Video players and Editors.\n",
1909 | "- Size of the application varies the installations\n",
1910 | "- People have mostly installed the free apps and the availability of free apps is also high\n",
1911 | "- In App purchases are correlated to Rating count means that if apps will have subscription plans it will help to engage customers.\n",
1912 | "- Most apps available on Google playstore are of education category but most number of installations are of tools category. So there is opportunity for businesses to invest in this category. "
1913 | ]
1914 | }
1915 | ],
1916 | "metadata": {
1917 | "kernelspec": {
1918 | "display_name": "Python 3",
1919 | "language": "python",
1920 | "name": "python3"
1921 | },
1922 | "language_info": {
1923 | "codemirror_mode": {
1924 | "name": "ipython",
1925 | "version": 3
1926 | },
1927 | "file_extension": ".py",
1928 | "mimetype": "text/x-python",
1929 | "name": "python",
1930 | "nbconvert_exporter": "python",
1931 | "pygments_lexer": "ipython3",
1932 | "version": "3.11.5"
1933 | },
1934 | "orig_nbformat": 4
1935 | },
1936 | "nbformat": 4,
1937 | "nbformat_minor": 2
1938 | }
1939 |
--------------------------------------------------------------------------------