├── README.md ├── fsdr-1122.ipynb ├── heml0922-spartificial.ipynb ├── psgq-0922.ipynb ├── pvpda1122.ipynb ├── rspd-1222.ipynb ├── sdsi0922.ipynb └── spml0922.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # YouTube Academic Projects 2 | 3 | 4 | YouTube Academic Projects Series is a YouTube video series by [Spartificial](https://spartificial.com/) that provides free, certified Machine Learning student projects focusing on Space technology and Sustainable Development. 5 | 6 | Here is the [YouTube Playlist](https://youtube.com/playlist?list=PL7HQvd_RTCc3Vope7dkx4pggrH5f-uvZe) of all the project videos. Watch them for a thorough explanation. 7 | 8 | 9 | ![YouTube Academic Projects - Spartificial (3000 × 800 px)](https://user-images.githubusercontent.com/50978045/188312778-3a699a9c-b4d1-4c4c-ab28-1305bcbf18c2.png) 10 | 11 | If you have completed the tasks for any of the projects, you can submit your solution notebook through this [form](https://docs.google.com/forms/d/e/1FAIpQLSd0TiEf7SsHMS7dvnkUzUZBiXKq-0Ctv8ejjNjbubR4LHfGtg/viewform). 12 | 13 | 14 | 1. [Hunting for Exoplanet with Machine Learning](https://github.com/Spartificial/yt-acad-projs/blob/main/heml0922-spartificial.ipynb) 15 | 2. [Ship Detection from Satellite Imagery](https://github.com/Spartificial/yt-acad-projs/blob/main/sdsi0922.ipynb) 16 | 3. [Sunspots Prediction using Machine Learning](https://github.com/Spartificial/yt-acad-projs/blob/main/spml0922.ipynb) 17 | 4. [India's PV Power Potential Data Analysis](https://github.com/Spartificial/yt-acad-projs/blob/main/pvpda1122.ipynb) 18 | 5. [Predicting Stars, Galaxies & Quasars with ML Model](https://github.com/Spartificial/yt-acad-projs/blob/main/psgq-0922.ipynb) 19 | 6. [Rooftop Solar Panel Detection using Deep Learning](https://github.com/Spartificial/yt-acad-projs/blob/main/rspd-1222.ipynb) 20 | 7. 
[Fish Detection using Deep Learning](https://github.com/Spartificial/yt-acad-projs/blob/main/fsdr-1122.ipynb) 21 | -------------------------------------------------------------------------------- /heml0922-spartificial.ipynb: -------------------------------------------------------------------------------- 1 | {"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.7.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"#
Hunt for Exoplanets using Machine Learning\n\n--- \n \n**Project ID: HEML0922**\n\n**Project Name: Hunting for Exoplanet with Machine Learning**\n \n---","metadata":{"id":"-gwW9Ys1QG_o"}},{"cell_type":"markdown","source":"
","metadata":{"id":"c5BkOZJXg5tU"}},{"cell_type":"markdown","source":"#### **Let us start by [understanding exoplanets](https://docs.google.com/presentation/d/1-zjKAaiDt3gs4NkHbWsT4AzgZ-jqL6Oz/edit?usp=sharing&ouid=105423306669124151882&rtpof=true&sd=true)!**\n","metadata":{"id":"2o_ijuDVdvXH"}},{"cell_type":"markdown","source":"### In this notebook we will look into the following:- \n**1)** Explore the Exoplanet Dataset \n**2)** Handling outliers \n**3)** Understand KNN model for classification \n**4)** Implementing KNN without Balancing the data \n**5)** Handling imbalance data and then implementing KNN \n**6)** Task for you ","metadata":{"id":"18PA0Q4ZbVhJ"}},{"cell_type":"markdown","source":"","metadata":{"id":"TMTcDASYbVhL"}},{"cell_type":"markdown","source":"### Explore the Exoplanet Dataset \n","metadata":{"id":"yju-fMzzbVhM"}},{"cell_type":"markdown","source":"We will be dealing with the **[Kepler Space Telescope data](https://www.kaggle.com/datasets/keplersmachines/kepler-labelled-time-series-data)** \n\n \n\n**Steps to add the data in Kaggle notebook:-** \n**1)** Click on top right arrow of the notebook \n**2)** Click on \"Add Data\" \n**3)** Copy the link of \"dataset\" and paste it in search \n**4)** Click on \"+\" sign to add this dataset to the notebook","metadata":{"id":"587PWX5obVhP"}},{"cell_type":"markdown","source":"#### Importing the most needed libraries","metadata":{"id":"TCJOqNksfAIO"}},{"cell_type":"code","source":"#********************************************\nimport pandas as pd\nimport seaborn as sns\n#********************************************\nimport numpy as np \nimport matplotlib.pyplot as plt\n#********************************************\nimport 
warnings\nwarnings.filterwarnings('ignore')\nplt.style.use('fivethirtyeight')\n#********************************************","metadata":{"id":"kPCSBtFXbVhN","execution":{"iopub.status.busy":"2022-09-03T11:20:33.637772Z","iopub.execute_input":"2022-09-03T11:20:33.638418Z","iopub.status.idle":"2022-09-03T11:20:33.644673Z","shell.execute_reply.started":"2022-09-03T11:20:33.638391Z","shell.execute_reply":"2022-09-03T11:20:33.643445Z"},"trusted":true},"execution_count":98,"outputs":[]},{"cell_type":"markdown","source":"#### Getting into the data","metadata":{"id":"s_xhUBPbbVhQ"}},{"cell_type":"code","source":"# Let us begin with Train data\ntrain_df = pd.read_csv('../input/kepler-labelled-time-series-data/exoTrain.csv')\ntrain_df.head(10)","metadata":{"id":"8sCslEySbVhS","execution":{"iopub.status.busy":"2022-09-03T11:39:13.129230Z","iopub.execute_input":"2022-09-03T11:39:13.129617Z","iopub.status.idle":"2022-09-03T11:39:16.592203Z","shell.execute_reply.started":"2022-09-03T11:39:13.129591Z","shell.execute_reply":"2022-09-03T11:39:16.591047Z"},"trusted":true},"execution_count":163,"outputs":[]},{"cell_type":"code","source":"# Check the shape of train data\ntrain_df.shape","metadata":{"id":"FZ64b6UKbVhT","execution":{"iopub.status.busy":"2022-09-03T11:30:43.030537Z","iopub.execute_input":"2022-09-03T11:30:43.031171Z","iopub.status.idle":"2022-09-03T11:30:43.038014Z","shell.execute_reply.started":"2022-09-03T11:30:43.031135Z","shell.execute_reply":"2022-09-03T11:30:43.037011Z"},"trusted":true},"execution_count":131,"outputs":[]},{"cell_type":"markdown","source":"> *We can understand this data based on the transit method for detecting exoplanets.* \n\n> *There are total of 5087 stars in this data.*\n\n> *For each star, we have 3197 flux values at different time intervals.*\n\n> *These flux values are used to plot the light curves we saw earlier to detect if a star has exoplanet(s) orbiting it.*","metadata":{"id":"svIV1JYibVhU"}},{"cell_type":"markdown","source":"#### 
Check for Missing Values","metadata":{"id":"u0bA_C5QbVhV"}},{"cell_type":"code","source":"# Display the rows with null values\ntrain_df[train_df.isnull().any(axis = 1)] # axis = 1 ---> column","metadata":{"id":"oF7ID3AzbVhW","execution":{"iopub.status.busy":"2022-09-03T11:43:09.469214Z","iopub.execute_input":"2022-09-03T11:43:09.469565Z","iopub.status.idle":"2022-09-03T11:43:09.504882Z","shell.execute_reply.started":"2022-09-03T11:43:09.469540Z","shell.execute_reply":"2022-09-03T11:43:09.503581Z"},"trusted":true},"execution_count":172,"outputs":[]},{"cell_type":"markdown","source":"> *There are **no missing values**! We can also visualise it through heatmap.*","metadata":{"id":"8s9tsNwEbVhW"}},{"cell_type":"code","source":"sns.heatmap(train_df.isnull(), cmap = 'Set2', cbar = False)","metadata":{"id":"yk-bqlxqbVhX","execution":{"iopub.status.busy":"2022-09-03T11:20:36.880845Z","iopub.execute_input":"2022-09-03T11:20:36.881519Z","iopub.status.idle":"2022-09-03T11:20:54.683607Z","shell.execute_reply.started":"2022-09-03T11:20:36.881455Z","shell.execute_reply":"2022-09-03T11:20:54.682386Z"},"trusted":true},"execution_count":102,"outputs":[]},{"cell_type":"markdown","source":"> *The horizontal dashes in this plot would indicate the presence of missing values in respective column.* \n\n> *As there aren't any of such dashes seen we can conclude that there are no missing values in this data.*","metadata":{"id":"GaDdz4jJbVhY"}},{"cell_type":"markdown","source":"#### Decoding labels in the data","metadata":{"id":"RoYRiT8PbVhZ"}},{"cell_type":"code","source":"# Check how many labels are 
there\ntrain_df['LABEL'].unique()","metadata":{"id":"Sido6EoHbVhZ","execution":{"iopub.status.busy":"2022-09-03T11:20:54.685026Z","iopub.execute_input":"2022-09-03T11:20:54.685359Z","iopub.status.idle":"2022-09-03T11:20:54.693111Z","shell.execute_reply.started":"2022-09-03T11:20:54.685331Z","shell.execute_reply":"2022-09-03T11:20:54.692039Z"},"trusted":true},"execution_count":103,"outputs":[]},{"cell_type":"code","source":"# Extract the index for the stars labelled as 2\nidx_lab2 = list(train_df[train_df['LABEL'] == 2].index)\nprint(f\"Index list for label 2 stars in the data:-\\n{idx_lab2}\\n\")","metadata":{"id":"OKZIPZIlbVha","execution":{"iopub.status.busy":"2022-09-03T11:20:54.694174Z","iopub.execute_input":"2022-09-03T11:20:54.695141Z","iopub.status.idle":"2022-09-03T11:20:54.714045Z","shell.execute_reply.started":"2022-09-03T11:20:54.695114Z","shell.execute_reply":"2022-09-03T11:20:54.712466Z"},"trusted":true},"execution_count":104,"outputs":[]},{"cell_type":"markdown","source":"> *There are a total of **two classes**: one for stars with exoplanets and the other for stars without exoplanets*\n\n> *The very small number of indices with label 2 suggests that this class corresponds to stars with exoplanets*\n\n> *We can also visualise this using countplot*","metadata":{"id":"uYP2HfT8bVhb"}},{"cell_type":"code","source":"# Visualise these values using countplot\nplt.figure(figsize = (3, 5)) \n# Pass the column as a keyword argument; newer seaborn versions reject it positionally\nax = sns.countplot(x = 'LABEL', data = train_df, palette = 'Set2') \nax.bar_label(ax.containers[0])\nplt.title(\"Visualising count of classes\\n1 ~ Non Exoplanets | 2 ~ Exoplanets\\n\", \n          fontsize = 15, color = 'red', weight = 
'bold')\nplt.show()","metadata":{"id":"PbQ-5DP1bVhb","execution":{"iopub.status.busy":"2022-09-03T11:20:54.715312Z","iopub.execute_input":"2022-09-03T11:20:54.715583Z","iopub.status.idle":"2022-09-03T11:20:54.845238Z","shell.execute_reply.started":"2022-09-03T11:20:54.715556Z","shell.execute_reply":"2022-09-03T11:20:54.844297Z"},"trusted":true},"execution_count":105,"outputs":[]},{"cell_type":"markdown","source":"> *There is a **huge imbalance** in the data, which isn't good for KNN (explained later in this notebook).*\n\n> *We will need to balance it using a resampling technique; we will use RandomOverSampler for this data.*\n\n> *We'll do that after building the model on the imbalanced dataset so that we can compare the results!*","metadata":{"id":"Y7W0mLcSbVhc"}},{"cell_type":"markdown","source":"#### Replacing the labels\nFor binary classification it is conventional to encode the classes as 0 and 1 \n- Stars with Exoplanets: 2 $\\rightarrow$ 1 \n- Stars without Exoplanets: 1 $\\rightarrow$ 0","metadata":{"id":"ppqriOT7bVhc"}},{"cell_type":"code","source":"# Replacing labels \nprint(\"Replacing labels...\")\ntrain_df = train_df.replace({'LABEL' : {1:0, 2:1}})\n\n# Check the labels now\nprint(\"Done!\\n\")\nuniq_val = train_df.LABEL.unique()\nprint(f\"There are {len(uniq_val)} classes in the data:-\")\n# Note: this relies on the first rows of the data being stars with exoplanets\nprint(f\"{uniq_val[0]} - Stars with Exoplanets\\n{uniq_val[1]} - Stars without Exoplanets\")","metadata":{"id":"XtqiO4AObVhd","execution":{"iopub.status.busy":"2022-09-03T11:30:55.803531Z","iopub.execute_input":"2022-09-03T11:30:55.803933Z","iopub.status.idle":"2022-09-03T11:30:55.835231Z","shell.execute_reply.started":"2022-09-03T11:30:55.803907Z","shell.execute_reply":"2022-09-03T11:30:55.834160Z"},"trusted":true},"execution_count":132,"outputs":[]},{"cell_type":"markdown","source":"#### Visualising the light curves in this data\nWhen a planet passes between an observer and the star, the flux value decreases and hence we see a dip in light curves with 
exoplanets\n","metadata":{"id":"JsYssnqibVhd"}},{"cell_type":"markdown","source":"","metadata":{"id":"sKRogUpAbVhe"}},{"cell_type":"code","source":"# Drop label column to plot only the flux values\nplot_df = train_df.drop(['LABEL'], axis = 1)\n\n# X - axis data: strip the 'FLUX.' prefix from each column name\ncol_names = list(plot_df.columns)\ntime = [int(flux_prefix.replace(\"FLUX.\", \"\")) for flux_prefix in col_names]\n\n# Function to plot flux variation of star\ndef flux_plot(df, candidate, exo = True):\n    color = 'b' if exo else 'm'\n    plt.figure(figsize=(15, 5))\n    # candidate is a 1-based star number, so the row position is candidate - 1\n    plt.plot(time, df.iloc[candidate-1], linewidth = .5, color = color)\n    title1, clr1 = f\"Flux Variation of star {candidate} with Exoplanets\", 'olive'\n    title2, clr2 = f\"Flux Variation of star {candidate} without Exoplanets\", 'tab:red'\n    plt.title(title1, color = clr1) if exo else plt.title(title2, color = clr2)\n    plt.xlabel(\"Time\")\n    plt.ylabel(\"Flux Variation\")","metadata":{"id":"FAP0AP23bVhe","execution":{"iopub.status.busy":"2022-09-03T11:31:10.514436Z","iopub.execute_input":"2022-09-03T11:31:10.514774Z","iopub.status.idle":"2022-09-03T11:31:10.548619Z","shell.execute_reply.started":"2022-09-03T11:31:10.514749Z","shell.execute_reply":"2022-09-03T11:31:10.546748Z"},"trusted":true},"execution_count":133,"outputs":[]},{"cell_type":"code","source":"# Example of light curves\nexo, n_exo = [4, 14, 34], [99, 199, 2999]\n\nfor candidate in range(len(exo)):\n    flux_plot(plot_df, exo[candidate], exo = True)\n    flux_plot(plot_df, n_exo[candidate], exo = 
False)","metadata":{"id":"1cnuHg8ebVhf","execution":{"iopub.status.busy":"2022-09-03T11:31:15.998729Z","iopub.execute_input":"2022-09-03T11:31:15.999664Z","iopub.status.idle":"2022-09-03T11:31:17.051226Z","shell.execute_reply.started":"2022-09-03T11:31:15.999636Z","shell.execute_reply":"2022-09-03T11:31:17.050051Z"},"trusted":true},"execution_count":134,"outputs":[]},{"cell_type":"markdown","source":"","metadata":{"id":"yB_vuGpKbVhf"}},{"cell_type":"markdown","source":"### Extreme outliers \n- We can see random **huge spikes**, especially in stars without exoplanets, which can be considered extreme outliers\n\n- KNN can be sensitive to outliers (explained later in this notebook), so we will need to handle them \n\n- We can also visualise these extreme outliers through boxplots","metadata":{"id":"4KL3kpy3bVhf"}},{"cell_type":"code","source":"# Boxplot to visualise outliers\nplt.figure(figsize = (20, 9))\nplt.suptitle(\"Box Plot to visualise outliers\", ha = 'right', color = 'red', weight = 'bold')\nfor i in range(1, 4):\n    plt.subplot(1, 4, i)\n    sns.boxplot(data=train_df, x='LABEL', y = 'FLUX.' + str(i))\n    plt.xlabel(\"\")\n    plt.ylabel(\"\")\n    plt.title(\"FLUX \" + str(i) + \"\\n\", color = 'b', fontsize = 13)\n","metadata":{"id":"xVTFFfj4bVhg","execution":{"iopub.status.busy":"2022-09-03T11:33:49.038561Z","iopub.execute_input":"2022-09-03T11:33:49.038902Z","iopub.status.idle":"2022-09-03T11:33:49.419792Z","shell.execute_reply.started":"2022-09-03T11:33:49.038877Z","shell.execute_reply":"2022-09-03T11:33:49.418647Z"},"trusted":true},"execution_count":142,"outputs":[]},{"cell_type":"markdown","source":"> *We can see that the flux values greater than $0.25 \\times 10^6$ are extreme outliers.* \n\n> *We can either drop such a value or replace it with the upper bridge value. 
For this use case, we will simply drop it.* \n\n> *However, you can try computing the upper bridge value yourself using the formula given below:-* \n\n> $UB = Q3 + 3 \\times IQR$; **UB** - upper bridge, **Q3** - 75th percentile, **IQR** - Interquartile range","metadata":{"id":"55GjzgeWbVhh"}},{"cell_type":"code","source":"# Get the extreme outliers\nextreme_outliers = train_df[train_df['FLUX.2'] > 0.25e6]\nextreme_outliers","metadata":{"execution":{"iopub.status.busy":"2022-09-03T11:39:46.635387Z","iopub.execute_input":"2022-09-03T11:39:46.636162Z","iopub.status.idle":"2022-09-03T11:39:46.664437Z","shell.execute_reply.started":"2022-09-03T11:39:46.636088Z","shell.execute_reply":"2022-09-03T11:39:46.663628Z"},"trusted":true},"execution_count":167,"outputs":[]},{"cell_type":"code","source":"# Drop the extreme outlier\nprint(\"Dropping Extreme Outliers...\")\ntrain_df.drop(extreme_outliers.index, axis = 0, inplace = True) # axis = 0 ----> row\nprint(\"Done!\")","metadata":{"id":"ObjdQ4wsbVhh","execution":{"iopub.status.busy":"2022-09-03T11:40:01.711204Z","iopub.execute_input":"2022-09-03T11:40:01.711589Z","iopub.status.idle":"2022-09-03T11:40:01.782348Z","shell.execute_reply.started":"2022-09-03T11:40:01.711563Z","shell.execute_reply":"2022-09-03T11:40:01.780888Z"},"trusted":true},"execution_count":168,"outputs":[]},{"cell_type":"markdown","source":"> *We have dropped the star at location 3340, which contributed the extreme outliers. Let us visualise the box plots again*","metadata":{}},{"cell_type":"code","source":"# Cross check via any random box plot\nsns.boxplot(data=train_df, x='LABEL', y = 'FLUX.' 
+ str(np.random.randint(1000)))","metadata":{"id":"hkWLtVEdbVhh","execution":{"iopub.status.busy":"2022-09-03T11:44:14.069224Z","iopub.execute_input":"2022-09-03T11:44:14.069561Z","iopub.status.idle":"2022-09-03T11:44:14.228490Z","shell.execute_reply.started":"2022-09-03T11:44:14.069538Z","shell.execute_reply":"2022-09-03T11:44:14.227697Z"},"trusted":true},"execution_count":173,"outputs":[]},{"cell_type":"markdown","source":"","metadata":{"id":"cWWnnVFVbVhi"}},{"cell_type":"markdown","source":"### Understanding K - Nearest Neighbors (KNN) Algorithm for Classification Tasks\n\n#### Why choose KNN for this task?\n- In this dataset we saw that there were outliers and that the classes were imbalanced \n- KNN is **sensitive** to both **outliers and imbalanced data**, which you will understand in a while\n- Hence, we choose this model to demonstrate how to handle outliers and an imbalanced dataset\n- Moreover, KNN is one of the simplest ML algorithms based on the Supervised Learning technique\n- KNN is more widely used for classification (binary and multiclass) than for regression tasks\n\n#### How does KNN work? 
\n\n**Steps:-** \n**1)** Select the **number of K neighbors** for the new data point \n**2)** Calculate the **Euclidean distance** from this point to the other points in the data \n**3)** Take the **K nearest neighbors** as per the calculated Euclidean distance \n**4)** Among these neighbors, **count** the number of data points of each category \n**5)** The new data point is assigned to the **category with the most data points among these neighbors** \n\n*Here is one demonstration to understand how this algorithm works:-*\n\n\n\n\n\n\n**Euclidean Distance** *between two points is simply calculated using the distance formula for two points in a cartesian coordinate system:-* $d = \\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$ \n\n\n\n#### How to select value for K in this algorithm \n- We need to try out some values of K and figure out which one works best \n- A commonly used default value of K is 5 \n- A very low value of K (K = 1, K = 2) makes the model sensitive to outliers\n- A very high value of K can bias the model towards the majority class in an imbalanced dataset\n\n \n\n- If the value of K isn't very low and the dataset is balanced, then the outliers will not really affect our model\n- Let's take K = 150: in that case you can see that the imbalance in the data can lead to a very poor result! 
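*The five steps listed earlier in this cell can be sketched in a few lines of NumPy. This is only a minimal illustration on made-up toy data, not the scikit-learn implementation used in this notebook; the helper name `knn_predict` and the arrays `X_toy` and `y_toy` are hypothetical.*

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: positions of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Steps 4 and 5: count the labels among the neighbours, return the majority
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy, made-up data: two small clusters in 2-D (an assumption for illustration)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3))
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5]), k=3))
```

*With K = 3, each query point is assigned the label of the cluster it sits in. Increasing K until it exceeds the size of the smaller class would make the larger class win every vote, which is the imbalance effect described in the bullets.*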
\n\n\n#### Advantages of KNN\n**1)** One of the simplest algorithms to understand and implement \n**2)** Depending on the value of K, it can be robust to noisy training data \n**3)** It can be more effective if the training data is large \n\n#### Disadvantages of KNN \n**1)** Sometimes determining the value of K can become a complex task \n**2)** High computation cost, as we calculate the distance from each new point to all the training data points","metadata":{"id":"EVgrEYyvbVhi"}},{"cell_type":"markdown","source":"","metadata":{"id":"zxCHlY6WbVhi"}},{"cell_type":"markdown","source":"### Implementing KNN after handling the extreme outliers but before balancing the data\n*It would be interesting to compare the results with and without imbalance in our data. Let us first start with imbalanced data:-*","metadata":{"id":"ewVyNHhebVhj"}},{"cell_type":"code","source":"# Extract dependent and independent features\nx = train_df.drop(['LABEL'], axis = 1)\ny = train_df.LABEL\n\nprint(f\"Take a look over ~\\n\\nX train array:-\\n{x.values}\\n\\nY train array:-\\n{y.values}\")","metadata":{"id":"cynthuX8bVhj","execution":{"iopub.status.busy":"2022-09-03T11:31:53.032461Z","iopub.execute_input":"2022-09-03T11:31:53.032821Z","iopub.status.idle":"2022-09-03T11:31:53.064829Z","shell.execute_reply.started":"2022-09-03T11:31:53.032796Z","shell.execute_reply":"2022-09-03T11:31:53.063371Z"},"trusted":true},"execution_count":138,"outputs":[]},{"cell_type":"code","source":"# Splitting this dataset into training and testing set\nfrom sklearn.model_selection import train_test_split\n\nX_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 
0)","metadata":{"id":"h9xwQvdsbVhk","execution":{"iopub.status.busy":"2022-09-03T11:20:56.695330Z","iopub.execute_input":"2022-09-03T11:20:56.696390Z","iopub.status.idle":"2022-09-03T11:20:56.792324Z","shell.execute_reply.started":"2022-09-03T11:20:56.696363Z","shell.execute_reply":"2022-09-03T11:20:56.791523Z"},"trusted":true},"execution_count":113,"outputs":[]},{"cell_type":"code","source":"# Feature scaling\nfrom sklearn.preprocessing import StandardScaler \n\nsc = StandardScaler()\nX_train_sc = sc.fit_transform(X_train)\nX_test_sc = sc.transform(X_test)\n\n# Checking the minimum, mean and maximum value after scaling\nprint(\"X_train after scaling ~\\n\")\nprint(f\"Minimum:- {round(np.min(X_train_sc),2)}\\nMean:- {round(np.mean(X_train_sc),2)}\\nMax:- {round(np.max(X_train_sc), 2)}\\n\")\nprint(\"--------------------------------\\n\")\nprint(\"X_test after scaling ~\\n\")\nprint(f\"Minimum:- {round(np.min(X_test_sc),2)}\\nMean:- {round(np.mean(X_test_sc),2)}\\nMax:- {round(np.max(X_test_sc), 2)}\\n\")","metadata":{"id":"VagRDCuebVhk","execution":{"iopub.status.busy":"2022-09-03T11:20:56.793203Z","iopub.execute_input":"2022-09-03T11:20:56.793411Z","iopub.status.idle":"2022-09-03T11:20:57.012448Z","shell.execute_reply.started":"2022-09-03T11:20:56.793390Z","shell.execute_reply":"2022-09-03T11:20:57.011266Z"},"trusted":true},"execution_count":114,"outputs":[]},{"cell_type":"code","source":"# Fitting the KNN Classifier Model on to the training data\nfrom sklearn.neighbors import KNeighborsClassifier as KNC\n\n# Choosing K = 5\nknn_classifier = KNC(n_neighbors=5,metric='minkowski',p=2) \n'''With the (default) minkowski metric, p = 2 gives the Euclidean distance'''\n\n# Fit the model\nknn_classifier.fit(X_train_sc, y_train)\n\n# Predict\ny_pred = knn_classifier.predict(X_test_sc)\n\n# Results\nfrom sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_curve, auc\n\nprint('\\nValidation accuracy of KNN is', 
accuracy_score(y_test,y_pred))\nprint(\"\\n-------------------------------------------------------\")\nprint (\"\\nClassification report :\\n\",(classification_report(y_test,y_pred)))\n\n#Confusion matrix\nplt.figure(figsize=(15,11))\nplt.subplots_adjust(wspace = 0.3)\nplt.suptitle(\"KNN Performance before handling the imbalance in the data\", color = 'r', weight = 'bold')\nplt.subplot(221)\nsns.heatmap(confusion_matrix(y_test,y_pred),annot=True,cmap=\"Set2\",fmt = \"d\",linewidths=3, cbar = False,\n            xticklabels=['nexo', 'exo'], yticklabels=['nexo','exo'], square = True)\n# confusion_matrix puts true labels on the rows (y-axis) and predictions on the columns (x-axis)\nplt.xlabel(\"Predicted Labels\", fontsize = 15, weight = 'bold', color = 'tab:pink')\nplt.ylabel(\"True Labels\", fontsize = 15, weight = 'bold', color = 'tab:pink')\nplt.title(\"CONFUSION MATRIX\",fontsize=20, color = 'm')\n\n#ROC curve and Area under the curve plotting\npredicting_probabilites = knn_classifier.predict_proba(X_test_sc)[:,1]\nfpr,tpr,thresholds = roc_curve(y_test,predicting_probabilites)\nplt.subplot(222)\nplt.plot(fpr,tpr,label = f\"AUC : {auc(fpr,tpr):.3f}\",color = \"g\")\nplt.plot([1,0],[1,0],\"k--\")\nplt.legend()\nplt.xlabel('False Positive Rate')\nplt.ylabel('True Positive Rate')\nplt.title(\"ROC - CURVE & AREA UNDER CURVE\",fontsize=20, color = 'm')\nplt.show()\n\n","metadata":{"id":"8I4vQE0wbVhk","execution":{"iopub.status.busy":"2022-09-03T11:20:57.015249Z","iopub.execute_input":"2022-09-03T11:20:57.015529Z","iopub.status.idle":"2022-09-03T11:20:58.403761Z","shell.execute_reply.started":"2022-09-03T11:20:57.015505Z","shell.execute_reply":"2022-09-03T11:20:58.401413Z"},"trusted":true},"execution_count":115,"outputs":[]},{"cell_type":"markdown","source":"> *Even though the accuracy is amazing, this isn't really a good model. This is due to the huge imbalance in the dataset!* \n\n> *We need to check other metrics such as **precision**, **recall** and **f1 score** for such models* \n\n
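*To see why accuracy alone is misleading on imbalanced data, here is a minimal sketch using made-up labels (not the Kepler data): a hypothetical classifier that always predicts the majority class. The 990 / 10 class split is an assumption chosen only for illustration.*

```python
import numpy as np

# Made-up labels: 990 stars without exoplanets (0) and 10 with exoplanets (1)
y_true = np.array([0] * 990 + [1] * 10)
# A hypothetical model that always predicts the majority class
y_hat = np.zeros_like(y_true)

# Entries of the confusion matrix, with class 1 as the positive class
tp = int(((y_hat == 1) & (y_true == 1)).sum())
fp = int(((y_hat == 1) & (y_true == 0)).sum())
fn = int(((y_hat == 0) & (y_true == 1)).sum())
tn = int(((y_hat == 0) & (y_true == 0)).sum())

accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when there are no positive predictions
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f'accuracy = {accuracy:.2f}, precision = {precision:.2f}, '
      f'recall = {recall:.2f}, f1 = {f1:.2f}')
```

*The accuracy comes out at 0.99 even though not a single star with an exoplanet is found, while precision, recall and f1 for the exoplanet class are all 0. The same per-class numbers are what `classification_report` reports for the real model.*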
\n \n \n> *Believing this to be a very good model based only on accuracy can lead to poor performance on unseen data*","metadata":{"id":"4V3OC_7vbVhl"}},{"cell_type":"markdown","source":"### Handling the imbalance in the data and then applying KNN\n*There are many techniques available, out of which we will be trying* ***RandomOverSampler***:- \n RandomOverSampler over-samples by duplicating randomly chosen original samples of the minority class","metadata":{"id":"n1oK_CTWbVhm"}},{"cell_type":"markdown","source":"","metadata":{"id":"luXeGVQobVhm"}},{"cell_type":"code","source":"# Handling imbalanced data using RandomOverSampler\nfrom imblearn.over_sampling import RandomOverSampler\nfrom collections import Counter\n\nros = RandomOverSampler()\nx_ros, y_ros = ros.fit_resample(x, y) # Taking the original x, y as arguments\n\nprint(f\"Before sampling:- {Counter(y)}\")\nprint(f\"After sampling:- {Counter(y_ros)}\")","metadata":{"id":"5HFfM7SbbVhm","execution":{"iopub.status.busy":"2022-09-03T11:44:41.308648Z","iopub.execute_input":"2022-09-03T11:44:41.309210Z","iopub.status.idle":"2022-09-03T11:44:42.038079Z","shell.execute_reply.started":"2022-09-03T11:44:41.309182Z","shell.execute_reply":"2022-09-03T11:44:42.037006Z"},"trusted":true},"execution_count":175,"outputs":[]},{"cell_type":"code","source":"# Visualise it\ny_ros.value_counts().plot(kind='bar', title='After applying RandomOverSampler');\n","metadata":{"id":"V0JzAejgbVhn","execution":{"iopub.status.busy":"2022-09-03T12:39:34.594791Z","iopub.execute_input":"2022-09-03T12:39:34.595283Z","iopub.status.idle":"2022-09-03T12:39:35.143271Z","shell.execute_reply.started":"2022-09-03T12:39:34.595258Z","shell.execute_reply":"2022-09-03T12:39:35.142107Z"},"trusted":true},"execution_count":216,"outputs":[]},{"cell_type":"markdown","source":"#### Repeating the above 
steps","metadata":{"execution":{"iopub.status.busy":"2022-09-03T13:33:51.118133Z","iopub.execute_input":"2022-09-03T13:33:51.118507Z","iopub.status.idle":"2022-09-03T13:33:51.126437Z","shell.execute_reply.started":"2022-09-03T13:33:51.118461Z","shell.execute_reply":"2022-09-03T13:33:51.125012Z"}}},{"cell_type":"code","source":"# ****************************************************************\n# | Performing split and scaling on the random over sampled data |\n# ****************************************************************\n\nX_train, X_test, y_train, y_test = train_test_split(x_ros, y_ros, test_size = 0.3, random_state = 0)\n\nsc = StandardScaler()\nX_train_sc = sc.fit_transform(X_train)\nX_test_sc = sc.transform(X_test)","metadata":{"id":"kgiRkAU4bVhn","execution":{"iopub.status.busy":"2022-09-03T11:20:59.632609Z","iopub.execute_input":"2022-09-03T11:20:59.633031Z","iopub.status.idle":"2022-09-03T11:21:00.265216Z","shell.execute_reply.started":"2022-09-03T11:20:59.632994Z","shell.execute_reply":"2022-09-03T11:21:00.263695Z"},"trusted":true},"execution_count":119,"outputs":[]},{"cell_type":"markdown","source":"##### Creating a function to try to fetch the optimal value of K","metadata":{}},{"cell_type":"code","source":"# Create function to fetch the optimal value of K\ndef optimal_Kval_KNN(start_k, end_k, x_train, x_test, y_train, y_test, progress = True):\n    ''' \n    This function takes in the following arguments -\n    start_k - start value of k\n    end_k - end value of k (inclusive)\n    x_train - independent training values for training the KNN\n    x_test - independent testing values for prediction\n    y_train - dependent training values for training KNN\n    y_test - dependent testing values for computing error rate\n    progress - if True, shows the error rate for each k (by default it is set to True)\n    '''\n    # Header\n    print(f\"Fetching the optimal value of K in between {start_k} & {end_k} ~\\n\\nIn progress...\")\n    \n    # Empty list to append error rate\n    mean_err = []\n    for K in 
range(start_k, end_k + 1): # Generates K from start_k to end_k (inclusive)\n        knn = KNC(n_neighbors = K) # Build KNN for respective K value\n        knn.fit(x_train, y_train) # Train the model\n        err_rate = np.mean(knn.predict(x_test) != y_test) # Get the error rate\n        mean_err.append(err_rate) # Append it\n        # If progress is true display the error rate for each K\n        if progress:\n            print(f'For K = {K}, mean error = {err_rate:.3}')\n    \n    # Get the optimal value of k and corresponding value of mean error\n    # (offset the list position by start_k so this also works when start_k != 1)\n    k, val = mean_err.index(min(mean_err)) + start_k, min(mean_err)\n    \n    # Footer\n    print('\\nDone! Here is how error rate varies wrt to K values:- \\n')\n    \n    # Display how error rate changes wrt K values and mark the optimal K value\n    plt.figure(figsize = (5,5))\n    plt.plot(range(start_k,end_k + 1), mean_err, 'mo--', markersize = 8, markerfacecolor = 'c',\n             linewidth = 1) # plots all mean error wrt K values\n    plt.plot(k, val, marker = 'o', markersize = 8, markerfacecolor = 'gold', \n             markeredgecolor = 'g') # highlights the optimal K\n    plt.title(f\"The optimal performance is obtained at K = {k}\", color = 'r', weight = 'bold',\n              fontsize = 15)\n    plt.ylabel(\"Error Rate\", color = 'olive', fontsize = 13)\n    plt.xlabel(\"K values\", color = 'olive', fontsize = 13)\n    \n    # Return the optimal value of k\n    return k","metadata":{"id":"ndmM1A10bVhn","execution":{"iopub.status.busy":"2022-09-03T11:21:00.266658Z","iopub.execute_input":"2022-09-03T11:21:00.266914Z","iopub.status.idle":"2022-09-03T11:21:00.278081Z","shell.execute_reply.started":"2022-09-03T11:21:00.266891Z","shell.execute_reply":"2022-09-03T11:21:00.277245Z"},"trusted":true},"execution_count":120,"outputs":[]},{"cell_type":"code","source":"k = optimal_Kval_KNN(1, 10, X_train_sc, X_test_sc, y_train, 
y_test)","metadata":{"id":"Ju3OoHr0uMLh","execution":{"iopub.status.busy":"2022-09-03T11:21:00.279335Z","iopub.execute_input":"2022-09-03T11:21:00.279642Z","iopub.status.idle":"2022-09-03T11:21:21.693217Z","shell.execute_reply.started":"2022-09-03T11:21:00.279615Z","shell.execute_reply":"2022-09-03T11:21:21.692435Z"},"trusted":true},"execution_count":121,"outputs":[]},{"cell_type":"markdown","source":"> *It seems that **K = 1** should do the job! Let's try it*","metadata":{"id":"iRPzLecDbVho"}},{"cell_type":"code","source":"# Fitting the KNN Classifier Model on to the training data after balancing\n\n# Choosing K = 1\nknn_classifier = KNC(n_neighbors=1,metric='minkowski',p=2) \n'''With the (default) minkowski metric, p = 2 gives the Euclidean distance'''\n\n# Fit the model\nknn_classifier.fit(X_train_sc, y_train)\n\n# Predict\ny_pred = knn_classifier.predict(X_test_sc)\n\n# Results\nfrom sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_curve, auc\n\nprint('\\nValidation accuracy of KNN is', accuracy_score(y_test,y_pred))\nprint(\"\\n-------------------------------------------------------\")\nprint (\"\\nClassification report :\\n\",(classification_report(y_test,y_pred)))\n\n#Confusion matrix\nplt.figure(figsize=(15,11))\nplt.subplots_adjust(wspace = 0.3)\nplt.suptitle(\"KNN Performance after handling the imbalance in the data\", color = 'b', weight = 'bold')\nplt.subplot(221)\nsns.heatmap(confusion_matrix(y_test,y_pred),annot=True,cmap=\"Set2\",fmt = \"d\",linewidths=3, cbar = False,\n            xticklabels=['nexo', 'exo'], yticklabels=['nexo','exo'], square = True)\n# confusion_matrix puts true labels on the rows (y-axis) and predictions on the columns (x-axis)\nplt.xlabel(\"Predicted Labels\", fontsize = 15, weight = 'bold', color = 'm')\nplt.ylabel(\"True Labels\", fontsize = 15, weight = 'bold', color = 'm')\nplt.title(\"CONFUSION MATRIX\",fontsize=20, color = 'purple')\n\n#ROC curve and Area under the curve plotting\npredicting_probabilites = knn_classifier.predict_proba(X_test_sc)[:,1]\nfpr,tpr,thresholds = 
roc_curve(y_test,predicting_probabilites)\nplt.subplot(222)\nplt.plot(fpr,tpr,label = f\"AUC : {auc(fpr,tpr):.3f}\",color = \"g\")\nplt.plot([1,0],[1,0], 'k--')\nplt.legend(loc = \"best\")\nplt.xlabel('False Positive Rate')\nplt.ylabel('True Positive Rate')\nplt.title(\"ROC - CURVE & AREA UNDER CURVE\",fontsize=20, color = 'm')\nplt.show()","metadata":{"id":"mLjStmN1bVho","execution":{"iopub.status.busy":"2022-09-03T14:12:09.753128Z","iopub.execute_input":"2022-09-03T14:12:09.753498Z","iopub.status.idle":"2022-09-03T14:12:13.719203Z","shell.execute_reply.started":"2022-09-03T14:12:09.753445Z","shell.execute_reply":"2022-09-03T14:12:13.717714Z"},"trusted":true},"execution_count":260,"outputs":[]},{"cell_type":"markdown","source":"> *We can now see that all the metrics we talked about earlier show good results on the test split!*","metadata":{"id":"STy7R4g_bVho"}},{"cell_type":"markdown","source":"","metadata":{"id":"OXKBAceNbVhp"}},{"cell_type":"markdown","source":"### Task for you\n\n- Is this model working well for unseen data (test set)? Try it yourself! \n- If **yes**,\n - *Can you name any other models that would work better than KNN?*\n - *Try building a model that would work better than what you get after testing the unseen data*\n - [*Submit*](https://forms.gle/gVW3Spv148dMzamo9) *your notebook by completing these tasks and get certified for solving this problem*\n- If **not**, \n - *What might we have overlooked while building our model?*\n - *Can you come up with a better model (not necessary to use KNN) that will work well with the test set?* \n - [*Submit*](https://forms.gle/gVW3Spv148dMzamo9) *your notebook by completing these tasks and get certified for solving this problem*\n","metadata":{"id":"nWsq06aJbVhp"}},{"cell_type":"markdown","source":"---\n\n#
THE END","metadata":{"id":"StZ16ktLhn_I"}}]} -------------------------------------------------------------------------------- /rspd-1222.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "source": [ 6 | "*This notebook was designed for the YouTube Academic Projects Series by Spartificial Innovations Pvt Ltd. However, anyone interested in understanding applications of Machine Learning through real-world projects is welcome to use it.* \n", 7 | "*Please contact team@spartificial.com or visit https://spartificial.com/ to know more.* \n", 8 | "\n", 9 | "*The dataset and some parts of the code have been taken from the solar-panel-detection public repo by arathee2 on [GitHub](https://github.com/arathee2/solar-panel-detection/blob/master/code/solar-panel-detection.py).*\n" 10 | ], 11 | "metadata": { 12 | "id": "xgtHfbP6pEZi" 13 | } 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "source": [ 18 | "#
Rooftop Solar Panel Detection using Deep Learning\n", 19 | "\n", 20 | "---\n", 21 | "\n", 22 | "**Project ID: RSPD-1222** \n", 23 | "\n", 24 | "**Project Name: Rooftop Solar Panel Detection using Deep Learning**\n", 25 | "\n", 26 | "---\n", 27 | "\n", 28 | "
" 29 | ], 30 | "metadata": { 31 | "id": "jN-3VroKqdRA" 32 | } 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "source": [ 37 | "## Workflow of this notebook\n", 38 | "**1)** [Introducing the Problem](#h1) \n", 39 | "**2)** [Understanding the Dataset](#h2) \n", 40 | "**3)** [Importing necessary libraries and modules for this notebook](#h3) \n", 41 | "**4)** [Exploratory Analysis & Data Scaling](#h4) \n", 42 | "**5)** [Building & Tuning our CNN Model](#h5) \n", 43 | "**6)** [Model Evaluation & Results](#h6) \n", 44 | "**7)** [Task for You](#h7)" 45 | ], 46 | "metadata": { 47 | "id": "Znrsn-1rq16a" 48 | } 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "source": [ 53 | "# Introduction to the problem \n", 54 | "\n", 55 | "### Why should solar panels be detected?\n", 56 | "\n", 57 | "

Presently, 1% of the electricity produced worldwide comes from solar energy. Predictions for solar energy production indicate a possible 65-fold increase in output by 2050, which would make solar energy one of the world's greatest sources of energy at that point. Thirty percent of this energy is thought to be produced by solar photovoltaic (solar PV) power systems mounted on rooftops. Solar PV power has already started to play an increasingly large part in the generation of electricity in the US in recent years: solar energy production increased by 75,123 GWh between 2008 and 2017, a 39-fold increase.\n", 58 | "\n", 59 | "

Here's an overview of the global growth:\n", 60 | "\n", 61 | "![Capture218.jpg]()\n", 62 | "\n", 63 | "Credits: Bloomberg\n", 64 | "\n", 65 | "

Granular data on distributed rooftop solar PV is becoming increasingly important as solar photovoltaic (PV) becomes a significant segment of the energy industry. An imagery-based solar panel recognition algorithm that can be used to create detailed databases of installations and their power capacity would be extremely helpful to solar power suppliers and consumers, urban planners, grid system operators, and energy policy makers. Another motivation for automated detection is that solar panel installers typically keep installation details to themselves. A well-established solar panel detection technique or algorithm is therefore urgently needed. However, little work has been done so far on identifying solar panels in aerial or satellite photographs.\n", 66 | "\n", 67 | "

We first require a labelled data-set of satellite images in order to create an algorithm that can recognise solar panels from aerial or satellite imagery.\n" 68 | ], 69 | "metadata": { 70 | "id": "-VGVSOmgoqtq" 71 | } 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "source": [ 76 | "# Understanding the Dataset\n", 77 | "\n", 78 | "#####

Here are a few snippets from the dataset - Images containing Solar Panels \n", 79 | "\n", 80 | "
\n", 81 | "\n", 82 | "#####
Here are a few snippets from the dataset - Images NOT containing Solar Panels\n", 83 | "
\n", 84 | "\n", 85 | "\n", 86 | "

When examining the photographs themselves, it is clear that solar panels frequently have rectangular shapes with distinct angles and borders. However, the full images that include solar PV do not necessarily share the same structure. The solar panels, which come in a range of sizes and hues, are not always at the centre of the images. Additionally, the background scenery in the photographs of the two classes is not uniform: both classes contain examples of home swimming pools, pavement, grass, and rooftops. A model should also be able to predict the same class independent of the orientation of each image." 87 | ], 88 | "metadata": { 89 | "id": "PKm9aP3Quf2O" 90 | } 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "source": [ 95 | "# Importing necessary libraries and modules for this notebook" 96 | ], 97 | "metadata": { 98 | "id": "qIy76Rc0BENn" 99 | } 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": { 105 | "id": "rI4GdpfgMmfe" 106 | }, 107 | "outputs": [], 108 | "source": [ 109 | "# IMPORT REQUIRED LIBRARIES AND FUNCTIONS\n", 110 | "\n", 111 | "\n", 112 | "'''Data Handling & Linear Algebra'''\n", 113 | "import numpy as np\n", 114 | "import pandas as pd\n", 115 | "\n", 116 | "'''Visualisation'''\n", 117 | "import matplotlib.pyplot as plt\n", 118 | "import matplotlib as mpl\n", 119 | "from pylab import rcParams\n", 120 | "import seaborn as sns\n", 121 | "\n", 122 | "'''Data Analysis'''\n", 123 | "from sklearn.model_selection import StratifiedKFold\n", 124 | "from sklearn.metrics import roc_auc_score\n", 125 | "from sklearn.metrics import roc_curve\n", 126 | "from sklearn.metrics import confusion_matrix\n", 127 | "\n", 128 | "'''Manipulating Data and Model Building'''\n", 129 | "from keras.layers import Conv2D\n", 130 | "from keras.layers import Dense\n", 131 | "from keras.layers import GlobalMaxPooling2D\n", 132 | "from keras.layers import MaxPooling2D\n", 133 | "from keras.layers import BatchNormalization\n", 134 | 
"from keras.layers import Add\n", 135 | "from keras.models import Sequential" 136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "source": [ 141 | "### Importing Google Drive for Dataset Access" 142 | ], 143 | "metadata": { 144 | "id": "bvIn56Q0ArnN" 145 | } 146 | }, 147 | { 148 | "cell_type": "markdown", 149 | "source": [ 150 | "- Download this dataset to your system.\n", 151 | "- Upload the 'data' folder directly to your 'Main Drive'." 152 | ], 153 | "metadata": { 154 | "id": "0x3uRuOYnyf7" 155 | } 156 | }, 157 | { 158 | "cell_type": "code", 159 | "source": [ 160 | "from google.colab import drive\n", 161 | "drive.mount('/content/drive')" 162 | ], 163 | "metadata": { 164 | "id": "MCFAQAs9YS93" 165 | }, 166 | "execution_count": null, 167 | "outputs": [] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": { 173 | "id": "nazMB-iPHyhz" 174 | }, 175 | "outputs": [], 176 | "source": [ 177 | "# define dataset directories - the paths below won't work if you haven't placed the 'data' folder in your 'Main Drive'\n", 178 | "DIR_TRAIN_IMAGES = \"/content/drive/MyDrive/data/training/\"\n", 179 | "DIR_TRAIN_LABELS = \"/content/drive/MyDrive/data/labels_training.csv\"" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "source": [ 185 | "# Exploratory Analysis & Data Scaling\n", 186 | "\n", 187 | "\n" 188 | ], 189 | "metadata": { 190 | "id": "coIj9Hc42GzC" 191 | } 192 | }, 193 | { 194 | "cell_type": "code", 195 | "source": [ 196 | "pd.read_csv(DIR_TRAIN_LABELS).head()" 197 | ], 198 | "metadata": { 199 | "id": "XffklsVltSdU" 200 | }, 201 | "execution_count": null, 202 | "outputs": [] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "source": [ 207 | "- `id` is the image file name without the `.tif` extension\n", 208 | "- `label` has two values:\n", 209 | " - 0: No solar panels in the image\n", 210 | " - 1: Solar panels present in the image" 211 | ], 212 | "metadata": { 213 | "id": "iE5jH577tWAL" 214 | } 215 | }, 216 | { 217 | "cell_type": 
"code", 218 | "source": [ 219 | "# LOADING DATA AND PREPROCESSING\n", 220 | "\n", 221 | "def load_data(dir_data, dir_labels):\n", 222 | " '''\n", 223 | " dir_data: Data directory\n", 224 | " dir_labels: Respective csv file containing ids and labels\n", 225 | " returns: Array of all the image arrays and their respective labels\n", 226 | " '''\n", 227 | " labels_pd = pd.read_csv(dir_labels) # Read the csv file with labels and ids as we saw above\n", 228 | " ids = labels_pd.id.values # Extracting ids from the csv file\n", 229 | " data = [] # Initialising the empty list to store each image as numpy array\n", 230 | " for identifier in ids: # Looping over the image ids\n", 231 | " fname = dir_data + str(identifier) + '.tif' # Generating the file name\n", 232 | " image = mpl.image.imread(fname) # Reading image as numpy array using matplotlib\n", 233 | " data.append(image) # Appending this array to the list, then repeating for the next id\n", 234 | " data = np.array(data) # Now, convert the data list into data array\n", 235 | " labels = labels_pd.label.values # Extract labels from the csv file\n", 236 | " return data, labels # Return the array of data and respective labels" 237 | ], 238 | "metadata": { 239 | "id": "9YrdmgifoLoh" 240 | }, 241 | "execution_count": null, 242 | "outputs": [] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": { 248 | "id": "OszJkbgrH1SV" 249 | }, 250 | "outputs": [], 251 | "source": [ 252 | "# load train data - time consuming code cell\n", 253 | "X, y = load_data(DIR_TRAIN_IMAGES, DIR_TRAIN_LABELS)" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "source": [ 259 | "# display the images with and without solar panels\n", 260 | "plt.figure(figsize = (13,8)) # Adjust the figure size\n", 261 | "for i in range(6): # For first 6 images in the data\n", 262 | " plt.subplot(2, 3, i+1) # Create subplots\n", 263 | " plt.imshow(X[i]) # Show the respective image in its respective position\n", 264 | " 
if y[i] == 0: # If label is 0\n", 265 | " title = 'No Solar Panels in this image' # Set this as the title\n", 266 | " else: # Else label is 1\n", 267 | " title = 'Solar Panels in this image' # Set this as the title\n", 268 | " plt.title(title, color = 'r', weight = 'bold') # Adding title to each image in the subplot\n", 269 | "plt.tight_layout() # Automatically adjusts the width and height between images in subplot\n", 270 | "plt.show() # Display the subplot" 271 | ], 272 | "metadata": { 273 | "id": "ZBgNCx7hxbJK" 274 | }, 275 | "execution_count": null, 276 | "outputs": [] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": null, 281 | "metadata": { 282 | "id": "uUGT3ZPBH5jd" 283 | }, 284 | "outputs": [], 285 | "source": [ 286 | "# print data shape\n", 287 | "print('X shape:\\n', X.shape)" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "source": [ 293 | "- 1500 total images in the training data\n", 294 | "- Each image is of shape (101 x 101 x 3)" 295 | ], 296 | "metadata": { 297 | "id": "KRu_Ivtf2UlF" 298 | } 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": null, 303 | "metadata": { 304 | "id": "A5AYGyrHH7wF" 305 | }, 306 | "outputs": [], 307 | "source": [ 308 | "# check number of samples per class\n", 309 | "print('Distribution of y', np.bincount(y))" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "source": [ 315 | "- Out of 1500 images:\n", 316 | " - 995 images are without any solar panels\n", 317 | " - 505 images are with solar panels" 318 | ], 319 | "metadata": { 320 | "id": "X4H5f-nr2k3b" 321 | } 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": null, 326 | "metadata": { 327 | "id": "3Pn9Dq3dH829" 328 | }, 329 | "outputs": [], 330 | "source": [ 331 | "# scale pixel values between 0 and 1\n", 332 | "X = X / 255.0" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "source": [ 338 | "# Building the CNN Model\n", 339 | "\n", 340 | "

A convolutional neural network (CNN) is a type of neural network designed for data with a grid-like topology, such as images. CNNs are well recognised for their effectiveness in computer vision applications including image classification, image clustering, and object identification. At their core, CNNs use convolution in place of general matrix multiplication in at least one of their layers. Like other neural networks, they are structured as a series of layers. In the layers of a CNN, neurons are arranged in three dimensions: width, height, and depth. Although there are many kinds of CNN architectures, they are all well suited to image recognition because they process each pixel in relation to its surroundings.\n", 341 | "\n", 342 | "
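To make convolution and pooling concrete, here is a minimal NumPy sketch. It is not the Keras implementation used in this notebook: the helper names `conv2d_valid`, `relu` and `max_pool_2x2`, the random 101 x 101 single-channel image and the random 3 x 3 kernel are all assumptions made purely for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    '''Slide the kernel over the image with stride 1 and no padding.

    Like most deep learning libraries, this computes a cross-correlation
    (the kernel is not flipped), which is what is usually called convolution.
    '''
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # each output pixel is a weighted sum of its local neighbourhood
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    '''Keep positive values as they are and set negative values to zero.'''
    return np.maximum(x, 0)

def max_pool_2x2(feature_map):
    '''Downsample by keeping the maximum of each non-overlapping 2x2 block.'''
    h2, w2 = feature_map.shape[0] // 2, feature_map.shape[1] // 2
    trimmed = feature_map[:h2 * 2, :w2 * 2]  # drop an odd last row/column
    return trimmed.reshape(h2, 2, w2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((101, 101))        # hypothetical single-channel image
kernel = rng.standard_normal((3, 3))  # hypothetical 3x3 filter

features = relu(conv2d_valid(image, kernel))
pooled = max_pool_2x2(features)
print(features.shape)  # (99, 99)
print(pooled.shape)    # (49, 49)
```

With a 101 x 101 input, a 3 x 3 kernel without padding yields a 99 x 99 feature map (101 - 3 + 1), and 2 x 2 pooling reduces it to 49 x 49.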

\n", 343 | "\n", 344 | "The convolutional layer applies a convolution operation and passes the output to the next layer. The pooling layer performs a downsampling operation by combining the outputs of a group of neurons at one layer into a single neuron in the next layer. The flatten layer reshapes the feature map into a column. The fully connected layer computes the class scores; each neuron in this layer is connected to all the neurons in the previous one.\n", 345 | "\n", 346 | "*Credits* - [MathWorks](https://www.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html)" 347 | ], 348 | "metadata": { 349 | "id": "a4Hu7o9f2RFk" 350 | } 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": null, 355 | "metadata": { 356 | "id": "UI5gk1HGH_c-" 357 | }, 358 | "outputs": [], 359 | "source": [ 360 | "# MODEL : CONVOLUTIONAL NEURAL NETWORK\n", 361 | "\n", 362 | "# define CNN\n", 363 | "def build_model():\n", 364 | " '''\n", 365 | " Returns a Keras CNN model\n", 366 | " '''\n", 367 | "\n", 368 | " # define image dimensions\n", 369 | " IMAGE_HEIGHT = 101\n", 370 | " IMAGE_WIDTH = 101\n", 371 | " IMAGE_CHANNELS = 3\n", 372 | "\n", 373 | " # define a straightforward sequential neural network\n", 374 | " model = Sequential()\n", 375 | "\n", 376 | " # layer-1\n", 377 | " # a filter is a convolutional matrix applied across the image; here we use 32 filters\n", 378 | " # kernel size 3 means each filter is a 3x3 matrix\n", 379 | " # relu keeps positive values as they are and sets negative values to zero\n", 380 | " model.add(Conv2D(filters=32,\n", 381 | " kernel_size=3,\n", 382 | " activation='relu',\n", 383 | " input_shape=(IMAGE_HEIGHT,\n", 384 | " IMAGE_WIDTH,\n", 385 | " IMAGE_CHANNELS)))\n", 386 | "\n", 387 | " # adding a normalizing layer to improve the speed of training\n", 388 | " model.add(BatchNormalization())\n", 389 | "\n", 390 | " # As we move forward through the layers the patterns get more complex,\n", 391 | " # so we try to capture the maximum combinations in subsequent layers\n", 392 | " # layer-2\n", 393 | " model.add(Conv2D(filters=64,\n", 394 | " kernel_size=3,\n", 395 | " activation='relu'))\n", 396 | " model.add(BatchNormalization())\n", 397 | "\n", 398 | " # layer-3\n", 399 | " model.add(Conv2D(filters=128,\n", 400 | " kernel_size=3,\n", 401 | " activation='relu'))\n", 402 | " model.add(BatchNormalization())\n", 403 | "\n", 404 | " # Pooling layer reduces the dimensions of the feature map by summarizing the presence of features\n", 405 | " # max-pool - sends only the strongest activation of each 2x2 window to the next layer\n", 406 | " model.add(MaxPooling2D(pool_size=2))\n", 407 | "\n", 408 | " # layer-4\n", 409 | " model.add(Conv2D(filters=64,\n", 410 | " kernel_size=3,\n", 411 | " activation='relu'))\n", 412 | " model.add(BatchNormalization())\n", 413 | "\n", 414 | " # layer-5\n", 415 | " model.add(Conv2D(filters=128,\n", 416 | " kernel_size=3,\n", 417 | " activation='relu'))\n", 418 | " model.add(BatchNormalization())\n", 419 | "\n", 420 | " # max-pool\n", 421 | " model.add(MaxPooling2D(pool_size=2))\n", 422 | "\n", 423 | " # layer-6\n", 424 | " model.add(Conv2D(filters=64,\n", 425 | " kernel_size=3,\n", 426 | " activation='relu'))\n", 427 | " model.add(BatchNormalization())\n", 428 | "\n", 429 | " # layer-7\n", 430 | " model.add(Conv2D(filters=128,\n", 431 | " kernel_size=3,\n", 432 | " activation='relu'))\n", 433 | " model.add(BatchNormalization())\n", 434 | "\n", 435 | " # global-max-pool - downsamples by taking the maximum over the height and width dimensions of each feature map\n", 436 | " # using it as a substitute for Flatten before passing it to the final layer\n", 437 | " model.add(GlobalMaxPooling2D())\n", 438 | "\n", 439 | " # output layer\n", 440 | " model.add(Dense(1, activation='sigmoid'))\n", 441 | "\n", 442 | " # compile model\n", 443 | " model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])\n", 444 | "\n", 445 | " return model\n" 446 | ] 447 | }, 448 | { 449 | "cell_type": "markdown", 450 | "source": [ 
451 | "## Checking the Performance of our CNN Model" 452 | ], 453 | "metadata": { 454 | "id": "lmBI2xps6EXI" 455 | } 456 | }, 457 | { 458 | "cell_type": "code", 459 | "execution_count": null, 460 | "metadata": { 461 | "id": "LvC8xK7cIPTQ" 462 | }, 463 | "outputs": [], 464 | "source": [ 465 | "# cross-validate CNN model\n", 466 | "def cv_performance_assessment(X, y, num_folds, clf, random_seed=1):\n", 467 | " '''\n", 468 | " Cross validated performance assessment\n", 469 | "\n", 470 | " Input:\n", 471 | " X: training data\n", 472 | " y: training labels\n", 473 | " num_folds: number of folds for cross validation\n", 474 | " clf: classifier to use\n", 475 | "\n", 476 | " Divide the training data into k folds of training and validation data.\n", 477 | " For each fold the classifier will be trained on the training data and\n", 478 | " tested on the validation data. The classifier prediction scores are\n", 479 | " aggregated and output.\n", 480 | " '''\n", 481 | "\n", 482 | " prediction_scores = np.empty(y.shape[0], dtype='object')\n", 483 | "\n", 484 | " # establish the num_folds folds\n", 485 | " kf = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=random_seed)\n", 486 | "\n", 487 | " for train_index, val_index in kf.split(X, y):\n", 488 | " # extract the training and validation data for this fold\n", 489 | " X_train, X_val = X[train_index], X[val_index]\n", 490 | " y_train = y[train_index]\n", 491 | "\n", 492 | " # give more weight to minority class based on the target class distribution\n", 493 | " class_weight = {0: 505/1500, 1: 995/1500}\n", 494 | "\n", 495 | " # train the classifier (note: the same model instance keeps training across folds, so its weights carry over from one fold to the next)\n", 496 | " training = clf.fit(x=X_train,\n", 497 | " y=y_train,\n", 498 | " class_weight=class_weight,\n", 499 | " batch_size=32,\n", 500 | " epochs=10,\n", 501 | " shuffle=True,\n", 502 | " verbose=1)\n", 503 | "\n", 504 | " # test the classifier on the validation data for this fold\n", 505 | " y_val_pred_probs = clf.predict(X_val).reshape((-1, ))\n", 506 | "\n", 507 | 
" # save the predictions for this fold\n", 508 | " prediction_scores[val_index] = y_val_pred_probs\n", 509 | "\n", 510 | " return prediction_scores" 511 | ] 512 | }, 513 | { 514 | "cell_type": "code", 515 | "execution_count": null, 516 | "metadata": { 517 | "id": "gPpE3xp0IkpW" 518 | }, 519 | "outputs": [], 520 | "source": [ 521 | "# number of folds (k): in each round one fold is used for validation and the other k-1 folds are used for training\n", 522 | "num_folds = 3\n", 523 | "\n", 524 | "# seed for the random number generator, fixed so that the fold split is reproducible\n", 525 | "random_seed = 1\n", 526 | "\n", 527 | "# build_model() function returns the predefined sequential model\n", 528 | "cnn = build_model()\n", 529 | "\n", 530 | "# let's look at the summary of the model\n", 531 | "cnn.summary()\n", 532 | "\n", 533 | "# generate the probabilities (y_pred_prob)\n", 534 | "cnn_y_hat_prob = cv_performance_assessment(X, y, num_folds, cnn, random_seed=random_seed)" 535 | ] 536 | }, 537 | { 538 | "cell_type": "markdown", 539 | "source": [ 540 | "Looking at the True Positives, False Negatives, False Positives & True Negatives -\n", 541 | "\n" 542 | ], 543 | "metadata": { 544 | "id": "znKpsZ0GJrlG" 545 | } 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": null, 550 | "metadata": { 551 | "id": "w7aqS0JtMFlP" 552 | }, 553 | "outputs": [], 554 | "source": [ 555 | "df = pd.read_csv(DIR_TRAIN_LABELS) # Create a data frame of labels\n", 556 | "df[\"predicted_class\"] = [1 if pred >= 0.5 else 0 for pred in cnn_y_hat_prob] # Add a column to it for predicted class\n", 557 | "\n", 558 | "# Get the ids for FN, FP, TP, TN (using the predicted_class column created above)\n", 559 | "fn = np.array(df[(df['label'] == 1) & (df['predicted_class'] == 0)]['id']) # False Negative\n", 560 | "fp = np.array(df[(df['label'] == 0) & (df['predicted_class'] == 1)]['id']) # False Positive\n", 561 | "tp = np.array(df[(df['label'] == 1) & (df['predicted_class'] == 1)]['id']) # True Positive\n", 562 | "tn = 
np.array(df[(df['label'] == 0) & (df['predicted_class'] == 0)]['id']) # True Negative" 563 | ] 564 | }, 565 | { 566 | "cell_type": "code", 567 | "execution_count": null, 568 | "metadata": { 569 | "id": "DFcAXCOLewjz" 570 | }, 571 | "outputs": [], 572 | "source": [ 573 | "# Visuals of TP, TN, FP, and FN\n", 574 | "def show_images(image_ids, num_images, title, color):\n", 575 | " '''\n", 576 | " Display a subset of images from the image_ids data\n", 577 | " '''\n", 578 | " rcParams['figure.figsize'] = 20, 4 # Adjusting figure size\n", 579 | " plt.figure() # Generating figure\n", 580 | " n = 1 # index where plot should appear in subplot\n", 581 | " for i in image_ids[0:num_images]: # Run a loop for total number of images to display\n", 582 | " plt.subplot(1, num_images, n) # Generate a subplot\n", 583 | " plt.imshow(X[i, :, :, :]) # Display the image\n", 584 | " plt.title('Image id: ' + str(i)) # Add title\n", 585 | " plt.axis('off') # Turn off the axis\n", 586 | " n+=1 # Incrementing index by 1\n", 587 | " plt.suptitle('\\n'+title, fontsize=15, color = color, weight = 'bold') # Adding main title to subplot\n", 588 | " plt.show() # Display the final output" 589 | ] 590 | }, 591 | { 592 | "cell_type": "code", 593 | "execution_count": null, 594 | "metadata": { 595 | "id": "62gClVAxMMXe" 596 | }, 597 | "outputs": [], 598 | "source": [ 599 | "num_images = 7 # number of images to look at\n", 600 | "show_images(tp, num_images, 'Examples of True Positives - Predicted solar panels when they were present', 'g')\n", 601 | "show_images(fp, num_images, 'Examples of False Positives - Predicted solar panels even though they were not present', 'r')\n", 602 | "show_images(tn, num_images, 'Examples of True Negatives - Predicted no solar panels when they were not present', 'g')\n", 603 | "show_images(fn, num_images, 'Examples of False Negatives - Predicted no solar panels even though they were present', 'r')" 604 | ] 605 | }, 606 | { 607 | "cell_type": "markdown", 608 | "source": [ 609 | "# Model 
Evaluation & Results" 610 | ], 611 | "metadata": { 612 | "id": "mxpaXfysBxwK" 613 | } 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "source": [ 618 | "## Understanding ROC Curves -\n", 619 | "\n", 620 | "An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:\n", 621 | "\n", 622 | "* True Positive Rate\n", 623 | "* False Positive Rate\n", 624 | "\n", 625 | "
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:\n", 626 | "\n", 627 | "$$TPR = \\frac{TP}{TP + FN}$$\n", 628 | "\n", 629 | "False Positive Rate (FPR) is defined as follows:\n", 630 | "\n", 631 | "$$FPR = \\frac{FP}{FP + TN}$$\n", 632 | "\n", 633 | "An ROC curve plots TPR vs. FPR at different classification thresholds.\n", 634 | "\n", 635 | "Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
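As a rough illustration of how the threshold moves a classifier along the ROC curve, the sketch below computes TPR and FPR by hand at a few thresholds. The helper name `tpr_fpr` and the six toy labels and scores are made up for this example; they are not taken from the solar panel dataset.

```python
import numpy as np

def tpr_fpr(y_true, y_score, threshold):
    '''Return (TPR, FPR) when scores >= threshold are classified as positive.'''
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    # TPR = TP / (TP + FN), FPR = FP / (FP + TN)
    return tp / (tp + fn), fp / (fp + tn)

# hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]

# lowering the threshold can only keep or increase both rates
for threshold in (0.7, 0.5, 0.3):
    tpr, fpr = tpr_fpr(y_true, y_score, threshold)
    print(f'threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}')
```

Each threshold gives one (FPR, TPR) point; sweeping the threshold over all values traces out the full ROC curve, which is what `roc_curve` from scikit-learn does for the model's predicted probabilities.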
\n", 636 | "\n", 637 | "#### AUC: Area Under the ROC Curve -\n", 638 | "AUC stands for \"Area under the ROC Curve.\" That is, AUC measures the entire two-dimensional area underneath the entire ROC curve from (0,0) to (1,1).\n", 639 | "AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the **probability that the model ranks a random positive example more highly than a random negative example**. AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0." 640 | ], 641 | "metadata": { 642 | "id": "yjGRyWy8CL6T" 643 | } 644 | }, 645 | { 646 | "cell_type": "code", 647 | "source": [ 648 | "# ROC - AUC\n", 649 | "def plot_roc(y_true, y_pred_cnn):\n", 650 | " '''\n", 651 | " Plots the ROC curve for the CNN model.\n", 652 | " '''\n", 653 | " plt.figure(figsize=(8, 8))\n", 654 | "\n", 655 | " # ROC of CNN\n", 656 | " fpr, tpr, _ = roc_curve(y_true, y_pred_cnn, pos_label=1)\n", 657 | " auc = roc_auc_score(y_true, y_pred_cnn)\n", 658 | " legend_string = 'CNN Model - AUC = {:0.3f}'.format(auc)\n", 659 | " plt.plot(fpr, tpr, color='red', label=legend_string)\n", 660 | "\n", 661 | " # ROC of chance\n", 662 | " plt.plot([0, 1], [0, 1], '--', color='gray', label='Chance - AUC = 0.5')\n", 663 | "\n", 664 | " # plot aesthetics\n", 665 | " plt.xlabel('False Positive Rate')\n", 666 | " plt.ylabel('True Positive Rate')\n", 667 | " plt.grid(True)\n", 668 | " plt.axis('square')\n", 669 | " plt.legend()\n", 670 | " plt.tight_layout()\n", 671 | " plt.title('ROC Curve', fontsize=10)\n", 672 | " pass" 673 | ], 674 | "metadata": { 675 | "id": "EqZQyaXUIAc3" 676 | }, 677 | "execution_count": null, 678 | "outputs": [] 679 | }, 680 | { 681 | "cell_type": "code", 682 | "execution_count": null, 683 | "metadata": { 684 | "id": "1lFySboKMYgu" 685 | }, 686 | "outputs": [], 687 | "source": [ 688 | "# plot ROC\n", 689 | "y_pred = [1 if pred >= 0.5 
else 0 for pred in cnn_y_hat_prob]\n", 690 | "plot_roc(y, cnn_y_hat_prob)\n", 691 | "plot_roc(y, y_pred)" 692 | ] 693 | }, 694 | { 695 | "cell_type": "markdown", 696 | "source": [ 697 | "**CONFUSION MATRIX** \n", 698 | "" 699 | ], 700 | "metadata": { 701 | "id": "7bYlLxktcI9b" 702 | } 703 | }, 704 | { 705 | "cell_type": "code", 706 | "source": [ 707 | "plt.figure(figsize=(5,5))\n", 708 | "sns.heatmap(confusion_matrix(y, y_pred), annot = True, cbar = False, fmt='.0f')\n", 709 | "plt.show()" 710 | ], 711 | "metadata": { 712 | "id": "-1ZDcT2uVYC0" 713 | }, 714 | "execution_count": null, 715 | "outputs": [] 716 | }, 717 | { 718 | "cell_type": "markdown", 719 | "source": [ 720 | "## Task for you (Your chance to earn a certificate on completion!)\n", 721 | "\n", 722 | "- Use data augmentation to increase the size of the training data and train the model again.\n", 723 | "- If you are familiar with transfer learning, try implementing it and see if you get even better results.\n", 724 | "- In either case, write your conclusion based on what you changed and how you tested it.\n", 725 | "\n", 726 | "Submit your solution notebook using [this](https://forms.gle/Yz4mm4Lq29zA1WhF9) form! All the Best!" 727 | ], 728 | "metadata": { 729 | "id": "dfG-qanWGQjd" 730 | } 731 | } 732 | ], 733 | "metadata": { 734 | "colab": { 735 | "provenance": [], 736 | "private_outputs": true 737 | }, 738 | "kernelspec": { 739 | "display_name": "Python 3", 740 | "name": "python3" 741 | }, 742 | "language_info": { 743 | "name": "python" 744 | }, 745 | "accelerator": "GPU", 746 | "gpuClass": "standard" 747 | }, 748 | "nbformat": 4, 749 | "nbformat_minor": 0 750 | } --------------------------------------------------------------------------------