├── .DS_Store
├── CTR-Rank.png
├── CTR-Lenght.png
├── Elbowmethod.png
├── CTR-Impressions.png
├── www.uselessthingstobuy.com.xlsx
├── Kmeans.py
└── README.md

/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sundios/Kmeans-SEO/HEAD/.DS_Store
--------------------------------------------------------------------------------
/CTR-Rank.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sundios/Kmeans-SEO/HEAD/CTR-Rank.png
--------------------------------------------------------------------------------
/CTR-Lenght.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sundios/Kmeans-SEO/HEAD/CTR-Lenght.png
--------------------------------------------------------------------------------
/Elbowmethod.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sundios/Kmeans-SEO/HEAD/Elbowmethod.png
--------------------------------------------------------------------------------
/CTR-Impressions.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sundios/Kmeans-SEO/HEAD/CTR-Impressions.png
--------------------------------------------------------------------------------
/www.uselessthingstobuy.com.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sundios/Kmeans-SEO/HEAD/www.uselessthingstobuy.com.xlsx
--------------------------------------------------------------------------------
/Kmeans.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Nov 11 16:08:29 2019

@author: kburch1
"""


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans


# Opening the xlsx export
df = pd.read_excel(r'www.uselessthingstobuy.com.xlsx')

# Optional: round positions to whole ranks
# df['Position'] = round(df['Position'])

# Drop rows with missing values
df = df.dropna()
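
# --- Optional note (not part of the original script): feature scaling ---
# K-means is distance based, so columns on very different scales (Impressions
# in the thousands vs CTR below 1) can dominate the clustering. If you want to
# try clustering on scaled features, a minimal sketch with scikit-learn's
# StandardScaler looks like this; pass the scaled array to KMeans instead of
# the raw DataFrame used below:
#
#   from sklearn.preprocessing import StandardScaler
#   X2_scaled = StandardScaler().fit_transform(df[['Title Length', 'CTR']])
#   kmeans = KMeans(n_clusters=3).fit(X2_scaled)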

#### CTR vs Title Length ####
X2 = pd.DataFrame(df, columns=['Title Length', 'CTR'])

X2 = X2.dropna()

# Finding the right number of clusters for k-means (elbow method)
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X2)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# K-means
kmeans = KMeans(n_clusters=3).fit(X2)
centroids = kmeans.cluster_centers_
print(centroids)

# Plotting
plt.title('Title Length vs CTR')
plt.scatter(df['Title Length'], df['CTR'], c=kmeans.labels_.astype(float), s=50, alpha=0.5, label='URLs')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50, label='Centroid')
plt.xlabel('Title Length')
plt.ylabel('CTR')
plt.legend()
plt.show()


#### CTR vs Rankings ####
X3 = pd.DataFrame(df, columns=['Position', 'CTR'])

# Replace any remaining NaN values with 0 (returns a NumPy array)
X3 = np.nan_to_num(X3)

# Finding the right number of clusters for k-means (elbow method)
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X3)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# K-means
kmeans = KMeans(n_clusters=3).fit(X3)
centroids = kmeans.cluster_centers_
print(centroids)

# Plotting
plt.title('Rank vs CTR')
plt.scatter(df['Position'], df['CTR'], c=kmeans.labels_.astype(float), s=50, alpha=0.5, label='URLs')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50, label='Centroid')
plt.ylabel('CTR')
plt.xlabel('Rank')
plt.legend()
plt.show()


#### CTR vs Impressions ####
X4 = pd.DataFrame(df, columns=['Impressions', 'CTR'])

# Finding the right number of clusters for k-means (elbow method)
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X4)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# K-means
kmeans = KMeans(n_clusters=3).fit(X4)
centroids = kmeans.cluster_centers_
print(centroids)

# Plotting
plt.title('Impressions vs CTR')
plt.scatter(df['Impressions'], df['CTR'], c=kmeans.labels_.astype(float), s=50, alpha=0.5, label='URLs')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50, label='Centroid')
plt.ylabel('CTR')
plt.xlabel('Impressions')
plt.legend()
plt.show()


#### CTR vs Clicks ####
X5 = pd.DataFrame(df, columns=['Clicks', 'CTR'])

# Finding the right number of clusters for k-means (elbow method)
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X5)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# K-means
kmeans = KMeans(n_clusters=3).fit(X5)
centroids = kmeans.cluster_centers_
print(centroids)

# Plotting
plt.title('Clicks vs CTR')
plt.scatter(df['Clicks'], df['CTR'], c=kmeans.labels_.astype(float), s=50, alpha=0.5, label='URLs')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50, label='Centroid')
plt.ylabel('CTR')
plt.xlabel('Clicks')
plt.legend()
plt.show()


# Correlation plots
import seaborn as sns

palette = sns.color_palette("bright")

# correlations
sns.pairplot(df)
plt.show()

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# SEO - Kmeans clustering

In this project I will show, step by step, how to use Google Search Console data and k-means clustering to group your URLs by a specific KPI. This surfaces useful insights, helps you optimize your site, gives you a more analytical overview of it, and supports data-driven decisions.

## Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

### Prerequisites

To be able to use this you will need the following prerequisites:

```
Python 3
Google Search Console access
Screaming Frog
```

## Building your Data Set

### Getting the Data from GSC

The first step is to get access to [Google Search Console](https://search.google.com/search-console) for your own site.
Once in the console, go to Performance and export a page report that includes the following columns:

| URL | Clicks | Impressions | CTR | Position |
| --- | ------ | ----------- | --- | -------- |

You can select whatever time frame you want and add the filters that make sense for your project.
If you need more data (more than 1,000 URLs), you can use the Google Search Console API to make bigger requests.

### Getting Title and Description Lengths

Once we have the list of URLs, we open Screaming Frog, set Mode to List, and paste our list in manually.

Here we only need to export the title lengths and the description lengths. Once we have them, we can do a VLOOKUP to add them to the Google Search Console export we made before; the spreadsheet should look like this:

| URL | Title Length | Description Length | Clicks | Impressions | CTR | Position |
| --- | ------------ | ------------------ | ------ | ----------- | --- | -------- |

### Cleaning the data with Excel

Since I'm lazy, I will use Excel to clean a few things in the data set. The only important change is removing the % sign from the CTR column. Also make sure that all the numeric columns contain real numbers.
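
If you would rather skip the Excel step, the same cleanup can be done in pandas right after loading the export. A minimal sketch, assuming your column names match the table above:

```python
import pandas as pd

# Load the combined GSC + Screaming Frog export.
df = pd.read_excel(r'www.uselessthingstobuy.com.xlsx')

# Strip the % sign from CTR and force the numeric columns to real numbers.
df['CTR'] = df['CTR'].astype(str).str.rstrip('%').astype(float)
for col in ['Clicks', 'Impressions', 'Position', 'Title Length', 'Description Length']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Drop anything that could not be parsed.
df = df.dropna()
```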

### Installing Python Libraries

Now that we have our data set ready, we need to install the libraries we will be using.
To install them, copy the following and run it in the terminal:

```
pip install pandas
pip install numpy
pip install scikit-learn
pip install matplotlib
pip install seaborn
pip install openpyxl
```

(`seaborn` is used for the correlation plots at the end of Kmeans.py, and `openpyxl` lets pandas read .xlsx files.)

## Running the script

### Opening The Data Set

The first step in the script is opening the data set. Make sure you are in the folder that contains the xlsx file (the data set we just built from GSC and Screaming Frog), then open the file:

```python
# Opening xlsx
df = pd.read_excel(r'www.uselessthingstobuy.com.xlsx')
```

### Selecting the columns

After opening the file we need to decide what kind of experiment we are running. In this case I want to compare title length against CTR, so I can see what the optimal title length is based on the clicks I am getting.

For this we create a new DataFrame that contains the two columns we just mentioned, Title Length and CTR.

```python
#### CTR vs Title Length ####
X2 = pd.DataFrame(df, columns=['Title Length', 'CTR'])
```

### Finding the right number of clusters for Kmeans

K-means clusters your data into groups. To find the optimal number of clusters, we run the elbow method. This tells us, based on our data, how many clusters we should use. Make sure you use the name of your own DataFrame; in our case it is X2.

```python
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X2)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
```

The result should look something like this:

![Elbow Method](Elbowmethod.png)

The optimal number of clusters is where the elbow bends, so in this case it is 3.

### Kmeans

Next we perform the k-means clustering with the number of clusters we got from the elbow method.

```python
kmeans = KMeans(n_clusters=3).fit(X2)
centroids = kmeans.cluster_centers_
print(centroids)
```

### Plotting

Now that we have run everything, we want to see how our data is clustered. To plot the results we run the following:

```python
plt.title('Title Length vs CTR')
plt.scatter(df['Title Length'], df['CTR'], c=kmeans.labels_.astype(float), s=50, alpha=0.5, label='URLs')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50, label='Centroid')
plt.xlabel('Title Length')
plt.ylabel('CTR')
plt.legend()
plt.show()
```

![Graph](CTR-Lenght.png)

### Insights

My data is pretty messy, but based on the graph we can still see that the titles with the highest CTR are between 35 and 55 characters long.

### Example 2: CTR vs Impressions

![Graph](CTR-Impressions.png)

This example shows how well our pages are performing in general. High impressions with low CTR means we need to do something about those pages: we are ranking well and getting impressions, but users are not clicking on our results. Why? Maybe a weak call to action, or something else entirely. This graph helps spot that kind of problem.

Also, in this example we need to remove outliers. My data had one big outlier with over 87K impressions, so I simply removed it from the spreadsheet before making this graph.
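
If you would rather filter outliers in pandas than delete rows by hand in the spreadsheet, here is a minimal sketch; the quantile cutoff is just an assumed example threshold, not a rule:

```python
import pandas as pd

# Drop extreme Impressions outliers before clustering Impressions vs CTR.
df = pd.read_excel(r'www.uselessthingstobuy.com.xlsx')
cutoff = df['Impressions'].quantile(0.99)   # or a fixed cap that fits your data
df = df[df['Impressions'] <= cutoff]
```
--------------------------------------------------------------------------------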