├── .DS_Store
├── CTR-Rank.png
├── CTR-Lenght.png
├── Elbowmethod.png
├── CTR-Impressions.png
├── www.uselessthingstobuy.com.xlsx
├── Kmeans.py
└── README.md

/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sundios/Kmeans-SEO/HEAD/.DS_Store
--------------------------------------------------------------------------------
/CTR-Rank.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sundios/Kmeans-SEO/HEAD/CTR-Rank.png
--------------------------------------------------------------------------------
/CTR-Lenght.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sundios/Kmeans-SEO/HEAD/CTR-Lenght.png
--------------------------------------------------------------------------------
/Elbowmethod.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sundios/Kmeans-SEO/HEAD/Elbowmethod.png
--------------------------------------------------------------------------------
/CTR-Impressions.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sundios/Kmeans-SEO/HEAD/CTR-Impressions.png
--------------------------------------------------------------------------------
/www.uselessthingstobuy.com.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sundios/Kmeans-SEO/HEAD/www.uselessthingstobuy.com.xlsx
--------------------------------------------------------------------------------
/Kmeans.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Nov 11 16:08:29 2019

@author: kburch1
"""


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans


# Opening the xlsx export
df = pd.read_excel(r'www.uselessthingstobuy.com.xlsx')

# Optional: round positions to whole ranks
# df['Position'] = round(df['Position'])

# Drop rows with missing values
df = df.dropna()
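
# --- Optional note (not part of the original script): feature scaling ---
# K-means is distance based, so columns on very different scales (Impressions
# in the thousands vs CTR below 1) can dominate the clustering. If you want to
# try clustering on scaled features, a minimal sketch with scikit-learn's
# StandardScaler looks like this; pass the scaled array to KMeans instead of
# the raw DataFrame used below:
#
#   from sklearn.preprocessing import StandardScaler
#   X2_scaled = StandardScaler().fit_transform(df[['Title Length', 'CTR']])
#   kmeans = KMeans(n_clusters=3).fit(X2_scaled)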

#### CTR vs Title Length ####
X2 = pd.DataFrame(df, columns=['Title Length', 'CTR'])

X2 = X2.dropna()

# Finding the right number of clusters for k-means (elbow method)
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X2)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# K-means
kmeans = KMeans(n_clusters=3).fit(X2)
centroids = kmeans.cluster_centers_
print(centroids)

# Plotting
plt.title('Title Length vs CTR')
plt.scatter(df['Title Length'], df['CTR'], c=kmeans.labels_.astype(float), s=50, alpha=0.5, label='URLs')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50, label='Centroid')
plt.xlabel('Title Length')
plt.ylabel('CTR')
plt.legend()
plt.show()


#### CTR vs Rankings ####
X3 = pd.DataFrame(df, columns=['Position', 'CTR'])

# Replace any remaining NaN values with 0 (returns a NumPy array)
X3 = np.nan_to_num(X3)

# Finding the right number of clusters for k-means (elbow method)
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X3)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# K-means
kmeans = KMeans(n_clusters=3).fit(X3)
centroids = kmeans.cluster_centers_
print(centroids)

# Plotting
plt.title('Rank vs CTR')
plt.scatter(df['Position'], df['CTR'], c=kmeans.labels_.astype(float), s=50, alpha=0.5, label='URLs')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50, label='Centroid')
plt.ylabel('CTR')
plt.xlabel('Rank')
plt.legend()
plt.show()


#### CTR vs Impressions ####
X4 = pd.DataFrame(df, columns=['Impressions', 'CTR'])

# Finding the right number of clusters for k-means (elbow method)
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X4)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# K-means
kmeans = KMeans(n_clusters=3).fit(X4)
centroids = kmeans.cluster_centers_
print(centroids)

# Plotting
plt.title('Impressions vs CTR')
plt.scatter(df['Impressions'], df['CTR'], c=kmeans.labels_.astype(float), s=50, alpha=0.5, label='URLs')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50, label='Centroid')
plt.ylabel('CTR')
plt.xlabel('Impressions')
plt.legend()
plt.show()


#### CTR vs Clicks ####
X5 = pd.DataFrame(df, columns=['Clicks', 'CTR'])

# Finding the right number of clusters for k-means (elbow method)
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X5)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# K-means
kmeans = KMeans(n_clusters=3).fit(X5)
centroids = kmeans.cluster_centers_
print(centroids)

# Plotting
plt.title('Clicks vs CTR')
plt.scatter(df['Clicks'], df['CTR'], c=kmeans.labels_.astype(float), s=50, alpha=0.5, label='URLs')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50, label='Centroid')
plt.ylabel('CTR')
plt.xlabel('Clicks')
plt.legend()
plt.show()


# Correlation plots
import seaborn as sns

palette = sns.color_palette("bright")

# correlations
sns.pairplot(df)
plt.show()

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# SEO - Kmeans clustering

In this project I will show, step by step, how to use Google Search Console data and k-means clustering to group your URLs by a specific KPI. This surfaces useful insights, helps you optimize your site, gives you a more analytical overview of it, and supports data-driven decisions.

## Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

### Prerequisites

To be able to use this you will need the following prerequisites:

```
Python 3
Google Search Console access
Screaming Frog
```

## Building your Data Set

### Getting the Data from GSC

The first step is to get access to [Google Search Console](https://search.google.com/search-console) for your own site.
Once in the console, go to Performance and export a page report that includes the following columns:

| URL | Clicks | Impressions | CTR | Position |
| --- | ------ | ----------- | --- | -------- |

You can select whatever time frame you want and add the filters that make sense for your project.
If you need more data (more than 1,000 URLs), you can use the Google Search Console API to make bigger requests.

### Getting Title and Description Lengths

Once we have the list of URLs, we open Screaming Frog, set Mode to List, and paste our list in manually.

Here we only need to export the title lengths and the description lengths. Once we have them, we can do a VLOOKUP to add them to the Google Search Console export we made before; the spreadsheet should look like this:

| URL | Title Length | Description Length | Clicks | Impressions | CTR | Position |
| --- | ------------ | ------------------ | ------ | ----------- | --- | -------- |

### Cleaning the data with Excel

Since I'm lazy, I will use Excel to clean a few things in the data set. The only important change is removing the % sign from the CTR column. Also make sure that all the numeric columns contain real numbers.
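
If you would rather skip the Excel step, the same cleanup can be done in pandas right after loading the export. A minimal sketch, assuming your column names match the table above:

```python
import pandas as pd

# Load the combined GSC + Screaming Frog export.
df = pd.read_excel(r'www.uselessthingstobuy.com.xlsx')

# Strip the % sign from CTR and force the numeric columns to real numbers.
df['CTR'] = df['CTR'].astype(str).str.rstrip('%').astype(float)
for col in ['Clicks', 'Impressions', 'Position', 'Title Length', 'Description Length']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Drop anything that could not be parsed.
df = df.dropna()
```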

### Installing Python Libraries

Now that we have our data set ready, we need to install the libraries we will be using.
To install them, copy the following and run it in the terminal:

```
pip install pandas
pip install numpy
pip install scikit-learn
pip install matplotlib
pip install seaborn
pip install openpyxl
```

(`seaborn` is used for the correlation plots at the end of Kmeans.py, and `openpyxl` lets pandas read .xlsx files.)

## Running the script

### Opening The Data Set

The first step in the script is opening the data set. Make sure you are in the folder that contains the xlsx file (the data set we just built from GSC and Screaming Frog), then open the file:

```python
# Opening xlsx
df = pd.read_excel(r'www.uselessthingstobuy.com.xlsx')
```

### Selecting the columns

After opening the file we need to decide what kind of experiment we are running. In this case I want to compare title length against CTR, so I can see what the optimal title length is based on the clicks I am getting.

For this we create a new DataFrame that contains the two columns we just mentioned, Title Length and CTR.

```python
#### CTR vs Title Length ####
X2 = pd.DataFrame(df, columns=['Title Length', 'CTR'])
```

### Finding the right number of clusters for Kmeans

K-means clusters your data into groups. To find the optimal number of clusters, we run the elbow method. This tells us, based on our data, how many clusters we should use. Make sure you use the name of your own DataFrame; in our case it is X2.

```python
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X2)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
```

The result should look something like this:

![Elbow Method](Elbowmethod.png)

The optimal number of clusters is where the elbow bends, so in this case it is 3.

### Kmeans

Next we perform the k-means clustering with the number of clusters we got from the elbow method.

```python
kmeans = KMeans(n_clusters=3).fit(X2)
centroids = kmeans.cluster_centers_
print(centroids)
```

### Plotting

Now that we have run everything, we want to see how our data is clustered. To plot the results we run the following:

```python
plt.title('Title Length vs CTR')
plt.scatter(df['Title Length'], df['CTR'], c=kmeans.labels_.astype(float), s=50, alpha=0.5, label='URLs')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50, label='Centroid')
plt.xlabel('Title Length')
plt.ylabel('CTR')
plt.legend()
plt.show()
```

![Graph](CTR-Lenght.png)

### Insights

My data is pretty messy, but based on the graph we can still see that the titles with the highest CTR are between 35 and 55 characters long.

### Example 2: CTR vs Impressions

![Graph](CTR-Impressions.png)

This example shows how well our pages are performing in general. High impressions with low CTR means we need to do something about those pages: we are ranking well and getting impressions, but users are not clicking on our results. Why? Maybe a weak call to action, or something else entirely. This graph helps spot that kind of problem.

Also, in this example we need to remove outliers. My data had one big outlier with over 87K impressions, so I simply removed it from the spreadsheet before making this graph.
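
If you would rather filter outliers in pandas than delete rows by hand in the spreadsheet, here is a minimal sketch; the quantile cutoff is just an assumed example threshold, not a rule:

```python
import pandas as pd

# Drop extreme Impressions outliers before clustering Impressions vs CTR.
df = pd.read_excel(r'www.uselessthingstobuy.com.xlsx')
cutoff = df['Impressions'].quantile(0.99)   # or a fixed cap that fits your data
df = df[df['Impressions'] <= cutoff]
```
--------------------------------------------------------------------------------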