├── Elbow_Validation.py
├── Elbow_curve.jpg
├── README.md
└── Unsupervised_clustering.py

--------------------------------------------------------------------------------
/Elbow_Validation.py:
--------------------------------------------------------------------------------
# Checking for the optimal number of clusters

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit K-means for each candidate k and record the inertia
# (within-cluster sum of squared distances) of the fitted model.
# X_transformed is the TF-IDF matrix built in Unsupervised_clustering.py.
Sum_of_squared_distances = []
K = range(1, 150)
for k in K:
    km = KMeans(n_clusters=k)
    km.fit(X_transformed)
    Sum_of_squared_distances.append(km.inertia_)

# Plot inertia against k; the "elbow" of the curve suggests the optimal k.
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()

--------------------------------------------------------------------------------
/Elbow_curve.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rohithramesh1991/Unsupervised-Text-Clustering/adf8e07ce9e75db1a236d1160ef7e3162b98affa/Elbow_curve.jpg

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Unsupervised-Text-Clustering

Step 1: Vectorization

A TF-IDF vectorizer is used to build the vocabulary.
TF-IDF is the product of how frequently a word occurs in a document and how rare that word is across the entire corpus.
The ngram_range parameter controls whether the vocabulary is built from one-word, two-word, or longer phrases, depending on the requirement.

Step 2: K-means clustering

K-means groups similar data points together to discover underlying patterns.
To achieve this, K-means looks for a fixed number (k) of clusters in the dataset.
The algorithm identifies k centroids and then assigns every data point to the nearest one.
The "means" in K-means refers to averaging the data, that is, computing each centroid.

Step 3: Validating the number of clusters with the elbow method

The elbow method helps select the optimal number of clusters by fitting the model over a range of values for k. If the resulting line chart resembles an arm, the "elbow" (the point of inflection on the curve) is a good indication that the underlying model fits best at that point.
The total WSS (within-cluster sum of squares) measures the compactness of the clustering, and we want it to be as small as possible.
The elbow method looks at the total WSS as a function of the number of clusters: choose a number of clusters such that adding one more cluster no longer improves the total WSS by much.

How it is calculated:
* For each k, compute the total within-cluster sum of squares (WSS).
* Plot WSS against the number of clusters k.
* The location of a bend (knee) in the plot is generally taken as an indicator of the appropriate number of clusters.

For a detailed explanation, see https://medium.com/@rohithramesh1991/unsupervised-text-clustering-using-natural-language-processing-nlp-1a8bc18b048d

## Note:
Please don't copy the code and execute it as-is: X_train is not defined in these scripts, so you will hit an unresolved-name error. Use these functions on your own data, so that X_train comes from your dataset, as in the sketch below.
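
A minimal end-to-end sketch of how the pieces fit together. The variable names and the tiny sample corpus here are illustrative placeholders, not part of this repository; substitute your own documents for X_train:

```python
# Illustrative only: supply your own documents as X_train,
# vectorize them with TF-IDF, then cluster with K-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

X_train = [
    "the cat sat on the mat",
    "dogs are loyal pets",
    "stock markets fell sharply today",
    "investors worry about rising inflation",
]  # placeholder corpus -- replace with your data

tfidf = TfidfVectorizer(ngram_range=(1, 3))
X_transformed = tfidf.fit_transform(X_train)

# Small k for this toy corpus; on real data, pick k via the elbow curve.
km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42)
labels = km.fit_predict(X_transformed)
print(labels)  # cluster id assigned to each document
```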

--------------------------------------------------------------------------------
/Unsupervised_clustering.py:
--------------------------------------------------------------------------------
# Text pre-processing
"""Removes punctuation, digits, and stopwords, lemmatizes the remaining
words, and returns them as a list of tokens."""
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

# Cleaning the text

def text_process(text):
    '''
    Takes in a string of text, then performs the following:
    1. Removes all punctuation
    2. Removes all digits
    3. Removes all stopwords
    4. Lemmatizes the remaining words and returns them as a list
    '''
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))  # set lookup is fast
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    nopunc = ''.join(char for char in nopunc if not char.isdigit())
    # Lowercase before the stopword check so capitalized stopwords are caught too
    words = [word.lower() for word in nopunc.split() if word.lower() not in stop_words]
    return [lemmatizer.lemmatize(word) for word in words]

# Vectorisation

from sklearn.feature_extraction.text import TfidfVectorizer

# X_train is your own collection of raw text documents (see README note).
# Note: scikit-learn ignores ngram_range when analyzer is a callable.
tfidfconvert = TfidfVectorizer(analyzer=text_process, ngram_range=(1, 3)).fit(X_train)

X_transformed = tfidfconvert.transform(X_train)

# Clustering the training sentences with the K-means technique

from sklearn.cluster import KMeans

modelkmeans = KMeans(n_clusters=60, init='k-means++', n_init=100)
modelkmeans.fit(X_transformed)

--------------------------------------------------------------------------------