├── Elbow_Validation.py
├── Elbow_curve.jpg
├── README.md
└── Unsupervised_clustering.py

--------------------------------------------------------------------------------
/Elbow_Validation.py:
--------------------------------------------------------------------------------
# Checking for the optimal number of clusters

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit K-means for each candidate k and record the inertia
# (within-cluster sum of squared distances) of the fitted model.
# X_transformed is the TF-IDF matrix built in Unsupervised_clustering.py.
Sum_of_squared_distances = []
K = range(1, 150)
for k in K:
    km = KMeans(n_clusters=k)
    km.fit(X_transformed)
    Sum_of_squared_distances.append(km.inertia_)

# Plot inertia against k; the "elbow" of the curve suggests the optimal k.
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()

--------------------------------------------------------------------------------
/Elbow_curve.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rohithramesh1991/Unsupervised-Text-Clustering/adf8e07ce9e75db1a236d1160ef7e3162b98affa/Elbow_curve.jpg

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Unsupervised-Text-Clustering

Step 1: Vectorization

A TF-IDF vectorizer is used to build the vocabulary.
TF-IDF is the product of how frequently a word occurs in a document and how rare that word is across the entire corpus.
The ngram_range parameter controls whether the vocabulary is built from one-word, two-word, or longer phrases, depending on the requirement.

Step 2: K-means clustering

K-means groups similar data points together to discover underlying patterns.
To achieve this, K-means looks for a fixed number (k) of clusters in the dataset.
The algorithm identifies k centroids and then assigns every data point to the nearest one.
The "means" in K-means refers to averaging the data, that is, computing each centroid.

Step 3: Validating the number of clusters with the elbow method

The elbow method helps select the optimal number of clusters by fitting the model over a range of values for k. If the resulting line chart resembles an arm, the "elbow" (the point of inflection on the curve) is a good indication that the underlying model fits best at that point.
The total WSS (within-cluster sum of squares) measures the compactness of the clustering, and we want it to be as small as possible.
The elbow method looks at the total WSS as a function of the number of clusters: choose a number of clusters such that adding one more cluster no longer improves the total WSS by much.

How it is calculated:
* For each k, compute the total within-cluster sum of squares (WSS).
* Plot WSS against the number of clusters k.
* The location of a bend (knee) in the plot is generally taken as an indicator of the appropriate number of clusters.

For a detailed explanation, see https://medium.com/@rohithramesh1991/unsupervised-text-clustering-using-natural-language-processing-nlp-1a8bc18b048d

## Note:
Please don't copy the code and execute it as-is: X_train is not defined in these scripts, so you will hit an unresolved-name error. Use these functions on your own data, so that X_train comes from your dataset, as in the sketch below.
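
A minimal end-to-end sketch of how the pieces fit together. The variable names and the tiny sample corpus here are illustrative placeholders, not part of this repository; substitute your own documents for X_train:

```python
# Illustrative only: supply your own documents as X_train,
# vectorize them with TF-IDF, then cluster with K-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

X_train = [
    "the cat sat on the mat",
    "dogs are loyal pets",
    "stock markets fell sharply today",
    "investors worry about rising inflation",
]  # placeholder corpus -- replace with your data

tfidf = TfidfVectorizer(ngram_range=(1, 3))
X_transformed = tfidf.fit_transform(X_train)

# Small k for this toy corpus; on real data, pick k via the elbow curve.
km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42)
labels = km.fit_predict(X_transformed)
print(labels)  # cluster id assigned to each document
```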

--------------------------------------------------------------------------------
/Unsupervised_clustering.py:
--------------------------------------------------------------------------------
# Text pre-processing
"""Removes punctuation, digits, and stopwords, lemmatizes the remaining
words, and returns them as a list of tokens."""
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

# Cleaning the text

def text_process(text):
    '''
    Takes in a string of text, then performs the following:
    1. Removes all punctuation
    2. Removes all digits
    3. Removes all stopwords
    4. Lemmatizes the remaining words and returns them as a list
    '''
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))  # set lookup is fast
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    nopunc = ''.join(char for char in nopunc if not char.isdigit())
    # Lowercase before the stopword check so capitalized stopwords are caught too
    words = [word.lower() for word in nopunc.split() if word.lower() not in stop_words]
    return [lemmatizer.lemmatize(word) for word in words]

# Vectorisation

from sklearn.feature_extraction.text import TfidfVectorizer

# X_train is your own collection of raw text documents (see README note).
# Note: scikit-learn ignores ngram_range when analyzer is a callable.
tfidfconvert = TfidfVectorizer(analyzer=text_process, ngram_range=(1, 3)).fit(X_train)

X_transformed = tfidfconvert.transform(X_train)

# Clustering the training sentences with the K-means technique

from sklearn.cluster import KMeans

modelkmeans = KMeans(n_clusters=60, init='k-means++', n_init=100)
modelkmeans.fit(X_transformed)

--------------------------------------------------------------------------------