├── Installation
├── README.md
├── wiki_topic_model.py
├── wiki_parser.py
└── wiki_topic_cluster.py

/Installation:
--------------------------------------------------------------------------------
Topic modelling requires the following installations.

1. Anaconda Distribution.
   Download and install the Python 2.7 version (Windows/Linux, 32-bit/64-bit) from https://www.anaconda.com/download/

2. NLTK data.
   Run the following statements in Python to download and install the NLTK data and corpora on your system:
   >>> import nltk
   >>> nltk.download()
   In the window that pops up, choose "All packages" from the "Collections" tab and press "Download".

3. Gensim library for topic modelling.
   For Windows: pip install -U gensim
   For Linux: pip install --upgrade gensim
   (If that fails, make sure you are installing into a writeable location, or use sudo.)
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Topic-Modelling-on-Wiki-corpus
This project uses the Latent Dirichlet Allocation (LDA) algorithm to discover hidden topics in articles. The model is trained on 60,000 articles taken from the Simple English Wikipedia corpus and is then used to extract the topic of a given input text article.

The whole topic-modelling application is performed in 3 steps. The purpose is to build the system from scratch and give readers an insight into its implementation.

The 3 steps are:
1. Creating an article corpus of 70,000-80,000 articles from the Simple Wikipedia XML dump file (done by wiki_parser.py).
2. Automatically discovering hidden topics from the 60,000 training articles (done by wiki_topic_model.py).
3. Performing applications such as article clustering, retrieving articles related to a specific word, and extracting the theme/topic of an article based on the topics discovered in step 2 (done by wiki_topic_cluster.py).

The best way to follow along is the accompanying series of blog posts, which describe the steps for performing topic modelling from scratch:

Part 1
https://appliedmachinelearning.wordpress.com/2017/08/28/topic-modelling-part-1-creating-article-corpus-from-simple-wikipedia-dump/

Part 2
https://appliedmachinelearning.wordpress.com/2017/09/28/topic-modelling-part-2-discovering-topics-from-articles-with-latent-dirichlet-allocation/

Part 3
https://appliedmachinelearning.wordpress.com/2017/10/13/topic-modelling-part-3-document-clustering-exploration-theme-extraction-from-simplewiki-articles/
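
## Quick usage (sketch)

Once the corpus has been built (`wiki_parser.py`) and the model trained (`wiki_topic_model.py`, which writes `lda_model_sym_wiki.pkl`), the trained model can be queried directly. The snippet below is a minimal sketch in Python 2, matching the scripts in this repository; the cleaning step loosely mirrors the `clean`/`clean_doc` functions in the scripts, and the example sentence is purely illustrative.

```python
import cPickle
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

stop = set(stopwords.words('english'))
lemma = WordNetLemmatizer()

# Load the LDA model pickled by wiki_topic_model.py
ldamodel = cPickle.load(open("lda_model_sym_wiki.pkl", "rb"))

# Clean the input text roughly the way the training articles were cleaned
text = "The band released a new studio album and toured with live music shows"
tokens = [lemma.lemmatize(w, 'v') for w in text.lower().split()
          if w not in stop and len(w) > 2]

# Convert the tokens to bag-of-words form using the model's own dictionary
bow = ldamodel.id2word.doc2bow(tokens)

# Show topics with probability >= 0.20, most probable first
for topic_id, prob in sorted(ldamodel.get_document_topics(bow, minimum_probability=0.20),
                             key=lambda t: -t[1]):
    print topic_id, prob, ldamodel.print_topic(topic_id, topn=5)
```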
--------------------------------------------------------------------------------
/wiki_topic_model.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
@author: Abhijeet Kumar
"""

import os
import random
import codecs
import cPickle
from gensim.models.ldamodel import LdaModel as Lda
from gensim import corpora
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer


# Remove stop words from a document, lemmatize verbs and drop tokens shorter than 3 characters.
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    normalized = " ".join(lemma.lemmatize(word, 'v') for word in stop_free.split())
    x = normalized.split()
    y = [s for s in x if len(s) > 2]
    return y


corpus_path = "articles-corpus/"
article_paths = [os.path.join(corpus_path, p) for p in os.listdir(corpus_path)]

# Read the contents of all the articles into the list "doc_complete"
doc_complete = []
for path in article_paths:
    fp = codecs.open(path, 'r', 'utf-8')
    doc_content = fp.read()
    fp.close()
    doc_complete.append(doc_content)

# Randomly sample 70,000 articles from the corpus created in the 1st blog post (wiki_parser.py)
docs_all = random.sample(doc_complete, 70000)
docs = open("docs_wiki.pkl", 'wb')
cPickle.dump(docs_all, docs)
docs.close()

# Use the first 60,000 articles for training.
docs_train = docs_all[:60000]

# Clean all 60,000 simplewiki training articles
stop = set(stopwords.words('english'))
lemma = WordNetLemmatizer()
doc_clean = [clean(doc) for doc in docs_train]

# Create the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Filter out terms which occur in fewer than 4 articles or in more than 40% of the articles
dictionary.filter_extremes(no_below=4, no_above=0.4)

# Some content-neutral words which have to be removed from the dictionary
stoplist = set('also use make people know many call include part find become like mean often different \
                usually take wikt come give well get since type list say change see refer actually iii \
                aisne kinds pas ask would way something need things want every str'.split())
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id]
dictionary.filter_tokens(stop_ids)

#words,ids = dictionary.filter_n_most_frequent(50)
#print words,"\n\n",ids

# Convert the list of documents (corpus) into a document-term matrix using the dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
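
# Illustration: each entry of doc_term_matrix is one article in sparse bag-of-words form,
# i.e. a list of (token_id, count) pairs such as [(0, 2), (5, 1), ...].
# Uncomment for a quick sanity check of the dictionary and matrix sizes:
#print len(dictionary), len(doc_term_matrix), doc_term_matrix[0][:10]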

# Create the LDA model object with gensim and train it on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=50, id2word=dictionary, passes=50, iterations=500)
ldafile = open('lda_model_sym_wiki.pkl', 'wb')
cPickle.dump(ldamodel, ldafile)
ldafile.close()

# Print all 50 topics
for topic in ldamodel.print_topics(num_topics=50, num_words=10):
    print topic[0]+1, " ", topic[1], "\n"
--------------------------------------------------------------------------------
/wiki_parser.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
@author: Abhijeet
"""

import os
import xml.etree.ElementTree as ET
import codecs
import re

def is_ascii(s):
    return all(ord(c) < 128 for c in s)

tree = ET.parse('simplewiki-20170201-pages-articles-multistream.xml')
root = tree.getroot()
dir_path = 'articles-corpus/'
if not os.path.exists(dir_path):
    os.makedirs(dir_path)

for i, page in enumerate(root.findall('{http://www.mediawiki.org/xml/export-0.10/}page')):
    for p in page:
        if p.tag == "{http://www.mediawiki.org/xml/export-0.10/}revision":
            for x in p:
                if x.tag == "{http://www.mediawiki.org/xml/export-0.10/}text":
                    article_txt = x.text
                    if article_txt is not None:
                        # Keep only the text before the first section heading, then strip wiki markup.
                        article_txt = article_txt[ : article_txt.find("==")]
                        article_txt = re.sub(r"{{.*}}", "", article_txt)
                        article_txt = re.sub(r"\[\[File:.*\]\]", "", article_txt)
                        article_txt = re.sub(r"\[\[Image:.*\]\]", "", article_txt)
                        article_txt = re.sub(r"\n: \'\'.*", "", article_txt)
                        article_txt = re.sub(r"\n!.*", "", article_txt)
                        article_txt = re.sub(r"^:\'\'.*", "", article_txt)
                        article_txt = re.sub(r"&nbsp", "", article_txt)
                        article_txt = re.sub(r"http\S+", "", article_txt)
                        article_txt = re.sub(r"\d+", "", article_txt)
                        article_txt = re.sub(r"\(.*\)", "", article_txt)
                        article_txt = re.sub(r"Category:.*", "", article_txt)
                        article_txt = re.sub(r"\| .*", "", article_txt)
                        article_txt = re.sub(r"\n\|.*", "", article_txt)
                        article_txt = re.sub(r"\n \|.*", "", article_txt)
                        article_txt = re.sub(r".* \|\n", "", article_txt)
                        article_txt = re.sub(r".*\|\n", "", article_txt)
                        article_txt = re.sub(r"{{Infobox.*", "", article_txt)
                        article_txt = re.sub(r"{{infobox.*", "", article_txt)
                        article_txt = re.sub(r"{{taxobox.*", "", article_txt)
                        article_txt = re.sub(r"{{Taxobox.*", "", article_txt)
                        article_txt = re.sub(r"{{ Infobox.*", "", article_txt)
                        article_txt = re.sub(r"{{ infobox.*", "", article_txt)
                        article_txt = re.sub(r"{{ taxobox.*", "", article_txt)
                        article_txt = re.sub(r"{{ Taxobox.*", "", article_txt)
                        article_txt = re.sub(r"\* .*", "", article_txt)
                        article_txt = re.sub(r"<.*>", "", article_txt)
                        article_txt = re.sub(r"\n", "", article_txt)
                        article_txt = re.sub(r"\!|\"|\#|\$|\%|\&|\'|\(|\)|\*|\+|\,|\-|\.|\/|\:|\;|\<|\=|\>|\?|\@|\[|\\|\]|\^|\_|\`|\{|\||\}|\~", " ", article_txt)
                        article_txt = re.sub(r" +", " ", article_txt)
                        article_txt = article_txt.replace(u'\xa0', u' ')

                        # Keep only reasonably long, ASCII-only articles.
                        if article_txt != "" and len(article_txt) > 150 and is_ascii(article_txt):
                            outfile = dir_path + str(i+1) + "_article.txt"
                            f = codecs.open(outfile, "w", "utf-8")
                            f.write(article_txt)
                            f.close()
                            print article_txt
                            print '\n=================================================================\n'
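
# After this script finishes, articles-corpus/ contains one plain-text file per kept article
# (roughly 70,000-80,000 files for the simplewiki-20170201 dump, as noted in the README);
# wiki_topic_model.py then samples 70,000 of them and trains the LDA model on 60,000.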
--------------------------------------------------------------------------------
/wiki_topic_cluster.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Created on Mon Jul 31 10:58:49 2017

@author: Abhijeet
"""


import cPickle
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from operator import itemgetter
import os

stop = set(stopwords.words('english'))
lemma = WordNetLemmatizer()

# Load the LDA model trained by wiki_topic_model.py
lda_fp = open("lda_model_sym_wiki.pkl", 'rb')
ldamodel = cPickle.load(lda_fp)

# Keep only the ASCII characters of a string.
def rem_ascii(s):
    return "".join([c for c in s if ord(c) < 128])

# Remove stop words, lemmatize verbs and drop tokens shorter than 3 characters.
def clean_doc(doc):
    doc_ascii = rem_ascii(doc)
    stop_free = " ".join([i for i in doc_ascii.lower().split() if i not in stop])
    normalized = " ".join(lemma.lemmatize(word, 'v') for word in stop_free.split())
    x = normalized.split()
    y = [s for s in x if len(s) > 2]
    return y

# Map the most probable LDA topic of a document to a hand-labelled theme name.
def get_theme(doc):
    # Hand-assigned labels for the 50 discovered topics (index = topic id);
    # "unknown" marks topics without a clear theme.
    topics = "Electrical_systems_or_Education unknown music unknown Software \
              International_event Literature War_or_Church Lingual_or_Research Biology \
              Waterbody Wikipedia_or_Icehockey unknown unknown html_tags sports TV_shows \
              Terms_and_Services music US_states Timeline Chemistry Germany Location_area \
              Film_awards Games US_school unknown Railways Biography Directions_Australia \
              France India_Pakistan Canada_politics_or_WWE Politics unknown British_Royal_Family \
              American_Movies unknown Colors_or_Birds Fauna Chinese_Military unknown unknown \
              unknown unknown unknown html_tags US_Govt Music_band".split()

    theme = ""
    cleandoc = clean_doc(doc)
    doc_bow = ldamodel.id2word.doc2bow(cleandoc)
    doc_topics = ldamodel.get_document_topics(doc_bow, minimum_probability=0.20)
    if doc_topics:
        doc_topics.sort(key=itemgetter(1), reverse=True)
        theme = topics[doc_topics[0][0]]
        if theme == "unknown" and len(doc_topics) > 1:
            theme = topics[doc_topics[1][0]]
    else:
        theme = "unknown"
    return theme


# Print the "top" articles of "corpus" whose dominant topic matches the dominant topic of "term".
def get_related_documents(term, top, corpus):
    print "-------------------", top, " top articles related to ", term, "-----------------------"
    clean_docs = [clean_doc(doc) for doc in corpus]
    related_docid = []
    test_term = [ldamodel.id2word.doc2bow(doc) for doc in clean_docs]
    doc_topics = ldamodel.get_document_topics(test_term, minimum_probability=0.20)
    term_topics = ldamodel.get_term_topics(term, minimum_probability=0.000001)
    for k, topics in enumerate(doc_topics):
        if topics:
            topics.sort(key=itemgetter(1), reverse=True)
            if topics[0][0] == term_topics[0][0]:
                related_docid.append((k, topics[0][1]))

    related_docid.sort(key=itemgetter(1), reverse=True)
    for j, doc_id in enumerate(related_docid):
        print docs_test[doc_id[0]], "\n", doc_id[1], "\n"
        if j == (top-1):
            break


# Write each article of "corpus" into a sub-directory of "dirname" named after its dominant topic.
def cluster_similar_documents(corpus, dirname):
    clean_docs = [clean_doc(doc) for doc in corpus]
    test_term = [ldamodel.id2word.doc2bow(doc) for doc in clean_docs]
    doc_topics = ldamodel.get_document_topics(test_term, minimum_probability=0.20)
    for k, topics in enumerate(doc_topics):
        if topics:
            topics.sort(key=itemgetter(1), reverse=True)
            dir_name = dirname + "/" + str(topics[0][0])
            file_name = dir_name + "/" + str(k) + ".txt"
            if not os.path.exists(dir_name):
                os.makedirs(dir_name)
            fp = open(file_name, "w")
            fp.write(docs_test[k] + "\n\n" + str(topics[0][1]))
            fp.close()
        else:
            if not os.path.exists(dirname + "/unknown"):
                os.makedirs(dirname + "/unknown")
            file_name = dirname + "/unknown/" + str(k) + ".txt"
            fp = open(file_name, "w")
            fp.write(docs_test[k])
            fp.close()
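
# --- Driver / demo section ---
# The 10,000 articles left out of training (indices 60000 onwards of the 70,000 sampled
# in docs_wiki.pkl) serve as a held-out test set for the demos below: finding articles
# related to a query word, clustering articles into per-topic directories under "root",
# and extracting a theme from a new piece of text.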
docs_fp = open("docs_wiki.pkl", 'rb')
docs_all = cPickle.load(docs_fp)
docs_test = docs_all[60000:]


get_related_documents("music", 5, docs_test)
cluster_similar_documents(docs_test, "root")

article = "Mohandas Karamchand Gandhi[14] was born on 2 October 1869[1] to a \
Hindu Modh Baniya family[15] in Porbandar (also known as Sudamapuri), \
a coastal town on the Kathiawar Peninsula and then part of the \
small princely state of Porbandar in the Kathiawar Agency of the \
Indian Empire. His father, Karamchand Uttamchand Gandhi (1822–1885), \
served as the diwan (chief minister) of Porbandar state.[16] Although \
he only had an elementary education and had previously been a clerk \
in the state administration, Karamchand proved a capable chief minister.[17] \
During his tenure, Karamchand married four times. His first two wives \
died young, after each had given birth to a daughter, and his third \
marriage was childless. In 1857, Karamchand sought his third wife's \
permission to remarry; that year, he married Putlibai (1844–1891), \
who also came from Junagadh,[18] and was from a Pranami Vaishnava \
family.[19][20][21][22] Karamchand and Putlibai had three children \
over the ensuing decade, a son, Laxmidas (c. 1860 – March 1914), a \
daughter, Raliatbehn (1862–1960) and another son, Karsandas (c. 1866–1913)"
print article, "\n"

print "Theme -> ", get_theme(article)
--------------------------------------------------------------------------------