├── .gitignore
├── README.md
├── embed_wiki_data.py
├── knowledge_graph_example.png
├── main.py
├── source.py
├── source_list.py
├── text_extractor.py
└── wiki_knowledge_graph.py

/.gitignore:
--------------------------------------------------------------------------------
1 | religion_wiki_data.pkl
2 | .DS_Store
3 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Wikipedia Knowledge Graphs
2 | ## Summary
3 | Given a list of topics, ideally Wikipedia pages, we construct and display a knowledge graph of the key entities and their connecting relationships.
4 | 
5 | ## Purpose
6 | Wikipedia is the world's encyclopedia, containing (as far as I'm concerned) the sum total of all human knowledge. Advances in natural language processing
7 | make it possible to mine this repository for the concepts and relationships it contains, and to embed that information
8 | in a structured graph that can be queried to answer questions.
9 | 
10 | ## Setup
11 | To set up the environment, we recommend using conda:
12 | 
13 | ```
14 | conda create -n wiki_kg python=3.7
15 | ```
16 | 
17 | Next, activate the environment:
18 | 
19 | ```
20 | conda activate wiki_kg
21 | ```
22 | 
23 | And install the following libraries (the code also loads a spaCy model, which you can download with `python -m spacy download en_core_web_lg`):
24 | 
25 | ```
26 | pip install spacy==2.1.0
27 | pip install neuralcoref
28 | pip install wikipedia-api
29 | pip install networkx
30 | pip install pandas
31 | pip install matplotlib
32 | ```
33 | 
34 | ## Results
35 | The output of running the main program (`python main.py`) should be a graph that looks something like the following:
36 | 
37 | ![Example knowledge graph](https://github.com/nateburley/WikiKnowledgeGraphs/blob/master/knowledge_graph_example.png)
38 | 
39 | ## Further Work
40 | - Productionize this by creating a web page where users can select topics from a drop-down menu, and render the knowledge graph in their browser
41 | - For large datasets, the results could possibly be improved by using Word2Vec embeddings and cosine similarities to drop irrelevant terms. We could train
42 | the model on all the pages used and then drop terms with low co-occurrences.
43 | - Something else that could be an interesting side project, perhaps on its own: inferring semantic rules from similar dependency trees. The article (linked
44 | below) gives an introduction, but it would be interesting to put some thought into mathematically determining how similar given trees are. This could
45 | actually even involve some Case-Based Reasoning:
46 |     - Take some common trees (hyper/hyponyms, etc.) and make rules for them
47 |     - Next, for a sentence that doesn't fit into a pre-defined rule, figure out the most similar dependency tree, and
48 |     modify the rule as needed to fit the new sentence!
49 |     - That rule then goes into our list of rules, and the iteration proceeds
50 | 
51 | On that note, here's an algorithm for comparing trees: https://arxiv.org/pdf/1508.03381.pdf
52 | 
53 | - Lots of random pages that aren't that relevant get scraped. Can those nodes be dropped later, since in theory they should be "less connected" than the
54 | other nodes that are actually "on topic"? (See the sketch below.)
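
Here is a minimal sketch of that last idea, using networkx (already a dependency). The helper name `prune_weak_nodes` and the `min_degree` threshold are placeholders rather than part of the current code; `pairs` is the entity-pair DataFrame returned by `extract_all_relations`:

```
import networkx as nx

def prune_weak_nodes(pairs, min_degree=3):
    """Drop entities whose total degree is below min_degree (hypothetical helper, threshold needs tuning)."""
    # Build the same MultiDiGraph that draw_KG builds from the subject/object columns
    k_graph = nx.from_pandas_edgelist(pairs, 'subject', 'object', create_using=nx.MultiDiGraph())
    # Collect nodes with fewer connections than the threshold and remove them
    weak_nodes = [node for node, degree in dict(k_graph.degree()).items() if degree < min_degree]
    k_graph.remove_nodes_from(weak_nodes)
    return k_graph
```

The same check could be applied inside `draw_KG` right after the graph is built, so that off-topic pages picked up during scraping don't clutter the plot.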
55 | 
56 | ## Sources and Further Reading
57 | - https://www.analyticsvidhya.com/blog/2019/10/how-to-build-knowledge-graph-text-using-spacy/
58 | - https://www.analyticsvidhya.com/blog/2020/06/nlp-project-information-extraction/
59 | - https://www.analyticsvidhya.com/blog/2019/09/introduction-information-extraction-python-spacy/
60 | - https://realpython.com/natural-language-processing-spacy-python/
61 | - https://programmerbackpack.com/python-knowledge-graph-understanding-semantic-relationships/
62 | 
--------------------------------------------------------------------------------
/embed_wiki_data.py:
--------------------------------------------------------------------------------
1 | """
2 | Functions used to create word embeddings of the text scraped from Wikipedia. The embeddings will be explored
3 | for sentiment, relations, and analogy-making, and the named-entity relationships they capture will be compared
4 | against the knowledge graph to explore how the two structures differ from each other.
5 | 
6 | Author: Nathaniel M. Burley
7 | """
8 | 
--------------------------------------------------------------------------------
/knowledge_graph_example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nateburley/WikiKnowledgeGraphs/e39e22bdad68863edf1e64a310db76d718bb659e/knowledge_graph_example.png
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | ########################################################################################################################
2 | ## DO THE THING: SCRAPE WIKIPEDIA, APPLY NLP, AND BUILD A KNOWLEDGE GRAPH
3 | # TODO: Lots of random pages that aren't that relevant get scraped. Can those nodes be dropped later, since in theory
4 | # they should be "less connected" or "more weakly connected" than the other nodes that are actually "on topic"?
5 | ########################################################################################################################
6 | 
7 | from wiki_knowledge_graph import *
8 | 
9 | 
10 | if __name__ == '__main__':
11 |     # Scrape Wikipedia for a list of topics
12 |     wiki_data = wiki_scrape(['Catholic Church', 'Islam', 'Russian Orthodox Church', 'Judaism', 'Buddhism', 'Panpsychism', 'UFO religion'])
13 |     print("WIKIPEDIA SCRAPE DF LENGTH: {}".format(len(wiki_data.index)))
14 |     print(wiki_data.head(25))
15 |     print("\n")
16 | 
17 |     # Pickle the wiki_data so we don't have to re-scrape on every run (the .pkl file is listed in .gitignore)
18 |     with open('religion_wiki_data.pkl', 'wb') as datafile:
19 |         pickle.dump(wiki_data, datafile)
20 | 
21 |     # Load the wiki data back in, so we don't have to scrape
22 |     with open('religion_wiki_data.pkl', 'rb') as infile:
23 |         wiki_data = pickle.load(infile)
24 | 
25 |     # Get subject-object relationships (which form the vertices and edges in the graph)
26 |     # TODO: Parallelize (thread) this
27 |     #   - https://realpython.com/intro-to-python-threading/
28 |     all_pairs = extract_all_relations(wiki_data)
29 |     print("ENTITY PAIRS-- SUBJECT/OBJECT RELATIONSHIPS LENGTH: {}".format(len(all_pairs.index)))
30 |     print(all_pairs.head(20))
31 |     print(all_pairs.tail(20))
32 |     print("\n")
33 | 
34 |     # Draw the graph
35 |     draw_KG(all_pairs)
--------------------------------------------------------------------------------
/source.py:
--------------------------------------------------------------------------------
1 | """
2 | Class that contains a single "source", i.e. 
one Wikipedia page 3 | """ 4 | import pandas as pd 5 | 6 | class Source: 7 | # Function that initializes a source object 8 | def __init__(self, title, text, link, categories, topic): 9 | self.page_title = title 10 | self.text = text 11 | self.link = link 12 | self.categories = categories 13 | self.topic = topic 14 | 15 | 16 | ## GETTERS 17 | # Function that returns the title 18 | def getTitle(self): 19 | return self.page_title 20 | 21 | 22 | # Function that returns the text 23 | def getText(self): 24 | return self.text 25 | 26 | 27 | # Function that returns the link 28 | def getLink(self): 29 | return self.link 30 | 31 | 32 | # Function that returns the categories 33 | def getCategories(self): 34 | return self.categories 35 | 36 | 37 | # Function that returns the topic 38 | def getTopic(self): 39 | return self.topic 40 | 41 | 42 | ## SETTERS 43 | # Function that sets the title 44 | def setTitle(self, new_title): 45 | self.page_title = new_title 46 | 47 | 48 | # Function that sets the text 49 | def setText(self, new_text): 50 | self.text = new_text 51 | 52 | 53 | # Function that sets the link 54 | def setLink(self, new_link): 55 | self.link = new_link 56 | 57 | 58 | # Function that sets the categories 59 | def setCategories(self, new_category): 60 | self.categories = new_category 61 | 62 | 63 | # Function that sets the topic 64 | def setTopic(self, new_topic): 65 | self.topic = new_topic 66 | 67 | 68 | ## CREATE AND RETURN DATA FRAME, IF NEEDED 69 | def getDF(self): 70 | return pd.DataFrame({'title': self.page_title, 'text': self.text, 'link': self.link,\ 71 | 'categories': self.categories, 'topic': self.topic}) 72 | 73 | 74 | 75 | ## HELPER FUNCTION THAT TURNS A DATA FRAME INTO A LIST OF SOURCES -------------------------------------------------------------------------------- /source_list.py: -------------------------------------------------------------------------------- 1 | """ 2 | Class that holds a list of sources. 
Created by scraping Wikipedia 3 | """ 4 | import wikipediaapi 5 | import pandas as pd 6 | import concurrent.futures 7 | from tqdm import tqdm 8 | from source import Source 9 | 10 | class SourceList: 11 | def __init__(self, source_list=[]): 12 | self.source_list = source_list 13 | 14 | # Function that checks if a source has already been added 15 | def checkSourceAdded(self, new_source): 16 | for existing_source in self.source_list: 17 | if new_source.page_title == existing_source.page_title: 18 | print("The page '{}' has already been added!".format(existing_source.page_title)) 19 | return True 20 | else: 21 | return False 22 | 23 | # Function that checks if a page (not a Source) has been added 24 | def checkPageAdded(self, new_title): 25 | for existing_source in self.source_list: 26 | if new_title == existing_source.page_title: 27 | print("The page [title] '{}' has already been added!".format(existing_source.page_title)) 28 | return True 29 | else: 30 | return False 31 | 32 | # Function that adds a new source 33 | def addSource(self, new_source): 34 | if not (self.checkSourceAdded(new_source) or self.checkPageAdded(new_source.page_title)): 35 | self.source_list.append(new_source) 36 | 37 | # Function that scrapes Wikipedia to build the source list 38 | #TODO: Add logic from other file to build sources 39 | def buildSourceList(self, titles, verbose=True): 40 | def wikiLink(link): 41 | try: 42 | page = wiki_api.page(link) 43 | if page.exists(): 44 | return {'page': link, 'text': page.text, 'link': page.fullurl, 45 | 'categories': list(page.categories.keys())} 46 | except: 47 | return None 48 | 49 | for current_title in titles: 50 | if not self.checkPageAdded(current_title): 51 | wiki_api = wikipediaapi.Wikipedia(language='en', extract_format=wikipediaapi.ExtractFormat.WIKI) 52 | page_name = wiki_api.page(current_title) 53 | if not page_name.exists(): 54 | print('Page {} does not exist.'.format(page_name)) 55 | return 56 | 57 | page_links = list(page_name.links.keys()) 58 | print_description = "Links Scraped for page '{}'".format(current_title) 59 | progress = tqdm(desc=print_description, unit='', total=len(page_links)) if verbose else None 60 | current_source = Source(page_name.title, page_name.text, page_name.fullurl, list(page_name.categories.keys()), page_name) 61 | 62 | # Parallelize the scraping, to speed it up (?) 63 | with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor: 64 | future_link = {executor.submit(wikiLink, link): link for link in page_links} 65 | for future in concurrent.futures.as_completed(future_link): 66 | data = future.result() 67 | current_source.append(data) if data else None 68 | progress.update(1) if verbose else None 69 | progress.close() if verbose else None 70 | 71 | namespaces = ('Wikipedia', 'Special', 'Talk', 'LyricWiki', 'File', 'MediaWiki', 'Template', 'Help', 'User', \ 72 | 'Category talk', 'Portal talk') 73 | current_source = current_source[(len(current_source['text']) > 20) & ~(current_source['page'].str.startswith(namespaces, na=True))] 74 | current_source['categories'] = current_source.categories.apply(lambda x: [y[9:] for y in x]) 75 | current_source['topic'] = page_name 76 | print('Wikipedia pages scraped so far:', len(current_source)) -------------------------------------------------------------------------------- /text_extractor.py: -------------------------------------------------------------------------------- 1 | """ 2 | Scrapes Wikipedia for a given list of topics. 
3 | 4 | TODO: Add function that checks if a source has been scraped already 5 | TODO: Figure out if this should run in parallel using Pool 6 | """ 7 | import wikipediaapi 8 | import pandas as pd 9 | import concurrent.futures 10 | from tqdm import tqdm 11 | 12 | 13 | class Sources: 14 | def __init__(self, verbose=True, pages=[]): 15 | self.pages = pages 16 | self.sources = pd.DataFrame(columns=['page', 'text', 'link', 'categories', 'topic']) 17 | self.verbose = verbose 18 | 19 | # TODO: Add function here the checks if a source (or topic) has been scraped already 20 | # Something like "if current_topic in self.sources['']: return True" 21 | def alreadyScraped(self, page_name): 22 | if self.sources['page'].str.contains(page_name).any(): 23 | print("Page '{}' has been scraped already!".format(page_name)) 24 | return True 25 | else: 26 | return False 27 | 28 | def extract(self): 29 | def wikiLink(link): 30 | try: 31 | page = wiki_api.page(link) 32 | if page.exists(): 33 | return {'page': link, 'text': page.text, 'link': page.fullurl, 34 | 'categories': list(page.categories.keys())} 35 | except: 36 | return None 37 | 38 | for page_name in self.pages: 39 | if not self.alreadyScraped(page_name): 40 | wiki_api = wikipediaapi.Wikipedia(language='en', extract_format=wikipediaapi.ExtractFormat.WIKI) 41 | page_name = wiki_api.page(page_name) 42 | if not page_name.exists(): 43 | print('Page {} does not exist.'.format(page_name)) 44 | return 45 | 46 | page_links = list(page_name.links.keys()) 47 | progress = tqdm(desc='Links Scraped', unit='', total=len(page_links)) if self.verbose else None 48 | current_source = [{'page': page_name, 'text': page_name.text, 'link': page_name.fullurl, 49 | 'categories': list(page_name.categories.keys())}] 50 | 51 | # Parallelize the scraping, to speed it up (?) 52 | with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor: 53 | future_link = {executor.submit(wikiLink, link): link for link in page_links} 54 | for future in concurrent.futures.as_completed(future_link): 55 | data = future.result() 56 | self.sources.append(data) if data else None 57 | progress.update(1) if self.verbose else None 58 | progress.close() if self.verbose else None 59 | 60 | namespaces = ('Wikipedia', 'Special', 'Talk', 'LyricWiki', 'File', 'MediaWiki', 'Template', 'Help', 'User', \ 61 | 'Category talk', 'Portal talk') 62 | current_source = self.sources[(len(self.sources['text']) > 20) & ~(self.sources['page'].str.startswith(namespaces, na=True))] 63 | current_source['categories'] = self.sources.categories.apply(lambda x: [y[9:] for y in x]) 64 | current_source['topic'] = page_name 65 | print('Wikipedia pages scraped so far:', len(self.sources)) 66 | 67 | 68 | return self.sources 69 | -------------------------------------------------------------------------------- /wiki_knowledge_graph.py: -------------------------------------------------------------------------------- 1 | """ 2 | This file mines Wikipedia articles for syntax and keywords, and builds a knowledge graph. 3 | 4 | Author: Nathaniel M. Burley 5 | 6 | Notes: 7 | - Look at Word2Vec, and then embedding that "vector space" as a knowledge graph and 8 | preserve the relations somehow... 
9 | 
10 | Sources:
11 |     - https://towardsdatascience.com/auto-generated-knowledge-graphs-92ca99a81121
12 |     - Really useful overview, with real-world examples and lots of pictures (for class paper):
13 |       https://usc-isi-i2.github.io/slides/2018-02-aaai-tutorial-constructing-kgs.pdf
14 | 
15 | """
16 | 
17 | # Wikipedia scraping libraries
18 | from typing import List, Tuple
19 | from pandas.core.frame import DataFrame
20 | import wikipediaapi  # pip install wikipedia-api
21 | import pandas as pd
22 | import concurrent.futures
23 | from tqdm import tqdm
24 | # NLP/Computational Linguistics libraries
25 | import re
26 | import spacy
27 | import neuralcoref
28 | # Graph libraries
29 | import networkx as nx
30 | import matplotlib.pyplot as plt
31 | # Miscellaneous
32 | import pickle
33 | 
34 | # Shared Wikipedia API client (module-level so helper functions like get_link can use it)
35 | wiki_api = wikipediaapi.Wikipedia(language='en', extract_format=wikipediaapi.ExtractFormat.WIKI)
36 | ########################################################################################################################
37 | ## SCRAPE WIKIPEDIA REGARDING A TOPIC
38 | ########################################################################################################################
39 | 
40 | # Function that gets all unique links in an array of pages
41 | def get_page_links(page_names: list) -> list:
42 |     page_links = []
43 |     for page_name in page_names:
44 |         new_links = list(page_name.links.keys())
45 |         page_links = list(set().union(page_links, new_links))
46 | 
47 |     print("\n\nLINKS: {}".format(page_links))
48 |     return page_links
49 | 
50 | 
51 | # Function that gets a page object for a link, or None
52 | def get_link(link):
53 |     try:
54 |         page = wiki_api.page(link)
55 |         if page.exists():
56 |             return {'page': link, 'text': page.text, 'link': page.fullurl, 'categories': list(page.categories.keys())}
57 |     except Exception:
58 |         return None
59 | 
60 | 
61 | # Function that removes invalid pages from a list
62 | # TODO: Annotate what the lists take in the arguments
63 | def remove_null_pages(page_names: list,
64 |                       topic_names: list) -> Tuple[list, list]:
65 | 
66 |     # Build filtered copies rather than removing items from the lists while iterating over them
67 |     valid_pages, valid_topics = [], []
68 |     for page_name, topic_name in zip(page_names, topic_names):
69 | 
70 |         # If the page doesn't exist, skip it
71 |         if not page_name.exists():
72 |             print(f'Page \'{topic_name}\' does not exist')
73 | 
74 |         # Otherwise keep the page and its topic, and display the page name
75 |         else:
76 |             print(f'Verified page \'{page_name}\'')
77 |             valid_pages.append(page_name)
78 |             valid_topics.append(topic_name)
79 | 
80 |     # Return the updated pages and topic names
81 |     return valid_pages, valid_topics
82 | 
83 | 
84 | # Function that scrapes all the pages in our list
85 | # TODO: Annotate this
86 | def scrape_all_pages(topics: list, pages: list, links: list, verbose: bool = True) -> DataFrame:
87 |     progress = tqdm(desc='Links Scraped', unit='', total=len(links)) if verbose else None
88 |     sources = [{'page': topic_name, 'text': page_name.text, 'link': page_name.fullurl, \
89 |                 'categories': list(page_name.categories.keys())} for topic_name, page_name in zip(topics, pages)]
90 | 
91 |     # Parallelize the scraping, to speed it up
92 |     with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
93 |         future_link = {executor.submit(get_link, link): link for link in links}
94 |         for future in concurrent.futures.as_completed(future_link):
95 |             data = future.result()
96 |             sources.append(data) if data else None
97 |             progress.update(1) if verbose else None
98 |         progress.close() if verbose else None
99 | 
100 |     # Define namespaces (Wikipedia under the MediaWiki umbrella)
101 |     namespaces = ('Wikipedia', 'Special', 'Talk', 'LyricWiki', 'File', 
'MediaWiki', 'Template', 'Help', 'User', \ 102 | 'Category talk', 'Portal talk') 103 | sources = pd.DataFrame(sources) 104 | sources = sources[(len(sources['text']) > 20) & ~(sources['page'].str.startswith(namespaces, na=True))] 105 | sources['categories'] = sources.categories.apply(lambda x: [y[9:] for y in x]) 106 | #sources['topic'] = topic_name # Seems redundant? Since 'page' is already set to topic_name? 107 | print('Wikipedia pages scraped:', len(sources)) 108 | 109 | return sources 110 | 111 | 112 | # Function that scrapes Wikipedia for a given topic 113 | def wiki_scrape(topic_names, verbose=True): 114 | """ 115 | Function that scrapes Wikipedia for text. 116 | - Takes in a list of topic names, i.e. page names 117 | - Scrapes text from: 118 | 1) All the pages 119 | 2) All the links in the pages 120 | Into a dataframe 121 | - Returns the dataframe 122 | Args: 123 | topic_names: List of strings. Titles of Wikipedia pages, ex. ['Miami Dolphins', 'World War II'] 124 | Must be an exact match, otherwise no page will be returned. 125 | Outputs: 126 | sources: Pandas dataframe of 127 | """ 128 | 129 | # Establish Wikipedia API connection 130 | wiki_api = wikipediaapi.Wikipedia(language='en', extract_format=wikipediaapi.ExtractFormat.WIKI) 131 | 132 | # Build a list of page objects, i.e. getting all the pages each topic name corresponds to 133 | page_names = [wiki_api.page(topic_name) for topic_name in topic_names] 134 | 135 | # Remove all NULL pages from the list (i.e. the topic doesn't match an actual Wikipedia page) 136 | page_names, topic_names = remove_null_pages(page_names=page_names, topic_names=topic_names) 137 | 138 | # Get links in all the pages 139 | page_links = get_page_links(page_names) 140 | 141 | # Scrape all the pages and links 142 | sources = scrape_all_pages(topics=topic_names, pages=page_names, links=page_links) 143 | 144 | return sources 145 | 146 | 147 | 148 | ######################################################################################################################## 149 | ## COMPUTATIONAL LINGUISTICS/NLP FUNCTIONING 150 | # This section contains functions that do dependency parsing and find subject-predicate-object dependencies 151 | # TODO: Look into coreference resolution to remove redundancies, normalize, etc. [adds a neural net here] 152 | """ 153 | NOTES: 154 | - Sources for information extraction, and spacy: 155 | - https://www.analyticsvidhya.com/blog/2019/10/how-to-build-knowledge-graph-text-using-spacy/ 156 | - https://www.analyticsvidhya.com/blog/2020/06/nlp-project-information-extraction/ 157 | - https://www.analyticsvidhya.com/blog/2019/09/introduction-information-extraction-python-spacy/ 158 | - https://realpython.com/natural-language-processing-spacy-python/ 159 | - https://programmerbackpack.com/python-knowledge-graph-understanding-semantic-relationships/ 160 | 161 | - Word2Vec similarity could be used to drop irrelevant terms 162 | - Overview of text embeddings: https://jalammar.github.io/illustrated-word2vec/ 163 | - Train the algorithm on all the pages used and then drop terms with low co-occurences (?) 164 | That might only work for large datasets, but...could be worth looking into! 165 | 166 | - Honestly, I should probably re-write this from scratch using better logic and sentence structure 167 | for the relationship tuples. 
I can hand-code lots of simple rules, and maybe find libraries with more 168 | 169 | * Something else that could be an interesting side project, perhaps on its own: inferring semantic rules from similar 170 | dependency trees! The article (linked below) gives an introduction, but it would be interesting to put some thought 171 | into mathematically determining how similar given trees are...this could actually even involve some CBR: 172 | - Take some common trees (hyper/hyponyms, etc.) and make rules for them 173 | - Next, for a sentence that doesn't fit into a pre-defined rule, figure out the most similar dependency tree, and 174 | modify the rule as needed to fit the new sentence! 175 | - That rule then goes into our list of rules, and the iteration proceeds 176 | This could be a frickin' paper. "Automatic Semantic Rule Inference Using Case Based Reasoning" 177 | - On that note, here's an algorithm for comparing trees: https://arxiv.org/pdf/1508.03381.pdf 178 | """ 179 | ######################################################################################################################## 180 | 181 | #nlp = spacy.load('en_core_web_sm') 182 | nlp = spacy.load('en_core_web_lg') 183 | neuralcoref.add_to_pipe(nlp) 184 | 185 | # spacy.util.filter_spans 186 | def filter_spans(spans): 187 | # Filter a sequence of spans so they don't contain overlaps 188 | # For spaCy 2.1.4+: this function is available as spacy.util.filter_spans() 189 | get_sort_key = lambda span: (span.end - span.start, -span.start) 190 | sorted_spans = sorted(spans, key=get_sort_key, reverse=True) 191 | result = [] 192 | seen_tokens = set() 193 | for span in sorted_spans: 194 | # Check for end - 1 here because boundaries are inclusive 195 | if span.start not in seen_tokens and span.end - 1 not in seen_tokens: 196 | result.append(span) 197 | seen_tokens.update(range(span.start, span.end)) 198 | result = sorted(result, key=lambda span: span.start) 199 | return result 200 | 201 | # Function that gets pairs of subjects and relationships 202 | def get_entity_pairs(text, coref=True): 203 | # preprocess text 204 | text = re.sub(r'\n+', '.', text) # replace multiple newlines with period 205 | text = re.sub(r'\[\d+\]', ' ', text) # remove reference numbers 206 | text = nlp(text) 207 | if coref: 208 | text = nlp(text._.coref_resolved) # resolve coreference clusters 209 | 210 | def refine_entities(ent, sent): 211 | unwanted_tokens = ( 212 | 'PRON', # pronouns 213 | 'PART', # particle 214 | 'DET', # determiner 215 | 'SCONJ', # subordinating conjunction 216 | 'PUNCT', # punctuation 217 | 'SYM', # symbol 218 | 'X', # other 219 | ) 220 | ent_type = ent.ent_type_ # get entity type 221 | if ent_type == '': 222 | ent_type = 'NOUN_CHUNK' 223 | ent = ' '.join(str(t.text) for t in nlp(str(ent)) if t.pos_ not in unwanted_tokens and t.is_stop == False) 224 | 225 | elif ent_type in ('NOMINAL', 'CARDINAL', 'ORDINAL') and str(ent).find(' ') == -1: 226 | refined = '' 227 | for i in range(len(sent) - ent.i): 228 | if ent.nbor(i).pos_ not in ('VERB', 'PUNCT'): 229 | refined += ' ' + str(ent.nbor(i)) 230 | else: 231 | ent = refined.strip() 232 | break 233 | 234 | return ent, ent_type 235 | 236 | sentences = [sent.string.strip() for sent in text.sents] # split text into sentences 237 | ent_pairs = [] 238 | for sent in sentences: 239 | sent = nlp(sent) 240 | spans = list(sent.ents) + list(sent.noun_chunks) # collect nodes 241 | spans = filter_spans(spans) 242 | with sent.retokenize() as retokenizer: 243 | [retokenizer.merge(span, attrs={'tag': span.root.tag, 
'dep': span.root.dep}) for span in spans] 244 | deps = [token.dep_ for token in sent] 245 | 246 | # limit our example to simple sentences with one subject and object 247 | # if (deps.count('obj') + deps.count('dobj')) != 1 or (deps.count('subj') + deps.count('nsubj')) != 1: 248 | # continue 249 | 250 | for token in sent: 251 | if token.dep_ not in ('obj', 'dobj'): # identify object nodes 252 | continue 253 | subject = [w for w in token.head.lefts if w.dep_ in ('subj', 'nsubj')] # identify subject nodes 254 | if subject: 255 | subject = subject[0] 256 | # identify relationship by root dependency 257 | relation = [w for w in token.ancestors if w.dep_ == 'ROOT'] 258 | if relation: 259 | relation = relation[0] 260 | # add adposition or particle to relationship 261 | try: 262 | if relation.nbor(1).pos_ in ('ADP', 'PART'): 263 | relation = ' '.join((str(relation), str(relation.nbor(1)))) 264 | except: 265 | print("Failed at line 207") 266 | return 267 | else: 268 | relation = 'unknown' 269 | 270 | subject, subject_type = refine_entities(subject, sent) 271 | token, object_type = refine_entities(token, sent) 272 | 273 | ent_pairs.append([str(subject), str(relation), str(token), str(subject_type), str(object_type)]) 274 | 275 | ent_pairs = [sublist for sublist in ent_pairs if not any(str(ent) == '' for ent in sublist)] 276 | pairs = pd.DataFrame(ent_pairs, columns=['subject', 'relation', 'object', 'subject_type', 'object_type']) 277 | #print('Entity pairs extracted:', str(len(ent_pairs))) 278 | 279 | return pairs 280 | 281 | # Function that extracts ALL the pairs. Not just the first one smh 282 | def extract_all_relations(wiki_data): 283 | all_pairs = [] 284 | for i in range(0, 100): 285 | pairs = get_entity_pairs(wiki_data.loc[i,'text']) 286 | all_pairs.append(pairs) 287 | print("Made it through {} iterations".format(i)) 288 | all_pairs_df = pd.concat(all_pairs) 289 | print("Successfully extracted {} entity pairs".format(len(all_pairs_df.index))) 290 | return all_pairs_df 291 | 292 | 293 | 294 | ######################################################################################################################## 295 | ## FUNCTION THAT DRAWS, PLOTS THE KNOWLEDGE GRAPH 296 | ######################################################################################################################## 297 | 298 | # Function that plots and draws a knowledge graph 299 | def draw_KG(pairs): 300 | k_graph = nx.from_pandas_edgelist(pairs, 'subject', 'object', create_using=nx.MultiDiGraph()) 301 | node_deg = nx.degree(k_graph) 302 | layout = nx.spring_layout(k_graph, k=0.15, iterations=20) 303 | plt.figure(num=None, figsize=(120, 90), dpi=80) 304 | nx.draw_networkx( 305 | k_graph, 306 | node_size=[int(deg[1]) * 1000 for deg in node_deg], 307 | arrowsize=20, 308 | linewidths=1.5, 309 | pos=layout, 310 | edge_color='red', 311 | edgecolors='black', 312 | node_color='green', 313 | ) 314 | labels = dict(zip(list(zip(pairs.subject, pairs.object)), pairs['relation'].tolist())) 315 | print(labels) 316 | nx.draw_networkx_edge_labels(k_graph, pos=layout, edge_labels=labels, font_color='black') 317 | plt.axis('off') 318 | plt.show() 319 | plt.savefig('church_knowledge_graph.png') 320 | 321 | 322 | # Function that plots a "subgraph", if the main graph is too messy 323 | # Example use: filter_graph(pairs, 'Directed graphs') 324 | def filter_graph(pairs, node): 325 | k_graph = nx.from_pandas_edgelist(pairs, 'subject', 'object', create_using=nx.MultiDiGraph()) 326 | edges = nx.dfs_successors(k_graph, node) 327 | nodes = [] 
328 |     for k, v in edges.items():
329 |         nodes.extend([k])
330 |         nodes.extend(v)
331 |     subgraph = k_graph.subgraph(nodes)
332 |     layout = nx.random_layout(subgraph)  # lay out only the subgraph's nodes
333 |     nx.draw_networkx(
334 |         subgraph,
335 |         node_size=1000,
336 |         arrowsize=20,
337 |         linewidths=1.5,
338 |         pos=layout,
339 |         edge_color='red',
340 |         edgecolors='black',
341 |         node_color='white'
342 |     )
343 |     labels = dict(zip(list(zip(pairs.subject, pairs.object)), pairs['relation'].tolist()))
344 |     edges = tuple(subgraph.out_edges(data=False))
345 |     sublabels = {k: labels[k] for k in edges}
346 |     nx.draw_networkx_edge_labels(subgraph, pos=layout, edge_labels=sublabels, font_color='red')
347 |     plt.axis('off')
348 |     plt.savefig('church_knowledge_graph.png')  # save before show(), which clears the figure
349 |     plt.show()
--------------------------------------------------------------------------------
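
Relatedly, `embed_wiki_data.py` is still a stub, and both the README and the notes in `wiki_knowledge_graph.py` raise the idea of using Word2Vec similarities to filter irrelevant terms. Below is a minimal sketch of what that module might contain, assuming gensim is added to the environment (`pip install gensim`); the function name and parameters are illustrative, and the keyword arguments follow gensim 4.x (older releases use `size` instead of `vector_size`):

```
import re
from gensim.models import Word2Vec

def train_wiki_word2vec(wiki_data, vector_size=100, min_count=5):
    """Train Word2Vec on the scraped page texts; min_count drops rare, low co-occurrence terms."""
    sentences = []
    for text in wiki_data['text']:
        # Crude sentence split and tokenization; spaCy could do this more carefully
        for sent in re.split(r'[.!?]\s+', text):
            tokens = [tok.lower() for tok in re.findall(r"[A-Za-z']+", sent)]
            if tokens:
                sentences.append(tokens)
    return Word2Vec(sentences=sentences, vector_size=vector_size, window=5, min_count=min_count, workers=4)
```

Terms whose vectors sit far from the seed topics (low `model.wv.similarity(term, topic)`, for terms in the vocabulary) could then be dropped before entity-pair extraction, which is one way to attack the "irrelevant pages" problem noted throughout the code.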