├── .gitignore
├── README.md
├── embed_wiki_data.py
├── knowledge_graph_example.png
├── main.py
├── source.py
├── source_list.py
├── text_extractor.py
└── wiki_knowledge_graph.py

/.gitignore:
--------------------------------------------------------------------------------
1 | religion_wiki_data.pkl
2 | .DS_Store
3 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Wikipedia Knowledge Graphs
2 | ## Summary
3 | Given a list of topics, ideally Wikipedia pages, we construct and display a knowledge graph of the key entities and their connecting relationships.
4 | 
5 | ## Purpose
6 | Wikipedia is the world's encyclopedia, containing (as far as I'm concerned) the sum total of all human knowledge. Advances in natural language processing
7 | make it possible to mine this repository for the concepts and relationships it contains, and to embed that information
8 | in a structured graph that can be queried to answer questions.
9 | 
10 | ## Setup
11 | To set up the environment, we recommend using conda:
12 | 
13 | ```
14 | conda create -n wiki_kg python=3.7
15 | ```
16 | 
17 | Next, activate the environment:
18 | 
19 | ```
20 | conda activate wiki_kg
21 | ```
22 | 
23 | And install the following libraries (the code also loads a spaCy model, which you can download with `python -m spacy download en_core_web_lg`):
24 | 
25 | ```
26 | pip install spacy==2.1.0
27 | pip install neuralcoref
28 | pip install wikipedia-api
29 | pip install networkx
30 | pip install pandas
31 | pip install matplotlib
32 | ```
33 | 
34 | ## Results
35 | The output of running the main program (`python main.py`) should be a graph that looks something like the following:
36 | 
37 | ![Example knowledge graph](https://github.com/nateburley/WikiKnowledgeGraphs/blob/master/knowledge_graph_example.png)
38 | 
39 | ## Further Work
40 | - Productionize this by creating a web page where users can select topics from a drop-down menu, and render the knowledge graph in their browser
41 | - For large datasets, the results could possibly be improved by using Word2Vec embeddings and cosine similarities to drop irrelevant terms. We could train
42 | the model on all the pages used and then drop terms with low co-occurrences.
43 | - Something else that could be an interesting side project, perhaps on its own: inferring semantic rules from similar dependency trees. The article (linked
44 | below) gives an introduction, but it would be interesting to put some thought into mathematically determining how similar given trees are. This could
45 | actually even involve some Case-Based Reasoning:
46 |     - Take some common trees (hyper/hyponyms, etc.) and make rules for them
47 |     - Next, for a sentence that doesn't fit into a pre-defined rule, figure out the most similar dependency tree, and
48 |     modify the rule as needed to fit the new sentence!
49 |     - That rule then goes into our list of rules, and the iteration proceeds
50 | 
51 | On that note, here's an algorithm for comparing trees: https://arxiv.org/pdf/1508.03381.pdf
52 | 
53 | - Lots of random pages that aren't that relevant get scraped. Can those nodes be dropped later, since in theory they should be "less connected" than the
54 | other nodes that are actually "on topic"? (See the sketch below.)
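
Here is a minimal sketch of that last idea, using networkx (already a dependency). The helper name `prune_weak_nodes` and the `min_degree` threshold are placeholders rather than part of the current code; `pairs` is the entity-pair DataFrame returned by `extract_all_relations`:

```
import networkx as nx

def prune_weak_nodes(pairs, min_degree=3):
    """Drop entities whose total degree is below min_degree (hypothetical helper, threshold needs tuning)."""
    # Build the same MultiDiGraph that draw_KG builds from the subject/object columns
    k_graph = nx.from_pandas_edgelist(pairs, 'subject', 'object', create_using=nx.MultiDiGraph())
    # Collect nodes with fewer connections than the threshold and remove them
    weak_nodes = [node for node, degree in dict(k_graph.degree()).items() if degree < min_degree]
    k_graph.remove_nodes_from(weak_nodes)
    return k_graph
```

The same check could be applied inside `draw_KG` right after the graph is built, so that off-topic pages picked up during scraping don't clutter the plot.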
55 | 
56 | ## Sources and Further Reading
57 | - https://www.analyticsvidhya.com/blog/2019/10/how-to-build-knowledge-graph-text-using-spacy/
58 | - https://www.analyticsvidhya.com/blog/2020/06/nlp-project-information-extraction/
59 | - https://www.analyticsvidhya.com/blog/2019/09/introduction-information-extraction-python-spacy/
60 | - https://realpython.com/natural-language-processing-spacy-python/
61 | - https://programmerbackpack.com/python-knowledge-graph-understanding-semantic-relationships/
62 | 
--------------------------------------------------------------------------------
/embed_wiki_data.py:
--------------------------------------------------------------------------------
1 | """
2 | Functions used to create word embeddings of the text scraped from Wikipedia. The embeddings will be explored
3 | for sentiment, relations, and analogy-making, and the named-entity relationships they capture will be compared
4 | against the knowledge graph to explore how the two structures differ from each other.
5 | 
6 | Author: Nathaniel M. Burley
7 | """
8 | 
--------------------------------------------------------------------------------
/knowledge_graph_example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nateburley/WikiKnowledgeGraphs/e39e22bdad68863edf1e64a310db76d718bb659e/knowledge_graph_example.png
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | ########################################################################################################################
2 | ## DO THE THING: SCRAPE WIKIPEDIA, APPLY NLP, AND BUILD A KNOWLEDGE GRAPH
3 | # TODO: Lots of random pages that aren't that relevant get scraped. Can those nodes be dropped later, since in theory
4 | # they should be "less connected" or "more weakly connected" than the other nodes that are actually "on topic"?
5 | ########################################################################################################################
6 | 
7 | from wiki_knowledge_graph import *
8 | 
9 | 
10 | if __name__ == '__main__':
11 |     # Scrape Wikipedia for a list of topics
12 |     wiki_data = wiki_scrape(['Catholic Church', 'Islam', 'Russian Orthodox Church', 'Judaism', 'Buddhism', 'Panpsychism', 'UFO religion'])
13 |     print("WIKIPEDIA SCRAPE DF LENGTH: {}".format(len(wiki_data.index)))
14 |     print(wiki_data.head(25))
15 |     print("\n")
16 | 
17 |     # Pickle the wiki_data so we don't have to re-scrape on every run (the .pkl file is listed in .gitignore)
18 |     with open('religion_wiki_data.pkl', 'wb') as datafile:
19 |         pickle.dump(wiki_data, datafile)
20 | 
21 |     # Load the wiki data back in, so we don't have to scrape
22 |     with open('religion_wiki_data.pkl', 'rb') as infile:
23 |         wiki_data = pickle.load(infile)
24 | 
25 |     # Get subject-object relationships (which form the vertices and edges in the graph)
26 |     # TODO: Parallelize (thread) this
27 |     #   - https://realpython.com/intro-to-python-threading/
28 |     all_pairs = extract_all_relations(wiki_data)
29 |     print("ENTITY PAIRS-- SUBJECT/OBJECT RELATIONSHIPS LENGTH: {}".format(len(all_pairs.index)))
30 |     print(all_pairs.head(20))
31 |     print(all_pairs.tail(20))
32 |     print("\n")
33 | 
34 |     # Draw the graph
35 |     draw_KG(all_pairs)
--------------------------------------------------------------------------------
/source.py:
--------------------------------------------------------------------------------
1 | """
2 | Class that contains a single "source", i.e. 
one Wikipedia page 3 | """ 4 | import pandas as pd 5 | 6 | class Source: 7 | # Function that initializes a source object 8 | def __init__(self, title, text, link, categories, topic): 9 | self.page_title = title 10 | self.text = text 11 | self.link = link 12 | self.categories = categories 13 | self.topic = topic 14 | 15 | 16 | ## GETTERS 17 | # Function that returns the title 18 | def getTitle(self): 19 | return self.page_title 20 | 21 | 22 | # Function that returns the text 23 | def getText(self): 24 | return self.text 25 | 26 | 27 | # Function that returns the link 28 | def getLink(self): 29 | return self.link 30 | 31 | 32 | # Function that returns the categories 33 | def getCategories(self): 34 | return self.categories 35 | 36 | 37 | # Function that returns the topic 38 | def getTopic(self): 39 | return self.topic 40 | 41 | 42 | ## SETTERS 43 | # Function that sets the title 44 | def setTitle(self, new_title): 45 | self.page_title = new_title 46 | 47 | 48 | # Function that sets the text 49 | def setText(self, new_text): 50 | self.text = new_text 51 | 52 | 53 | # Function that sets the link 54 | def setLink(self, new_link): 55 | self.link = new_link 56 | 57 | 58 | # Function that sets the categories 59 | def setCategories(self, new_category): 60 | self.categories = new_category 61 | 62 | 63 | # Function that sets the topic 64 | def setTopic(self, new_topic): 65 | self.topic = new_topic 66 | 67 | 68 | ## CREATE AND RETURN DATA FRAME, IF NEEDED 69 | def getDF(self): 70 | return pd.DataFrame({'title': self.page_title, 'text': self.text, 'link': self.link,\ 71 | 'categories': self.categories, 'topic': self.topic}) 72 | 73 | 74 | 75 | ## HELPER FUNCTION THAT TURNS A DATA FRAME INTO A LIST OF SOURCES -------------------------------------------------------------------------------- /source_list.py: -------------------------------------------------------------------------------- 1 | """ 2 | Class that holds a list of sources. 
Created by scraping Wikipedia 3 | """ 4 | import wikipediaapi 5 | import pandas as pd 6 | import concurrent.futures 7 | from tqdm import tqdm 8 | from source import Source 9 | 10 | class SourceList: 11 | def __init__(self, source_list=[]): 12 | self.source_list = source_list 13 | 14 | # Function that checks if a source has already been added 15 | def checkSourceAdded(self, new_source): 16 | for existing_source in self.source_list: 17 | if new_source.page_title == existing_source.page_title: 18 | print("The page '{}' has already been added!".format(existing_source.page_title)) 19 | return True 20 | else: 21 | return False 22 | 23 | # Function that checks if a page (not a Source) has been added 24 | def checkPageAdded(self, new_title): 25 | for existing_source in self.source_list: 26 | if new_title == existing_source.page_title: 27 | print("The page [title] '{}' has already been added!".format(existing_source.page_title)) 28 | return True 29 | else: 30 | return False 31 | 32 | # Function that adds a new source 33 | def addSource(self, new_source): 34 | if not (self.checkSourceAdded(new_source) or self.checkPageAdded(new_source.page_title)): 35 | self.source_list.append(new_source) 36 | 37 | # Function that scrapes Wikipedia to build the source list 38 | #TODO: Add logic from other file to build sources 39 | def buildSourceList(self, titles, verbose=True): 40 | def wikiLink(link): 41 | try: 42 | page = wiki_api.page(link) 43 | if page.exists(): 44 | return {'page': link, 'text': page.text, 'link': page.fullurl, 45 | 'categories': list(page.categories.keys())} 46 | except: 47 | return None 48 | 49 | for current_title in titles: 50 | if not self.checkPageAdded(current_title): 51 | wiki_api = wikipediaapi.Wikipedia(language='en', extract_format=wikipediaapi.ExtractFormat.WIKI) 52 | page_name = wiki_api.page(current_title) 53 | if not page_name.exists(): 54 | print('Page {} does not exist.'.format(page_name)) 55 | return 56 | 57 | page_links = list(page_name.links.keys()) 58 | print_description = "Links Scraped for page '{}'".format(current_title) 59 | progress = tqdm(desc=print_description, unit='', total=len(page_links)) if verbose else None 60 | current_source = Source(page_name.title, page_name.text, page_name.fullurl, list(page_name.categories.keys()), page_name) 61 | 62 | # Parallelize the scraping, to speed it up (?) 63 | with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor: 64 | future_link = {executor.submit(wikiLink, link): link for link in page_links} 65 | for future in concurrent.futures.as_completed(future_link): 66 | data = future.result() 67 | current_source.append(data) if data else None 68 | progress.update(1) if verbose else None 69 | progress.close() if verbose else None 70 | 71 | namespaces = ('Wikipedia', 'Special', 'Talk', 'LyricWiki', 'File', 'MediaWiki', 'Template', 'Help', 'User', \ 72 | 'Category talk', 'Portal talk') 73 | current_source = current_source[(len(current_source['text']) > 20) & ~(current_source['page'].str.startswith(namespaces, na=True))] 74 | current_source['categories'] = current_source.categories.apply(lambda x: [y[9:] for y in x]) 75 | current_source['topic'] = page_name 76 | print('Wikipedia pages scraped so far:', len(current_source)) -------------------------------------------------------------------------------- /text_extractor.py: -------------------------------------------------------------------------------- 1 | """ 2 | Scrapes Wikipedia for a given list of topics. 
3 | 4 | TODO: Add function that checks if a source has been scraped already 5 | TODO: Figure out if this should run in parallel using Pool 6 | """ 7 | import wikipediaapi 8 | import pandas as pd 9 | import concurrent.futures 10 | from tqdm import tqdm 11 | 12 | 13 | class Sources: 14 | def __init__(self, verbose=True, pages=[]): 15 | self.pages = pages 16 | self.sources = pd.DataFrame(columns=['page', 'text', 'link', 'categories', 'topic']) 17 | self.verbose = verbose 18 | 19 | # TODO: Add function here the checks if a source (or topic) has been scraped already 20 | # Something like "if current_topic in self.sources['']: return True" 21 | def alreadyScraped(self, page_name): 22 | if self.sources['page'].str.contains(page_name).any(): 23 | print("Page '{}' has been scraped already!".format(page_name)) 24 | return True 25 | else: 26 | return False 27 | 28 | def extract(self): 29 | def wikiLink(link): 30 | try: 31 | page = wiki_api.page(link) 32 | if page.exists(): 33 | return {'page': link, 'text': page.text, 'link': page.fullurl, 34 | 'categories': list(page.categories.keys())} 35 | except: 36 | return None 37 | 38 | for page_name in self.pages: 39 | if not self.alreadyScraped(page_name): 40 | wiki_api = wikipediaapi.Wikipedia(language='en', extract_format=wikipediaapi.ExtractFormat.WIKI) 41 | page_name = wiki_api.page(page_name) 42 | if not page_name.exists(): 43 | print('Page {} does not exist.'.format(page_name)) 44 | return 45 | 46 | page_links = list(page_name.links.keys()) 47 | progress = tqdm(desc='Links Scraped', unit='', total=len(page_links)) if self.verbose else None 48 | current_source = [{'page': page_name, 'text': page_name.text, 'link': page_name.fullurl, 49 | 'categories': list(page_name.categories.keys())}] 50 | 51 | # Parallelize the scraping, to speed it up (?) 52 | with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor: 53 | future_link = {executor.submit(wikiLink, link): link for link in page_links} 54 | for future in concurrent.futures.as_completed(future_link): 55 | data = future.result() 56 | self.sources.append(data) if data else None 57 | progress.update(1) if self.verbose else None 58 | progress.close() if self.verbose else None 59 | 60 | namespaces = ('Wikipedia', 'Special', 'Talk', 'LyricWiki', 'File', 'MediaWiki', 'Template', 'Help', 'User', \ 61 | 'Category talk', 'Portal talk') 62 | current_source = self.sources[(len(self.sources['text']) > 20) & ~(self.sources['page'].str.startswith(namespaces, na=True))] 63 | current_source['categories'] = self.sources.categories.apply(lambda x: [y[9:] for y in x]) 64 | current_source['topic'] = page_name 65 | print('Wikipedia pages scraped so far:', len(self.sources)) 66 | 67 | 68 | return self.sources 69 | -------------------------------------------------------------------------------- /wiki_knowledge_graph.py: -------------------------------------------------------------------------------- 1 | """ 2 | This file mines Wikipedia articles for syntax and keywords, and builds a knowledge graph. 3 | 4 | Author: Nathaniel M. Burley 5 | 6 | Notes: 7 | - Look at Word2Vec, and then embedding that "vector space" as a knowledge graph and 8 | preserve the relations somehow... 
9 | 
10 | Sources:
11 |     - https://towardsdatascience.com/auto-generated-knowledge-graphs-92ca99a81121
12 |     - Really useful overview, with real-world examples and lots of pictures (for class paper):
13 |       https://usc-isi-i2.github.io/slides/2018-02-aaai-tutorial-constructing-kgs.pdf
14 | 
15 | """
16 | 
17 | # Wikipedia scraping libraries
18 | from typing import List, Tuple
19 | from pandas.core.frame import DataFrame
20 | import wikipediaapi  # pip install wikipedia-api
21 | import pandas as pd
22 | import concurrent.futures
23 | from tqdm import tqdm
24 | # NLP/Computational Linguistics libraries
25 | import re
26 | import spacy
27 | import neuralcoref
28 | # Graph libraries
29 | import networkx as nx
30 | import matplotlib.pyplot as plt
31 | # Miscellaneous
32 | import pickle
33 | 
34 | # Shared Wikipedia API client (module-level so helper functions like get_link can use it)
35 | wiki_api = wikipediaapi.Wikipedia(language='en', extract_format=wikipediaapi.ExtractFormat.WIKI)
36 | ########################################################################################################################
37 | ## SCRAPE WIKIPEDIA REGARDING A TOPIC
38 | ########################################################################################################################
39 | 
40 | # Function that gets all unique links in an array of pages
41 | def get_page_links(page_names: list) -> list:
42 |     page_links = []
43 |     for page_name in page_names:
44 |         new_links = list(page_name.links.keys())
45 |         page_links = list(set().union(page_links, new_links))
46 | 
47 |     print("\n\nLINKS: {}".format(page_links))
48 |     return page_links
49 | 
50 | 
51 | # Function that gets a page object for a link, or None
52 | def get_link(link):
53 |     try:
54 |         page = wiki_api.page(link)
55 |         if page.exists():
56 |             return {'page': link, 'text': page.text, 'link': page.fullurl, 'categories': list(page.categories.keys())}
57 |     except Exception:
58 |         return None
59 | 
60 | 
61 | # Function that removes invalid pages from a list
62 | # TODO: Annotate what the lists take in the arguments
63 | def remove_null_pages(page_names: list,
64 |                       topic_names: list) -> Tuple[list, list]:
65 | 
66 |     # Build filtered copies rather than removing items from the lists while iterating over them
67 |     valid_pages, valid_topics = [], []
68 |     for page_name, topic_name in zip(page_names, topic_names):
69 | 
70 |         # If the page doesn't exist, skip it
71 |         if not page_name.exists():
72 |             print(f'Page \'{topic_name}\' does not exist')
73 | 
74 |         # Otherwise keep the page and its topic, and display the page name
75 |         else:
76 |             print(f'Verified page \'{page_name}\'')
77 |             valid_pages.append(page_name)
78 |             valid_topics.append(topic_name)
79 | 
80 |     # Return the updated pages and topic names
81 |     return valid_pages, valid_topics
82 | 
83 | 
84 | # Function that scrapes all the pages in our list
85 | # TODO: Annotate this
86 | def scrape_all_pages(topics: list, pages: list, links: list, verbose: bool = True) -> DataFrame:
87 |     progress = tqdm(desc='Links Scraped', unit='', total=len(links)) if verbose else None
88 |     sources = [{'page': topic_name, 'text': page_name.text, 'link': page_name.fullurl, \
89 |                 'categories': list(page_name.categories.keys())} for topic_name, page_name in zip(topics, pages)]
90 | 
91 |     # Parallelize the scraping, to speed it up
92 |     with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
93 |         future_link = {executor.submit(get_link, link): link for link in links}
94 |         for future in concurrent.futures.as_completed(future_link):
95 |             data = future.result()
96 |             sources.append(data) if data else None
97 |             progress.update(1) if verbose else None
98 |         progress.close() if verbose else None
99 | 
100 |     # Define namespaces (Wikipedia under the MediaWiki umbrella)
101 |     namespaces = ('Wikipedia', 'Special', 'Talk', 'LyricWiki', 'File', 
'MediaWiki', 'Template', 'Help', 'User', \ 102 | 'Category talk', 'Portal talk') 103 | sources = pd.DataFrame(sources) 104 | sources = sources[(len(sources['text']) > 20) & ~(sources['page'].str.startswith(namespaces, na=True))] 105 | sources['categories'] = sources.categories.apply(lambda x: [y[9:] for y in x]) 106 | #sources['topic'] = topic_name # Seems redundant? Since 'page' is already set to topic_name? 107 | print('Wikipedia pages scraped:', len(sources)) 108 | 109 | return sources 110 | 111 | 112 | # Function that scrapes Wikipedia for a given topic 113 | def wiki_scrape(topic_names, verbose=True): 114 | """ 115 | Function that scrapes Wikipedia for text. 116 | - Takes in a list of topic names, i.e. page names 117 | - Scrapes text from: 118 | 1) All the pages 119 | 2) All the links in the pages 120 | Into a dataframe 121 | - Returns the dataframe 122 | Args: 123 | topic_names: List of strings. Titles of Wikipedia pages, ex. ['Miami Dolphins', 'World War II'] 124 | Must be an exact match, otherwise no page will be returned. 125 | Outputs: 126 | sources: Pandas dataframe of 127 | """ 128 | 129 | # Establish Wikipedia API connection 130 | wiki_api = wikipediaapi.Wikipedia(language='en', extract_format=wikipediaapi.ExtractFormat.WIKI) 131 | 132 | # Build a list of page objects, i.e. getting all the pages each topic name corresponds to 133 | page_names = [wiki_api.page(topic_name) for topic_name in topic_names] 134 | 135 | # Remove all NULL pages from the list (i.e. the topic doesn't match an actual Wikipedia page) 136 | page_names, topic_names = remove_null_pages(page_names=page_names, topic_names=topic_names) 137 | 138 | # Get links in all the pages 139 | page_links = get_page_links(page_names) 140 | 141 | # Scrape all the pages and links 142 | sources = scrape_all_pages(topics=topic_names, pages=page_names, links=page_links) 143 | 144 | return sources 145 | 146 | 147 | 148 | ######################################################################################################################## 149 | ## COMPUTATIONAL LINGUISTICS/NLP FUNCTIONING 150 | # This section contains functions that do dependency parsing and find subject-predicate-object dependencies 151 | # TODO: Look into coreference resolution to remove redundancies, normalize, etc. [adds a neural net here] 152 | """ 153 | NOTES: 154 | - Sources for information extraction, and spacy: 155 | - https://www.analyticsvidhya.com/blog/2019/10/how-to-build-knowledge-graph-text-using-spacy/ 156 | - https://www.analyticsvidhya.com/blog/2020/06/nlp-project-information-extraction/ 157 | - https://www.analyticsvidhya.com/blog/2019/09/introduction-information-extraction-python-spacy/ 158 | - https://realpython.com/natural-language-processing-spacy-python/ 159 | - https://programmerbackpack.com/python-knowledge-graph-understanding-semantic-relationships/ 160 | 161 | - Word2Vec similarity could be used to drop irrelevant terms 162 | - Overview of text embeddings: https://jalammar.github.io/illustrated-word2vec/ 163 | - Train the algorithm on all the pages used and then drop terms with low co-occurences (?) 164 | That might only work for large datasets, but...could be worth looking into! 165 | 166 | - Honestly, I should probably re-write this from scratch using better logic and sentence structure 167 | for the relationship tuples. 
I can hand-code lots of simple rules, and maybe find libraries with more 168 | 169 | * Something else that could be an interesting side project, perhaps on its own: inferring semantic rules from similar 170 | dependency trees! The article (linked below) gives an introduction, but it would be interesting to put some thought 171 | into mathematically determining how similar given trees are...this could actually even involve some CBR: 172 | - Take some common trees (hyper/hyponyms, etc.) and make rules for them 173 | - Next, for a sentence that doesn't fit into a pre-defined rule, figure out the most similar dependency tree, and 174 | modify the rule as needed to fit the new sentence! 175 | - That rule then goes into our list of rules, and the iteration proceeds 176 | This could be a frickin' paper. "Automatic Semantic Rule Inference Using Case Based Reasoning" 177 | - On that note, here's an algorithm for comparing trees: https://arxiv.org/pdf/1508.03381.pdf 178 | """ 179 | ######################################################################################################################## 180 | 181 | #nlp = spacy.load('en_core_web_sm') 182 | nlp = spacy.load('en_core_web_lg') 183 | neuralcoref.add_to_pipe(nlp) 184 | 185 | # spacy.util.filter_spans 186 | def filter_spans(spans): 187 | # Filter a sequence of spans so they don't contain overlaps 188 | # For spaCy 2.1.4+: this function is available as spacy.util.filter_spans() 189 | get_sort_key = lambda span: (span.end - span.start, -span.start) 190 | sorted_spans = sorted(spans, key=get_sort_key, reverse=True) 191 | result = [] 192 | seen_tokens = set() 193 | for span in sorted_spans: 194 | # Check for end - 1 here because boundaries are inclusive 195 | if span.start not in seen_tokens and span.end - 1 not in seen_tokens: 196 | result.append(span) 197 | seen_tokens.update(range(span.start, span.end)) 198 | result = sorted(result, key=lambda span: span.start) 199 | return result 200 | 201 | # Function that gets pairs of subjects and relationships 202 | def get_entity_pairs(text, coref=True): 203 | # preprocess text 204 | text = re.sub(r'\n+', '.', text) # replace multiple newlines with period 205 | text = re.sub(r'\[\d+\]', ' ', text) # remove reference numbers 206 | text = nlp(text) 207 | if coref: 208 | text = nlp(text._.coref_resolved) # resolve coreference clusters 209 | 210 | def refine_entities(ent, sent): 211 | unwanted_tokens = ( 212 | 'PRON', # pronouns 213 | 'PART', # particle 214 | 'DET', # determiner 215 | 'SCONJ', # subordinating conjunction 216 | 'PUNCT', # punctuation 217 | 'SYM', # symbol 218 | 'X', # other 219 | ) 220 | ent_type = ent.ent_type_ # get entity type 221 | if ent_type == '': 222 | ent_type = 'NOUN_CHUNK' 223 | ent = ' '.join(str(t.text) for t in nlp(str(ent)) if t.pos_ not in unwanted_tokens and t.is_stop == False) 224 | 225 | elif ent_type in ('NOMINAL', 'CARDINAL', 'ORDINAL') and str(ent).find(' ') == -1: 226 | refined = '' 227 | for i in range(len(sent) - ent.i): 228 | if ent.nbor(i).pos_ not in ('VERB', 'PUNCT'): 229 | refined += ' ' + str(ent.nbor(i)) 230 | else: 231 | ent = refined.strip() 232 | break 233 | 234 | return ent, ent_type 235 | 236 | sentences = [sent.string.strip() for sent in text.sents] # split text into sentences 237 | ent_pairs = [] 238 | for sent in sentences: 239 | sent = nlp(sent) 240 | spans = list(sent.ents) + list(sent.noun_chunks) # collect nodes 241 | spans = filter_spans(spans) 242 | with sent.retokenize() as retokenizer: 243 | [retokenizer.merge(span, attrs={'tag': span.root.tag, 
'dep': span.root.dep}) for span in spans] 244 | deps = [token.dep_ for token in sent] 245 | 246 | # limit our example to simple sentences with one subject and object 247 | # if (deps.count('obj') + deps.count('dobj')) != 1 or (deps.count('subj') + deps.count('nsubj')) != 1: 248 | # continue 249 | 250 | for token in sent: 251 | if token.dep_ not in ('obj', 'dobj'): # identify object nodes 252 | continue 253 | subject = [w for w in token.head.lefts if w.dep_ in ('subj', 'nsubj')] # identify subject nodes 254 | if subject: 255 | subject = subject[0] 256 | # identify relationship by root dependency 257 | relation = [w for w in token.ancestors if w.dep_ == 'ROOT'] 258 | if relation: 259 | relation = relation[0] 260 | # add adposition or particle to relationship 261 | try: 262 | if relation.nbor(1).pos_ in ('ADP', 'PART'): 263 | relation = ' '.join((str(relation), str(relation.nbor(1)))) 264 | except: 265 | print("Failed at line 207") 266 | return 267 | else: 268 | relation = 'unknown' 269 | 270 | subject, subject_type = refine_entities(subject, sent) 271 | token, object_type = refine_entities(token, sent) 272 | 273 | ent_pairs.append([str(subject), str(relation), str(token), str(subject_type), str(object_type)]) 274 | 275 | ent_pairs = [sublist for sublist in ent_pairs if not any(str(ent) == '' for ent in sublist)] 276 | pairs = pd.DataFrame(ent_pairs, columns=['subject', 'relation', 'object', 'subject_type', 'object_type']) 277 | #print('Entity pairs extracted:', str(len(ent_pairs))) 278 | 279 | return pairs 280 | 281 | # Function that extracts ALL the pairs. Not just the first one smh 282 | def extract_all_relations(wiki_data): 283 | all_pairs = [] 284 | for i in range(0, 100): 285 | pairs = get_entity_pairs(wiki_data.loc[i,'text']) 286 | all_pairs.append(pairs) 287 | print("Made it through {} iterations".format(i)) 288 | all_pairs_df = pd.concat(all_pairs) 289 | print("Successfully extracted {} entity pairs".format(len(all_pairs_df.index))) 290 | return all_pairs_df 291 | 292 | 293 | 294 | ######################################################################################################################## 295 | ## FUNCTION THAT DRAWS, PLOTS THE KNOWLEDGE GRAPH 296 | ######################################################################################################################## 297 | 298 | # Function that plots and draws a knowledge graph 299 | def draw_KG(pairs): 300 | k_graph = nx.from_pandas_edgelist(pairs, 'subject', 'object', create_using=nx.MultiDiGraph()) 301 | node_deg = nx.degree(k_graph) 302 | layout = nx.spring_layout(k_graph, k=0.15, iterations=20) 303 | plt.figure(num=None, figsize=(120, 90), dpi=80) 304 | nx.draw_networkx( 305 | k_graph, 306 | node_size=[int(deg[1]) * 1000 for deg in node_deg], 307 | arrowsize=20, 308 | linewidths=1.5, 309 | pos=layout, 310 | edge_color='red', 311 | edgecolors='black', 312 | node_color='green', 313 | ) 314 | labels = dict(zip(list(zip(pairs.subject, pairs.object)), pairs['relation'].tolist())) 315 | print(labels) 316 | nx.draw_networkx_edge_labels(k_graph, pos=layout, edge_labels=labels, font_color='black') 317 | plt.axis('off') 318 | plt.show() 319 | plt.savefig('church_knowledge_graph.png') 320 | 321 | 322 | # Function that plots a "subgraph", if the main graph is too messy 323 | # Example use: filter_graph(pairs, 'Directed graphs') 324 | def filter_graph(pairs, node): 325 | k_graph = nx.from_pandas_edgelist(pairs, 'subject', 'object', create_using=nx.MultiDiGraph()) 326 | edges = nx.dfs_successors(k_graph, node) 327 | nodes = [] 
328 |     for k, v in edges.items():
329 |         nodes.extend([k])
330 |         nodes.extend(v)
331 |     subgraph = k_graph.subgraph(nodes)
332 |     layout = nx.random_layout(subgraph)  # lay out only the subgraph's nodes
333 |     nx.draw_networkx(
334 |         subgraph,
335 |         node_size=1000,
336 |         arrowsize=20,
337 |         linewidths=1.5,
338 |         pos=layout,
339 |         edge_color='red',
340 |         edgecolors='black',
341 |         node_color='white'
342 |     )
343 |     labels = dict(zip(list(zip(pairs.subject, pairs.object)), pairs['relation'].tolist()))
344 |     edges = tuple(subgraph.out_edges(data=False))
345 |     sublabels = {k: labels[k] for k in edges}
346 |     nx.draw_networkx_edge_labels(subgraph, pos=layout, edge_labels=sublabels, font_color='red')
347 |     plt.axis('off')
348 |     plt.savefig('church_knowledge_graph.png')  # save before show(), which clears the figure
349 |     plt.show()
--------------------------------------------------------------------------------
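
Relatedly, `embed_wiki_data.py` is still a stub, and both the README and the notes in `wiki_knowledge_graph.py` raise the idea of using Word2Vec similarities to filter irrelevant terms. Below is a minimal sketch of what that module might contain, assuming gensim is added to the environment (`pip install gensim`); the function name and parameters are illustrative, and the keyword arguments follow gensim 4.x (older releases use `size` instead of `vector_size`):

```
import re
from gensim.models import Word2Vec

def train_wiki_word2vec(wiki_data, vector_size=100, min_count=5):
    """Train Word2Vec on the scraped page texts; min_count drops rare, low co-occurrence terms."""
    sentences = []
    for text in wiki_data['text']:
        # Crude sentence split and tokenization; spaCy could do this more carefully
        for sent in re.split(r'[.!?]\s+', text):
            tokens = [tok.lower() for tok in re.findall(r"[A-Za-z']+", sent)]
            if tokens:
                sentences.append(tokens)
    return Word2Vec(sentences=sentences, vector_size=vector_size, window=5, min_count=min_count, workers=4)
```

Terms whose vectors sit far from the seed topics (low `model.wv.similarity(term, topic)`, for terms in the vocabulary) could then be dropped before entity-pair extraction, which is one way to attack the "irrelevant pages" problem noted throughout the code.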