├── favicon.ico ├── wikilogo.png ├── user-config.py ├── requirements.txt ├── LICENSE ├── README.md ├── wikiOutput.html └── app.py /favicon.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CharlyWargnier/S4_wiki_topic_grapher/HEAD/favicon.ico -------------------------------------------------------------------------------- /wikilogo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CharlyWargnier/S4_wiki_topic_grapher/HEAD/wikilogo.png -------------------------------------------------------------------------------- /user-config.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | mylang = 'en' 3 | family = 'wikipedia' 4 | usernames['wikipedia']['en'] = 'test' -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | streamlit==0.75.0 2 | 3 | pyvis 4 | google-api-core==1.21.0 5 | google-auth==1.19.1 6 | google-cloud-language==1.3.0 7 | googleapis-common-protos==1.52.0 8 | 9 | beautifulsoup4==4.9.3 10 | matplotlib 11 | networkx==2.5 12 | pandas==1.2.1 13 | pywikibot==5.1.0 14 | requests==2.24.0 15 | seaborn==0.11.0 16 | validators==0.18.1 17 | 18 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Charly Wargnier 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | # Wiki Topic Grapher 3 | 4 | Leverage the power of Google [#NLP](https://threadreaderapp.com/hashtag/NLP) to retrieve entity relationships from Wikipedia URLs or topics! 5 | 6 | - Get interactive graphs of connected entities 7 | - Export results with entity types and salience to CSV! 8 | 9 | _________________ 10 | 11 | [![Open in Streamlit](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://share.streamlit.io/charlywargnier/s4_wiki_topic_grapher/main/app.py) 12 | 13 | 14 | ### Use cases 15 | 16 | Many cool use cases! 
17 | 
18 | - Research any topic, then get the entity associations that exist from that seed topic
19 | 
20 | - Map out related entities with your product, service or brand
21 | 
22 | - Find how well you've covered a specific topic on your website
23 | 
24 | - Differentiate your pages!
25 | 
26 | 
27 | ### Stack
28 | 
29 | 
30 | 
31 | About the stack: it's 100% [#Python](https://threadreaderapp.com/hashtag/Python)! 🐍🔥
32 | 
33 | - [@GCPcloud](https://twitter.com/GCPcloud) Natural Language API
34 | 
35 | - PyWikibot
36 | 
37 | - Networkx
38 | 
39 | - PyVis
40 | 
41 | - [@streamlit](https://twitter.com/streamlit)
42 | 
43 | - Streamlit Components -> [streamlit.io/components](https://www.streamlit.io/components)
44 | 
45 | 
46 | 
47 | 
48 | ### **⚒️ Still To-Do's**
49 | 
50 | - 💰 Add a budget estimator to get a sense of [@GCPcloud](https://twitter.com/GCPcloud) costs!
51 | 
52 | - 🌍 Add a multilingual option (currently English only)
53 | 
54 | - 📈 Add on-the-fly physics controls to the network graph
55 | 
56 | - 💯 Add the Google Knowledge Graph [#API](https://threadreaderapp.com/hashtag/API) to pull in more data (scores, etc.) (ht [@LoukilAymen](https://twitter.com/LoukilAymen))
57 | 
58 | 
59 | That code currently lives in a private repo. I should be able to make it public soon so you can reuse it in your own apps and creations! I just need to clean it up a tad, remove some sensitive bits, etc.
60 | 
61 | 
62 | 
63 | ### 🙌 Shout-outs
64 | 
65 | Kudos to [@jroakes](https://twitter.com/jroakes) for the original script. Buy that man a 🍺 for his sterling contributions! -> [paypal.com/paypalme/codes…](https://www.paypal.com/paypalme/codeseo)
66 | 
67 | Kudos also to fellow [@streamlit](https://twitter.com/streamlit) Creators:
68 | 
69 | - [@napoles3D](https://twitter.com/napoles3D), who told me about the PyVis lib! 🔥
70 | 
71 | - [@andfanilo](https://twitter.com/andfanilo)/[@ChristianKlose3](https://twitter.com/ChristianKlose3) for their precious advice! 🙏
72 | 
73 | 
74 | 
75 | 
76 | ### 💲 Beware of costs!
77 | 
78 | It can get expensive quickly with the Google Natural Language API!
79 | 
80 | 
81 | 
82 | Monitor your costs regularly via the GCP console and/or set quotas to tame that G beast! I'm planning to add a budget estimator that runs before any API calls are made. It should come in handy.
83 | 
84 | 
85 | 
86 | ### Feedback and support
87 | 
88 | 
89 | 
90 | Wiki Topic Grapher is still in beta, so expect some rough edges! Head over to my [Gitter page](https://gitter.im/DataChaz/WikiTopic) for bug reports, questions, or suggestions.
91 | 
92 | This app is free. If it's useful to you, you can buy me a ☕ to support my work! 🙏 ▶️ [buymeacoffee.com/cwar05](https://www.buymeacoffee.com/cwar05)
93 | 
94 | 
95 | 
96 | 
97 | That's all, folks. Enjoy!
98 | 
99 | 
--------------------------------------------------------------------------------
/wikiOutput.html:
--------------------------------------------------------------------------------
[Generated pyvis network graph output. The HTML markup was stripped in this dump; only scattered line numbers and the placeholder heading text "asdsad!" remain.]
28 | 29 | 30 | 101 | 102 | 103 | -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | import streamlit as st 2 | import streamlit.components.v1 as components 3 | import pandas as pd 4 | import numpy as np 5 | 6 | import networkx as nx 7 | from networkx.readwrite import json_graph 8 | from pyvis import network as net 9 | 10 | import matplotlib.pyplot as plt 11 | import seaborn as sns 12 | 13 | from bs4 import BeautifulSoup 14 | import pywikibot 15 | import math 16 | import os 17 | import re 18 | import requests 19 | import tempfile 20 | import validators 21 | 22 | from google.cloud import language 23 | from google.cloud.language import enums 24 | from google.cloud.language import types 25 | from google.cloud import language_v1 26 | from google.cloud.language_v1 import enums 27 | 28 | 29 | st.set_page_config( 30 | page_title="Wiki Topic Grapher", 31 | page_icon="favicon.ico", 32 | ) 33 | 34 | 35 | def _max_width_(): 36 | max_width_str = f"max-width: 1500px;" 37 | st.markdown( 38 | f""" 39 | 44 | """, 45 | unsafe_allow_html=True, 46 | ) 47 | 48 | 49 | _max_width_() 50 | 51 | 52 | c30, c31, c32 = st.beta_columns([1, 3.3, 3]) 53 | 54 | 55 | with c30: 56 | st.markdown("###") 57 | st.image("wikilogo.png", width=520) 58 | st.header("") 59 | 60 | with c32: 61 | st.markdown("#") 62 | st.text("") 63 | st.text("") 64 | st.markdown( 65 | "###### Original script by [JR Oakes](https://twitter.com/jroakes) - Ported to [![this is an image link](https://i.imgur.com/iIOA6kU.png)](https://www.streamlit.io/) , with :heart: by [DataChaz](https://twitter.com/DataChaz)   [![this is an image link](https://i.imgur.com/thJhzOO.png)](https://www.buymeacoffee.com/cwar05)" 66 | ) 67 | 68 | 69 | with st.beta_expander("ℹ️ - About this app ", expanded=True): 70 | st.write( 71 | """ 72 | 73 | - Wiki Topic Grapher leverages the power of [Google Natural Language API] (https://cloud.google.com/natural-language) to recursively retrieve entity relationships from any Wikipedia seed topic! 🔥 74 | - Get a network graph of these connected entities, save the graph as jpg or export the results ordered by salience to CSV! 75 | - The tool is still in Beta, with possible rough edges! [![Gitter](https://badges.gitter.im/gitterHQ/gitter.png)](https://gitter.im/DataChaz/WikiTopic) for bug report, questions, or suggestions. 76 | - Kudos to JR Oakes for the original script - [buy the man a 🍺 here!](https://www.paypal.com/paypalme/codeseo) 77 | - This app is free. If it's useful to you, you can [buy me a ☕](https://www.buymeacoffee.com/cwar05) to support my work! 🙏 78 | 79 | 80 | """ 81 | ) 82 | 83 | st.markdown("---") 84 | 85 | 86 | with st.beta_expander("🛠️ - How to use it ", expanded=False): 87 | 88 | st.markdown( 89 | """ 90 | - Wiki Topic Grapher takes the top entities for each Wikipedia URL and follows those entities according to the specified limit and depth parameters 91 | - Here's a [neat chart](https://i.imgur.com/wZOU1wh.png) explaining how it all works""" 92 | ) 93 | 94 | st.markdown("---") 95 | 96 | st.markdown( 97 | """ 98 | 99 | **URL:** 100 | 101 | - Paste a Wikipedia URL 102 | - Make sure the URL belongs to https://en.wikipedia.org/ 103 | - Only English is currently supported. More languages to come! 
:) 104 | 105 | _ 106 | 107 | **Topic:** 108 | 109 | - Select "Topic" via the left-hand toggle and type your keyword 110 | - It will return the closest matching Wikipedia page for that given string 111 | - Use that method with caution as currently there's no way to get the related page before calling the API 112 | - Can be costly if the page has lots of text! 113 | 114 | _ 115 | 116 | **Depth**: 117 | - The maximum number of entities to pull for each Wikipedia page 118 | - Depth 1 or 2 are the recommended settings 119 | - Depth 3 and above work yet it may not be usable nor legible! 120 | 121 | _ 122 | 123 | **Limit**: 124 | - The max number of entities to pull for each page 125 | 126 | """ 127 | ) 128 | 129 | st.markdown("---") 130 | 131 | with st.beta_expander("🔎- SEO use cases ", expanded=False): 132 | st.write( 133 | """ 134 | 135 | - Research any topic then get entity associations that exist from that seed topic 136 | - Map out these related entities & alternative lexical fields with your product, service or brand 137 | - Find how well you've covered a specific topic on your website 138 | - Differentiate pages on your website! 139 | 140 | """ 141 | ) 142 | 143 | st.markdown("---") 144 | 145 | 146 | with st.beta_expander("🧰 - Stack + To-Do's", expanded=False): 147 | 148 | st.markdown("") 149 | 150 | st.write( 151 | """ 152 | ** Stack ** 153 | 154 | - 100% Python! 🐍🔥 155 | - [Google Natural Language API](https://cloud.google.com/natural-language) 156 | - [PyWikibot](https://www.mediawiki.org/wiki/Manual:Pywikibot) 157 | - [Networkx](https://networkx.org/) 158 | - [Streamlit](https://www.streamlit.io/) 159 | - [Streamlit Components](https://www.streamlit.io/components)""" 160 | ) 161 | 162 | st.markdown("") 163 | 164 | st.write( 165 | """ 166 | 167 | ** To-Do's ** 168 | 169 | - Add a budget estimator to estimate Google Cloud Language API costs 170 | - Add a multilingual option (currently English only) 171 | - Add on-the-fly physics controls to the network graph 172 | - Exception handling is still pretty broad at the moment and could be improved 173 | 174 | """ 175 | ) 176 | 177 | st.markdown("---") 178 | 179 | st.markdown("## **① Upload your Google NLP key **") 180 | with st.beta_expander("ℹ️ - How to create your credentials?", expanded=False): 181 | 182 | st.write( 183 | """ 184 | 185 | - In the [Cloud Console](https://console.cloud.google.com/), go to the _'Create Service Account Key'_ page 186 | - From the *Service account list*, select _'New service account'_ 187 | - In the *Service account name* field, enter a name 188 | - From the *Role list*, select _'Project > Owner'_ 189 | - Click create, then download your JSON key 190 | - Upload it (or drag and drop it) in the grey box below 👇 191 | 192 | """ 193 | ) 194 | st.markdown("---") 195 | 196 | 197 | # Pywikibot needs a config file 198 | pywikibot_config = r"""# -*- coding: utf-8 -*- 199 | mylang = 'en' 200 | family = 'wikipedia' 201 | usernames['wikipedia']['en'] = 'test'""" 202 | 203 | with open("user-config.py", "w", encoding="utf-8") as f: 204 | f.write(pywikibot_config) 205 | 206 | c3, c4 = st.beta_columns(2) 207 | 208 | with c3: 209 | try: 210 | uploaded_file = st.file_uploader("", type="json") 211 | with tempfile.NamedTemporaryFile(delete=False) as fp: 212 | fp.write(uploaded_file.getvalue()) 213 | try: 214 | os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = fp.name 215 | with open(fp.name, "rb") as a: 216 | client = language.LanguageServiceClient.from_service_account_json( 217 | fp.name 218 | ) 219 | 220 | finally: 221 | if 
os.path.isfile(fp.name): 222 | os.unlink(fp.name) 223 | 224 | except AttributeError: 225 | 226 | print("wait") 227 | 228 | with c4: 229 | st.markdown("###") 230 | c = st.beta_container() 231 | if uploaded_file: 232 | st.success("✅ Nice! Your credentials are uploaded!") 233 | 234 | 235 | def google_nlp_entities( 236 | input, 237 | input_type="html", 238 | result_type="all", 239 | limit=10, 240 | invalid_types=["OTHER", "NUMBER", "DATE"], 241 | ): 242 | 243 | """ 244 | Loads HTML or text from a URL and passes to the Google NLP API 245 | Parameters: 246 | * input: HTML or Plain Text to send to the Google Language API 247 | * input_type: Either `html` or `text` (string) 248 | * result_type: Either `all`(pull all entities) or `wikipedia` (only pull entities with Wikipedia pages) 249 | * limit: Limits the number of results to this number sorted, decending, by salience. 250 | * invalid_types: A list of entity types to exclude. 251 | Returns: 252 | List of entities in format [{'name':,'type':,'salience':, 'wikipedia': }] 253 | """ 254 | 255 | def get_type(type): 256 | return client.enums.Entity.Type(d.type).name 257 | 258 | if not input: 259 | print("No input content found.") 260 | return None 261 | 262 | if input_type == "html": 263 | doc_type = language.enums.Document.Type.HTML 264 | else: 265 | doc_type = language.enums.Document.Type.PLAIN_TEXT 266 | 267 | document = types.Document(content=input, type=doc_type) 268 | 269 | features = {"extract_entities": True} 270 | 271 | try: 272 | response = client.annotate_text( 273 | document=document, features=features, timeout=20 274 | ) 275 | except Exception as e: 276 | print("Error with language API: ", re.sub(r"\(.*$", "", str(e))) 277 | return [] 278 | 279 | used = [] 280 | results = [] 281 | for d in response.entities: 282 | 283 | if limit and len(results) >= limit: 284 | break 285 | 286 | if get_type(d.type) not in invalid_types and d.name not in used: 287 | 288 | data = { 289 | "name": d.name, 290 | "type": client.enums.Entity.Type(d.type).name, 291 | "salience": d.salience, 292 | } 293 | if result_type is "wikipedia": 294 | if "wikipedia_url" in d.metadata: 295 | data["wikipedia"] = d.metadata["wikipedia_url"] 296 | results.append(data) 297 | else: 298 | results.append(data) 299 | 300 | used.append(d.name) 301 | 302 | return results 303 | 304 | 305 | def load_page_title(url): 306 | """ 307 | Returns the given a URL. 308 | Parameters: 309 | * url: URL (string) 310 | Returns: 311 | Inner text of <title> (string) 312 | """ 313 | soup = BeautifulSoup(requests.get(url).text) 314 | return soup.title.text 315 | 316 | 317 | @st.cache(allow_output_mutation=True, show_spinner=False) 318 | def html_to_text(html, target_elements=None): 319 | """ 320 | Transforms HTML to clean text 321 | Parameters: 322 | * html: HTML from a web page (str) 323 | * target_elements: Elements like `div` or `p` to target pulling text from. 
(optional) (string) 324 | Returns: 325 | Text (string) 326 | """ 327 | soup = BeautifulSoup(html) 328 | 329 | for script in soup( 330 | ["script", "style"] 331 | ): # remove all javascript and stylesheet code 332 | script.extract() 333 | 334 | targets = [] 335 | 336 | if target_elements: 337 | targets = soup.find_all(target_elements) 338 | 339 | if target_elements and len(targets) > 3: 340 | text = " ".join([t.text for t in targets]) 341 | else: 342 | text = soup.get_text() 343 | 344 | # break into lines and remove leading and trailing space on each 345 | lines = (line.strip() for line in text.splitlines()) 346 | # break multi-headlines into a line each 347 | chunks = (phrase.strip() for line in lines for phrase in line.split(" ")) 348 | # drop blank lines 349 | text = "\n".join(chunk for chunk in chunks if chunk) 350 | return text 351 | 352 | 353 | @st.cache(allow_output_mutation=True, show_spinner=False) 354 | def load_text_from_url(url, **data): 355 | 356 | """ 357 | Loads html from a URL 358 | Parameters: 359 | * url: url of page to load (str) 360 | * timeout: request timeout in seconds (int) default: 20 361 | Returns: 362 | HTML (str) 363 | """ 364 | 365 | timeout = data.get("timeout", 20) 366 | 367 | results = [] 368 | 369 | try: 370 | # print("Extracting HTML from: {}".format(url)) 371 | response = requests.get( 372 | url, 373 | headers={ 374 | "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0" 375 | }, 376 | timeout=timeout, 377 | ) 378 | 379 | text = response.text 380 | status = response.status_code 381 | 382 | if status == 200 and len(text) > 0: 383 | return text 384 | else: 385 | print("Incorrect status returned: ", status) 386 | 387 | return None 388 | 389 | except Exception as e: 390 | print("Problem with url: {0}.".format(url)) 391 | return None 392 | 393 | 394 | @st.cache(allow_output_mutation=True, show_spinner=False) 395 | def get_wikipedia_url(query): 396 | """ 397 | Finds the closest matching Wikipedia page for a given string. 398 | Parameters: 399 | * query: Query to search Wikipedia with. (string) 400 | Returns: 401 | The top matching URL for the query. Follows redirects (string) 402 | """ 403 | sitew = pywikibot.Site("en", "wikipedia") 404 | result = None 405 | print("looking up:", query) 406 | search = sitew.search( 407 | query, where="title", get_redirects=True, total=1, content=False, namespaces="0" 408 | ) 409 | for page in search: 410 | if page.isRedirectPage(): 411 | page = page.getRedirectTarget() 412 | result = page.full_url() 413 | break 414 | 415 | return result 416 | 417 | 418 | @st.cache(allow_output_mutation=True, show_spinner=False) 419 | def recurse_entities( 420 | input_data, entity_results=[], G=nx.Graph(), current_depth=0, depth=2, limit=3 421 | ): 422 | """ 423 | Recursively finds entities of connected Wikipedia topics by taking the top entities 424 | for each page and following those entities up to the specified depth 425 | Parameters: 426 | * input_data: A topic or URL. If topic, finds the closes matching Wikipedia start page. 427 | If URL, starts with the top enetities of that page. (string) 428 | * depth: Max recursion depth (integer) 429 | * limit: The max number of entities to pull for each page. (integer) 430 | Returns: 431 | A tuple of: 432 | * entity_results: List of dictionaries of found entities. 433 | * G: Networkx graph of entities. 434 | """ 435 | if isinstance(input_data, str): 436 | # Starting fresh. Make sure variables are fresh. 
437 | entity_results = [] 438 | G = nx.Graph() 439 | current_depth = 0 440 | if not validators.url(input_data): 441 | input_data = get_wikipedia_url(input_data) 442 | if not input_data: 443 | print("No Wikipedia URL Found.") 444 | return None, None 445 | else: 446 | print("Wikipedia URL: ", input_data) 447 | name = load_page_title(input_data).split("-")[0].strip() 448 | else: 449 | name = load_page_title(input_data) 450 | input_data = ( 451 | [ 452 | { 453 | "name": name.title(), 454 | "type": "START", 455 | "salience": 0.0, 456 | "wikipedia": input_data, 457 | } 458 | ] 459 | if input_data 460 | else [] 461 | ) 462 | 463 | # Regex for wikipedia terms to not bias entities returned 464 | subs = r"(wikipedia|wikimedia|wikitext|mediawiki|wikibase)" 465 | 466 | for d in input_data: 467 | url = d["wikipedia"] 468 | name = d["name"] 469 | 470 | print( 471 | " " * current_depth + "Level: {0} Name: {1}".format(current_depth, name) 472 | ) 473 | 474 | html = load_text_from_url(url) 475 | 476 | # html_to_text will default to all text if < 4 `p` elements found. 477 | if "wikipedia.org" in url: 478 | html = html_to_text(html, target_elements="p") 479 | else: 480 | html = html_to_text(html) 481 | 482 | # Kill brutally wikipedia terms. 483 | html = re.sub(subs, "", html, flags=re.IGNORECASE) 484 | 485 | results = [ 486 | r 487 | for r in google_nlp_entities( 488 | html, input_type="text", limit=None, result_type="wikipedia" 489 | ) 490 | if "wiki" not in r["name"].lower() and not G.has_node(r["name"]) 491 | ][:limit] 492 | _ = [G.add_edge(name, r["name"]) for r in results] 493 | entity_results.extend(results) 494 | 495 | new_depth = int(current_depth + 1) 496 | if results and new_depth <= depth: 497 | recurse_entities(results, entity_results, G, new_depth, depth, limit) 498 | 499 | if current_depth == 0: 500 | return entity_results, G 501 | 502 | 503 | @st.cache(allow_output_mutation=True, show_spinner=False) 504 | def hierarchy_pos(G, root=None, width=1.0, vert_gap=0.2, vert_loc=0, xcenter=0.5): 505 | 506 | """ 507 | From Joel's answer at https://stackoverflow.com/a/29597209/2966723. 508 | Licensed under Creative Commons Attribution-Share Alike 509 | 510 | If the graph is a tree this will return the positions to plot this in a 511 | hierarchical layout. 512 | 513 | G: the graph (must be a tree) 514 | 515 | root: the root node of current branch 516 | - if the tree is directed and this is not given, 517 | the root will be found and used 518 | - if the tree is directed and this is given, then 519 | the positions will be just for the descendants of this node. 520 | - if the tree is undirected and not given, 521 | then a random choice will be used. 
522 | 523 | width: horizontal space allocated for this branch - avoids overlap with other branches 524 | 525 | vert_gap: gap between levels of hierarchy 526 | 527 | vert_loc: vertical location of root 528 | 529 | xcenter: horizontal location of root 530 | """ 531 | if not nx.is_tree(G): 532 | raise TypeError("cannot use hierarchy_pos on a graph that is not a tree") 533 | 534 | if root is None: 535 | if isinstance(G, nx.DiGraph): 536 | root = next( 537 | iter(nx.topological_sort(G)) 538 | ) # allows back compatibility with nx version 1.11 539 | else: 540 | root = random.choice(list(G.nodes)) 541 | 542 | def _hierarchy_pos( 543 | G, root, width=1.0, vert_gap=0.2, vert_loc=0, xcenter=0.5, pos=None, parent=None 544 | ): 545 | """ 546 | see hierarchy_pos docstring for most arguments 547 | 548 | pos: a dict saying where all nodes go if they have been assigned 549 | parent: parent of this branch. - only affects it if non-directed 550 | 551 | """ 552 | 553 | if pos is None: 554 | pos = {root: (xcenter, vert_loc)} 555 | else: 556 | pos[root] = (xcenter, vert_loc) 557 | children = list(G.neighbors(root)) 558 | if not isinstance(G, nx.DiGraph) and parent is not None: 559 | children.remove(parent) 560 | if len(children) != 0: 561 | dx = width / len(children) 562 | nextx = xcenter - width / 2 - dx / 2 563 | for child in children: 564 | nextx += dx 565 | pos = _hierarchy_pos( 566 | G, 567 | child, 568 | width=dx, 569 | vert_gap=vert_gap, 570 | vert_loc=vert_loc - vert_gap, 571 | xcenter=nextx, 572 | pos=pos, 573 | parent=root, 574 | ) 575 | return pos 576 | 577 | return _hierarchy_pos(G, root, width, vert_gap, vert_loc, xcenter) 578 | 579 | 580 | def plot_entity_branches(G, w=10, h=10, c=1, font_size=14, filename=None): 581 | """ 582 | Given a networkx graph, builds a recursive tree graph 583 | 584 | Parameters: 585 | * G: Networkx graph of entities. 586 | * w: Width of output plot 587 | * h: height of output plot 588 | * c: Circle percentage (float) 0.5 is a semi-circle. Range: 0.1-1.0 589 | * font_size: Font Size of labels (integer) 590 | * filename: Filename for the saved plot. Optional (string) 591 | Returns: 592 | Nothing. 
Plots a graph 593 | 594 | """ 595 | start = list(G.nodes)[0] 596 | G = nx.bfs_tree(G, start) 597 | plt.figure(figsize=(w, h)) 598 | pos = hierarchy_pos(G, start, width=float(2 * c) * math.pi, xcenter=0) 599 | new_pos = { 600 | u: (r * math.sin(theta), r * math.cos(theta)) for u, (theta, r) in pos.items() 601 | } 602 | nx.draw( 603 | G, 604 | pos=new_pos, 605 | alpha=0.8, 606 | node_size=25, 607 | with_labels=True, 608 | font_size=font_size, 609 | edge_color="gray", 610 | ) 611 | nx.draw_networkx_nodes( 612 | G, pos=new_pos, nodelist=[start], node_color="blue", node_size=500 613 | ) 614 | 615 | if filename: 616 | plt.savefig("{0}/{1}".format("images", filename)) 617 | 618 | 619 | st.set_option("deprecation.showPyplotGlobalUse", False) 620 | 621 | st.markdown("## **② Choose a URL or a topic **") 622 | 623 | with st.beta_expander("ℹ️ - How Google Cloud pricing works ", expanded=False): 624 | 625 | st.write( 626 | """ 627 | - Your usage of the Google Natural Language API is calculated in terms 628 | of "units" 629 | - Each document sent to the API for analysis is at least one unit 630 | - Documents that have more than 1,000 Unicode characters are considered as multiple units (1 unit per 1,000 characters) 631 | - More info about pricing on [Google's website](https://cloud.google.com/natural-language/pricing) 632 | 633 | """ 634 | ) 635 | 636 | st.markdown("---") 637 | 638 | st.text("") 639 | 640 | try: 641 | 642 | c10, c0, c8, c1, c2, c3, c4, c5, c6 = st.beta_columns( 643 | [0.10, 0.50, 0.10, 8, 0.10, 1.5, 0.10, 1.5, 0.10] 644 | ) 645 | 646 | with c0: 647 | st.text("") 648 | toggle = st.select_slider("", options=("URL", "Tpc")) 649 | 650 | with c1: 651 | 652 | from re import search 653 | 654 | substring = "http://|https://" 655 | 656 | if toggle == "Tpc": 657 | keyword = st.text_input( 658 | "Enter a topic. (Returns the closest matching Wikipedia page for a given string)", 659 | key=1, 660 | ) 661 | if keyword: 662 | if search(substring, keyword): 663 | st.warning( 664 | "⚠️ Seems like you're trying to paste a URL. Switch to 'URL' mode?" 665 | ) 666 | st.stop() 667 | else: 668 | st.markdown('Keyword is "' + str(keyword) + '"') 669 | 670 | elif toggle == "URL": 671 | 672 | keyword = st.text_input( 673 | "Enter a Wikipedia URL", 674 | key=2, 675 | ) 676 | 677 | if keyword: 678 | if search(substring, keyword): 679 | st.markdown('URL is "' + str(keyword) + '"') 680 | else: 681 | st.warning( 682 | "⚠️ Please check the URL format as it's invalid. It needs to start with http:// or https://. If you wanted to paste a keyword, switch to 'Topic' mode." 
683 | ) 684 | st.stop() 685 | 686 | with c3: 687 | depth = st.number_input( 688 | "Depth", step=1, value=1, min_value=1, max_value=3, key=1 689 | ) 690 | 691 | with c5: 692 | limit = st.number_input( 693 | "Limit", step=1, value=1, min_value=1, max_value=3, key=2 694 | ) 695 | 696 | c3, c4 = st.beta_columns(2) 697 | 698 | with c3: 699 | st.text("") 700 | st.text("") 701 | cButton = st.beta_container() 702 | 703 | with c4: 704 | st.text("") 705 | c30 = st.beta_container() 706 | 707 | button1 = cButton.button("✨ Happy with costs, get me the data!") 708 | 709 | if not button1 and not uploaded_file: 710 | st.stop() 711 | elif not button1 and uploaded_file: 712 | st.stop() 713 | elif button1 and not uploaded_file: 714 | c.warning("◀️ Add credentials 1st") 715 | st.stop() 716 | else: 717 | pass 718 | 719 | if button1: 720 | 721 | import time 722 | 723 | latest_iteration = st.empty() 724 | bar = st.progress(0) 725 | 726 | for i in range(100): 727 | latest_iteration.markdown(f"Sending your request ({i+1} % Completed)") 728 | bar.progress(i + 1) 729 | time.sleep(0.05) 730 | 731 | data, G = recurse_entities(keyword, depth=depth, limit=limit) 732 | 733 | st.markdown("## **③ Check results! ✨**") 734 | 735 | st.text("") 736 | 737 | g4 = net.Network( 738 | directed=True, 739 | heading="", 740 | height="800px", 741 | width="800px", 742 | notebook=True, 743 | ) 744 | 745 | c1, c2, c3 = st.beta_columns([1, 3, 2]) 746 | 747 | with c2: 748 | g4.from_nx(G) 749 | g4.show("wikiOutput.html") 750 | HtmlFile = open("wikiOutput.html", "r") 751 | source_code = HtmlFile.read() 752 | components.html(source_code, height=1000, width=1000) 753 | 754 | c30, c31, c32 = st.beta_columns(3) 755 | 756 | with c30: 757 | c1 = st.beta_container() 758 | with c31: 759 | c2 = st.beta_container() 760 | 761 | cm = sns.light_palette("green", as_cmap=True) 762 | df = pd.DataFrame(data).sort_values(by="salience", ascending=False) 763 | df = df.reset_index() 764 | df.index += 1 765 | df = df.drop(["index"], axis=1) 766 | format_dictionary = { 767 | "salience": "{:.1%}", 768 | } 769 | dfStyled = df.style.background_gradient(cmap=cm) 770 | dfStyled2 = dfStyled.format(format_dictionary) 771 | st.table(dfStyled2) 772 | 773 | try: 774 | import base64 775 | 776 | csv = df.to_csv(index=False) 777 | b64 = base64.b64encode(csv.encode()).decode() 778 | href = f'<a href="data:file/csv;base64,{b64}" download="listViewExport.csv">** - Download data to CSV 🎁 **</a>' 779 | c1.markdown(href, unsafe_allow_html=True) 780 | except NameError: 781 | print("wait") 782 | 783 | except Exception as e: 784 | 785 | st.warning( 786 | f""" 787 | 🤔 ** Snap! ** 788 | have you checked that: 789 | - The credentials JSON file you have added is valid? 790 | - Google Cloud's billing is enabled? 791 | - The URL you typed is a valid Wikipedia URL (that is, if you selected the "URL" option)? 792 | 793 | If this keeps happening -> [![Gitter](https://badges.gitter.im/gitterHQ/gitter.png)](https://gitter.im/DataChaz/WikiTopic) 794 | 795 | """ 796 | ) 797 | --------------------------------------------------------------------------------
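
### 💰 Bonus: a starting point for the budget estimator

The pricing expander in `app.py` explains how usage is billed: every document sent to the Natural Language API counts as at least one unit, and documents longer than 1,000 Unicode characters count as one unit per 1,000 characters. The "budget estimator" listed in the to-dos could be built directly on that rule. The snippet below is only a minimal sketch of that idea, not code from the repo: the function names and the `price_per_1000_units` argument are assumptions, and the page count is the worst case implied by the `depth`/`limit` recursion in `recurse_entities`.

```python
# Hypothetical sketch (not part of app.py): rough Natural Language API budgeting,
# based on the unit rule described in the app's pricing expander.
import math


def estimate_units(text: str) -> int:
    """Billable units for one document: at least 1, then 1 per 1,000 characters."""
    return max(1, math.ceil(len(text) / 1000))


def max_pages(depth: int, limit: int) -> int:
    """Worst-case number of pages one run analyses (the seed page is level 0)."""
    return sum(limit ** level for level in range(depth + 1))


def estimate_cost(texts, price_per_1000_units: float) -> float:
    """Rough cost for the documents of one run.

    `price_per_1000_units` is a placeholder: take the current rate from
    https://cloud.google.com/natural-language/pricing.
    """
    total_units = sum(estimate_units(t) for t in texts)
    return total_units / 1000 * price_per_1000_units
```

For example, a run with depth 2 and limit 3 analyses at most 1 + 3 + 9 = 13 pages; if each page's extracted text is around 40,000 characters, that is roughly 13 × 40 = 520 units for the run.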