├── favicon.ico
├── wikilogo.png
├── user-config.py
├── requirements.txt
├── LICENSE
├── README.md
├── wikiOutput.html
└── app.py
/favicon.ico:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CharlyWargnier/S4_wiki_topic_grapher/HEAD/favicon.ico
--------------------------------------------------------------------------------
/wikilogo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CharlyWargnier/S4_wiki_topic_grapher/HEAD/wikilogo.png
--------------------------------------------------------------------------------
/user-config.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | mylang = 'en'
3 | family = 'wikipedia'
4 | usernames['wikipedia']['en'] = 'test'
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | streamlit==0.75.0
2 |
3 | pyvis
4 | google-api-core==1.21.0
5 | google-auth==1.19.1
6 | google-cloud-language==1.3.0
7 | googleapis-common-protos==1.52.0
8 |
9 | beautifulsoup4==4.9.3
10 | matplotlib
11 | networkx==2.5
12 | pandas==1.2.1
13 | pywikibot==5.1.0
14 | requests==2.24.0
15 | seaborn==0.11.0
16 | validators==0.18.1
17 |
18 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 Charly Wargnier
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Wiki Topic Grapher
3 |
4 | Leverage the power of Google's Natural Language API to retrieve entity relationships from Wikipedia URLs or topics!
5 |
6 | - Get interactive graphs of connected entities
7 | - Export results with entity types and salience to CSV!
8 |
9 | _________________
10 |
11 | [Launch the app on Streamlit Sharing](https://share.streamlit.io/charlywargnier/s4_wiki_topic_grapher/main/app.py)
12 |
13 |
14 | ### Use cases
15 |
16 | Many cool use cases!
17 |
18 | - Research any topic then get entity associations that exist from that seed topic
19 |
20 | - Map out related entities with your product, service or brand
21 |
22 | - Find how well you've covered a specific topic on your website
23 |
24 | - Differentiate your pages!
25 |
26 |
27 | ### Stack
28 |
30 |
31 | The stack is 100% Python! 🐍🔥 A minimal sketch of how the pieces fit together follows the list below.
32 |
33 | - [@GCPcloud](https://twitter.com/GCPcloud) Natural Language API
34 |
35 | - PyWikibot
36 |
37 | - Networkx
38 |
39 | - PyVis
40 |
41 | - [@streamlit](https://twitter.com/streamlit)
42 |
43 | - Streamlit Components -> [streamlit.io/components](https://www.streamlit.io/components)
44 |
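For a feel of how these pieces fit together, here's a minimal, illustrative sketch (the seed URL, the 5-entity cap and the `sketch.html` filename are arbitrary, and the real app adds recursion, caching, salience sorting and Wikipedia-only filtering on top of this):

```python
import networkx as nx
import requests
from bs4 import BeautifulSoup
from google.cloud import language
from pyvis import network as net

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a valid service-account key.
client = language.LanguageServiceClient()

# 1. Pull the text of a seed Wikipedia page.
seed = "https://en.wikipedia.org/wiki/Natural_language_processing"
text = BeautifulSoup(requests.get(seed, timeout=20).text, "html.parser").get_text()

# 2. Ask the Natural Language API which entities it finds in that text.
document = language.types.Document(
    content=text, type=language.enums.Document.Type.PLAIN_TEXT
)
entities = client.analyze_entities(document=document).entities

# 3. Connect the seed topic to a few of the returned entities in a networkx graph.
G = nx.Graph()
for entity in list(entities)[:5]:
    G.add_edge("Natural language processing", entity.name)

# 4. Render the graph as an interactive HTML page with PyVis.
g = net.Network(notebook=True)
g.from_nx(G)
g.show("sketch.html")
```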
45 |
46 |
47 |
48 | ### **⚒️ Still To-Do’s**
49 |
50 | - 💰 Add a budget estimator to get a sense of [@GCPcloud](https://twitter.com/GCPcloud) costs!
51 |
52 | - 🌍 Add a multilingual option (currently English only)
53 |
54 | - 📈 Add on-the-fly physics controls to the network graph
55 |
56 | - 💯 Add the Google Knowledge Graph API to pull in more data (scores, etc.) (ht [@LoukilAymen](https://twitter.com/LoukilAymen))
57 |
58 |
59 | That code currently lives in a private repo. I should be able to make it public soon so you can reuse it in your own apps and creations! I just need to clean it up a tad, remove some sensitive bits, etc.
60 |
61 |
62 |
63 | ### 🙌 Shout-outs
64 |
65 | Kudos to [@jroakes](https://twitter.com/jroakes) for the original script. Buy that man a 🍺 for his sterling contributions! -> [paypal.com/paypalme/codes…](https://www.paypal.com/paypalme/codeseo)
66 |
67 | Kudos also to fellow [@streamlit](https://twitter.com/streamlit) Creators:
68 |
69 | - [@napoles3D](https://twitter.com/napoles3D) who told me about the PyVis lib! 🔥
70 |
71 | - [@andfanilo](https://twitter.com/andfanilo)/[@ChristianKlose3](https://twitter.com/ChristianKlose3) for their precious advice! 🙏
72 |
73 |
74 |
75 |
76 | ### 💲 Beware of costs!
77 |
78 | The Google Natural Language API can get expensive quickly!
79 |
80 |
81 |
82 | Monitor your costs regularly via the GCP console and/or set quotas to tame that G beast! I'm planning to add a budget estimator that runs before any API call; it should come in handy. In the meantime, below is a rough back-of-the-envelope calculation.
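This sketch is illustrative only: `estimate_units` is a hypothetical helper, the 40,000-character average page size is a guess, and you'll need the current per-unit price from [Google's pricing page](https://cloud.google.com/natural-language/pricing). It assumes the billing rule quoted in the app: at least 1 unit per document, and 1 unit per 1,000 Unicode characters.

```python
import math

def estimate_units(num_pages, avg_chars_per_page):
    # At least 1 unit per document, 1 unit per 1,000 Unicode characters.
    return num_pages * max(1, math.ceil(avg_chars_per_page / 1000))

# With depth=2 and limit=3, the app can analyse up to 1 + 3 + 3**2 = 13 pages.
pages = 1 + 3 + 3**2
units = estimate_units(pages, avg_chars_per_page=40_000)
print(units)  # 520 units -> multiply by the per-unit price for a rough cost estimate
```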
83 |
84 |
85 |
86 | ### Feedback and support
87 |
88 |
89 |
90 | Wiki Topic Grapher is still in beta, so expect a few rough edges! Head over to my [Gitter page](https://gitter.im/DataChaz/WikiTopic) for bug reports, questions, or suggestions.
91 |
92 | This app is free. If it's useful to you, you can buy me a ☕ to support my work! 🙏 ▶️ [buymeacoffee.com/cwar05](https://www.buymeacoffee.com/cwar05)
93 |
94 |
95 |
96 |
97 | That's all, folks. Enjoy!
98 |
99 |
--------------------------------------------------------------------------------
/wikiOutput.html:
--------------------------------------------------------------------------------
<!-- Generated by PyVis via g4.show("wikiOutput.html") in app.py; the checked-in copy is placeholder output that gets overwritten at runtime. -->
--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 | import streamlit.components.v1 as components
3 | import pandas as pd
4 | import numpy as np
5 |
6 | import networkx as nx
7 | from networkx.readwrite import json_graph
8 | from pyvis import network as net
9 |
10 | import matplotlib.pyplot as plt
11 | import seaborn as sns
12 |
13 | from bs4 import BeautifulSoup
14 | import pywikibot
15 | import math
16 | import os
17 | import random  # used by hierarchy_pos() when no root node is supplied
18 | import re
19 | import requests
20 | import tempfile
21 | import validators
22 |
23 | from google.cloud import language
24 | from google.cloud.language import enums
25 | from google.cloud.language import types
27 |
28 |
29 | st.set_page_config(
30 | page_title="Wiki Topic Grapher",
31 | page_icon="favicon.ico",
32 | )
33 |
34 |
35 | def _max_width_():
36 | max_width_str = f"max-width: 1500px;"
37 | st.markdown(
38 | f"""
39 | <style>
40 | .reportview-container .main .block-container{{
41 | {max_width_str}
42 | }}
43 | </style>
44 | """,
45 | unsafe_allow_html=True,
46 | )
47 |
48 |
49 | _max_width_()
50 |
51 |
52 | c30, c31, c32 = st.beta_columns([1, 3.3, 3])
53 |
54 |
55 | with c30:
56 | st.markdown("###")
57 | st.image("wikilogo.png", width=520)
58 | st.header("")
59 |
60 | with c32:
61 | st.markdown("#")
62 | st.text("")
63 | st.text("")
64 | st.markdown(
65 | "###### Original script by [JR Oakes](https://twitter.com/jroakes) - Ported to [](https://www.streamlit.io/) , with :heart: by [DataChaz](https://twitter.com/DataChaz)   [](https://www.buymeacoffee.com/cwar05)"
66 | )
67 |
68 |
69 | with st.beta_expander("ℹ️ - About this app ", expanded=True):
70 | st.write(
71 | """
72 |
73 | - Wiki Topic Grapher leverages the power of the [Google Natural Language API](https://cloud.google.com/natural-language) to recursively retrieve entity relationships from any Wikipedia seed topic! 🔥
74 | - Get a network graph of these connected entities, save the graph as jpg or export the results ordered by salience to CSV!
75 | - The tool is still in Beta, with possible rough edges! [Ping me on Gitter](https://gitter.im/DataChaz/WikiTopic) for bug reports, questions, or suggestions.
76 | - Kudos to JR Oakes for the original script - [buy the man a 🍺 here!](https://www.paypal.com/paypalme/codeseo)
77 | - This app is free. If it's useful to you, you can [buy me a ☕](https://www.buymeacoffee.com/cwar05) to support my work! 🙏
78 |
79 |
80 | """
81 | )
82 |
83 | st.markdown("---")
84 |
85 |
86 | with st.beta_expander("🛠️ - How to use it ", expanded=False):
87 |
88 | st.markdown(
89 | """
90 | - Wiki Topic Grapher takes the top entities for each Wikipedia URL and follows those entities according to the specified limit and depth parameters
91 | - Here's a [neat chart](https://i.imgur.com/wZOU1wh.png) explaining how it all works"""
92 | )
93 |
94 | st.markdown("---")
95 |
96 | st.markdown(
97 | """
98 |
99 | **URL:**
100 |
101 | - Paste a Wikipedia URL
102 | - Make sure the URL belongs to https://en.wikipedia.org/
103 | - Only English is currently supported. More languages to come! :)
104 |
105 | _
106 |
107 | **Topic:**
108 |
109 | - Select "Topic" via the left-hand toggle and type your keyword
110 | - It will return the closest matching Wikipedia page for that given string
111 | - Use that method with caution as currently there's no way to get the related page before calling the API
112 | - Can be costly if the page has lots of text!
113 |
114 | _
115 |
116 | **Depth**:
117 | - The maximum recursion depth, i.e. how many levels of linked entities to follow from the seed page
118 | - Depth 1 or 2 are the recommended settings
119 | - Depth 3 and above works, but the graph may become illegible and the number of pages analysed (and API calls) grows roughly as limit^depth!
120 |
121 | _
122 |
123 | **Limit**:
124 | - The max number of entities to pull for each page
125 |
126 | """
127 | )
128 |
129 | st.markdown("---")
130 |
131 | with st.beta_expander("🔎- SEO use cases ", expanded=False):
132 | st.write(
133 | """
134 |
135 | - Research any topic then get entity associations that exist from that seed topic
136 | - Map out these related entities & alternative lexical fields with your product, service or brand
137 | - Find how well you've covered a specific topic on your website
138 | - Differentiate pages on your website!
139 |
140 | """
141 | )
142 |
143 | st.markdown("---")
144 |
145 |
146 | with st.beta_expander("🧰 - Stack + To-Do's", expanded=False):
147 |
148 | st.markdown("")
149 |
150 | st.write(
151 | """
152 | ** Stack **
153 |
154 | - 100% Python! 🐍🔥
155 | - [Google Natural Language API](https://cloud.google.com/natural-language)
156 | - [PyWikibot](https://www.mediawiki.org/wiki/Manual:Pywikibot)
157 | - [Networkx](https://networkx.org/)
158 | - [Streamlit](https://www.streamlit.io/)
159 | - [Streamlit Components](https://www.streamlit.io/components)"""
160 | )
161 |
162 | st.markdown("")
163 |
164 | st.write(
165 | """
166 |
167 | ** To-Do's **
168 |
169 | - Add a budget estimator to estimate Google Cloud Language API costs
170 | - Add a multilingual option (currently English only)
171 | - Add on-the-fly physics controls to the network graph
172 | - Exception handling is still pretty broad at the moment and could be improved
173 |
174 | """
175 | )
176 |
177 | st.markdown("---")
178 |
179 | st.markdown("## **① Upload your Google NLP key **")
180 | with st.beta_expander("ℹ️ - How to create your credentials?", expanded=False):
181 |
182 | st.write(
183 | """
184 |
185 | - In the [Cloud Console](https://console.cloud.google.com/), go to the _'Create Service Account Key'_ page
186 | - From the *Service account list*, select _'New service account'_
187 | - In the *Service account name* field, enter a name
188 | - From the *Role list*, select _'Project > Owner'_
189 | - Click create, then download your JSON key
190 | - Upload it (or drag and drop it) in the grey box below 👇
191 |
192 | """
193 | )
194 | st.markdown("---")
195 |
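# (Sketch) Once a service-account JSON key is available on disk, the usual ways to
# authenticate the NLP client are either:
#   os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/key.json"
#   client = language.LanguageServiceClient()
# or, explicitly:
#   client = language.LanguageServiceClient.from_service_account_json("/path/to/key.json")
# ("/path/to/key.json" is a placeholder.) The uploader below writes the uploaded key
# to a temporary file and uses both approaches.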
196 |
197 | # Pywikibot needs a config file
198 | pywikibot_config = r"""# -*- coding: utf-8 -*-
199 | mylang = 'en'
200 | family = 'wikipedia'
201 | usernames['wikipedia']['en'] = 'test'"""
202 |
203 | with open("user-config.py", "w", encoding="utf-8") as f:
204 | f.write(pywikibot_config)
205 |
206 | c3, c4 = st.beta_columns(2)
207 |
208 | with c3:
209 | try:
210 | uploaded_file = st.file_uploader("", type="json")
211 | with tempfile.NamedTemporaryFile(delete=False) as fp:
212 | fp.write(uploaded_file.getvalue())
213 | try:
214 | os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = fp.name
215 | with open(fp.name, "rb") as a:
216 | client = language.LanguageServiceClient.from_service_account_json(
217 | fp.name
218 | )
219 |
220 | finally:
221 | if os.path.isfile(fp.name):
222 | os.unlink(fp.name)
223 |
224 | except AttributeError:
225 | # No key uploaded yet: uploaded_file is None, so .getvalue() raises AttributeError.
226 | pass
227 |
228 | with c4:
229 | st.markdown("###")
230 | c = st.beta_container()
231 | if uploaded_file:
232 | st.success("✅ Nice! Your credentials are uploaded!")
233 |
234 |
235 | def google_nlp_entities(
236 | input,
237 | input_type="html",
238 | result_type="all",
239 | limit=10,
240 | invalid_types=["OTHER", "NUMBER", "DATE"],
241 | ):
242 |
243 | """
244 | Loads HTML or text from a URL and passes to the Google NLP API
245 | Parameters:
246 | * input: HTML or Plain Text to send to the Google Language API
247 | * input_type: Either `html` or `text` (string)
248 | * result_type: Either `all`(pull all entities) or `wikipedia` (only pull entities with Wikipedia pages)
249 | * limit: Limits the number of results to this number, sorted descending by salience.
250 | * invalid_types: A list of entity types to exclude.
251 | Returns:
252 | List of entities in format [{'name':,'type':,'salience':, 'wikipedia': }]
253 | """
254 |
255 | def get_type(t):
256 | return enums.Entity.Type(t).name
257 |
258 | if not input:
259 | print("No input content found.")
260 | return None
261 |
262 | if input_type == "html":
263 | doc_type = language.enums.Document.Type.HTML
264 | else:
265 | doc_type = language.enums.Document.Type.PLAIN_TEXT
266 |
267 | document = types.Document(content=input, type=doc_type)
268 |
269 | features = {"extract_entities": True}
270 |
271 | try:
272 | response = client.annotate_text(
273 | document=document, features=features, timeout=20
274 | )
275 | except Exception as e:
276 | print("Error with language API: ", re.sub(r"\(.*$", "", str(e)))
277 | return []
278 |
279 | used = []
280 | results = []
281 | for d in response.entities:
282 |
283 | if limit and len(results) >= limit:
284 | break
285 |
286 | if get_type(d.type) not in invalid_types and d.name not in used:
287 |
288 | data = {
289 | "name": d.name,
290 | "type": client.enums.Entity.Type(d.type).name,
291 | "salience": d.salience,
292 | }
293 | if result_type == "wikipedia":
294 | if "wikipedia_url" in d.metadata:
295 | data["wikipedia"] = d.metadata["wikipedia_url"]
296 | results.append(data)
297 | else:
298 | results.append(data)
299 |
300 | used.append(d.name)
301 |
302 | return results
303 |
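# Example usage (hypothetical input; real types and salience values come from the API):
#   entities = google_nlp_entities("Plain text of a page...", input_type="text",
#                                  result_type="wikipedia", limit=5)
#   -> [{'name': ..., 'type': ..., 'salience': ..., 'wikipedia': ...}, ...]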
304 |
305 | def load_page_title(url):
306 | """
307 | Returns the <title> text of a given URL.
308 | Parameters:
309 | * url: URL (string)
310 | Returns:
311 | Inner text of the <title> tag (string)
312 | """
313 | soup = BeautifulSoup(requests.get(url).text, "html.parser")
314 | return soup.title.text
315 |
316 |
317 | @st.cache(allow_output_mutation=True, show_spinner=False)
318 | def html_to_text(html, target_elements=None):
319 | """
320 | Transforms HTML to clean text
321 | Parameters:
322 | * html: HTML from a web page (str)
323 | * target_elements: Elements like `div` or `p` to target pulling text from. (optional) (string)
324 | Returns:
325 | Text (string)
326 | """
327 | soup = BeautifulSoup(html, "html.parser")
328 |
329 | for script in soup(
330 | ["script", "style"]
331 | ): # remove all javascript and stylesheet code
332 | script.extract()
333 |
334 | targets = []
335 |
336 | if target_elements:
337 | targets = soup.find_all(target_elements)
338 |
339 | if target_elements and len(targets) > 3:
340 | text = " ".join([t.text for t in targets])
341 | else:
342 | text = soup.get_text()
343 |
344 | # break into lines and remove leading and trailing space on each
345 | lines = (line.strip() for line in text.splitlines())
346 | # break multi-headlines into a line each
347 | chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
348 | # drop blank lines
349 | text = "\n".join(chunk for chunk in chunks if chunk)
350 | return text
351 |
352 |
353 | @st.cache(allow_output_mutation=True, show_spinner=False)
354 | def load_text_from_url(url, **data):
355 |
356 | """
357 | Loads html from a URL
358 | Parameters:
359 | * url: url of page to load (str)
360 | * timeout: request timeout in seconds (int) default: 20
361 | Returns:
362 | HTML (str)
363 | """
364 |
365 | timeout = data.get("timeout", 20)
366 |
367 | results = []
368 |
369 | try:
370 | # print("Extracting HTML from: {}".format(url))
371 | response = requests.get(
372 | url,
373 | headers={
374 | "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0"
375 | },
376 | timeout=timeout,
377 | )
378 |
379 | text = response.text
380 | status = response.status_code
381 |
382 | if status == 200 and len(text) > 0:
383 | return text
384 | else:
385 | print("Incorrect status returned: ", status)
386 |
387 | return None
388 |
389 | except Exception as e:
390 | print("Problem with url: {0}.".format(url))
391 | return None
392 |
393 |
394 | @st.cache(allow_output_mutation=True, show_spinner=False)
395 | def get_wikipedia_url(query):
396 | """
397 | Finds the closest matching Wikipedia page for a given string.
398 | Parameters:
399 | * query: Query to search Wikipedia with. (string)
400 | Returns:
401 | The top matching URL for the query. Follows redirects (string)
402 | """
403 | sitew = pywikibot.Site("en", "wikipedia")
404 | result = None
405 | print("looking up:", query)
406 | search = sitew.search(
407 | query, where="title", get_redirects=True, total=1, content=False, namespaces="0"
408 | )
409 | for page in search:
410 | if page.isRedirectPage():
411 | page = page.getRedirectTarget()
412 | result = page.full_url()
413 | break
414 |
415 | return result
416 |
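# Example usage (hypothetical query; the exact URL depends on Wikipedia's search results):
#   get_wikipedia_url("machine learning")
#   -> something like "https://en.wikipedia.org/wiki/Machine_learning"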
417 |
418 | @st.cache(allow_output_mutation=True, show_spinner=False)
419 | def recurse_entities(
420 | input_data, entity_results=[], G=nx.Graph(), current_depth=0, depth=2, limit=3
421 | ):
422 | """
423 | Recursively finds entities of connected Wikipedia topics by taking the top entities
424 | for each page and following those entities up to the specified depth
425 | Parameters:
426 | * input_data: A topic or URL. If a topic, finds the closest matching Wikipedia start page.
427 | If a URL, starts with the top entities of that page. (string)
428 | * depth: Max recursion depth (integer)
429 | * limit: The max number of entities to pull for each page. (integer)
430 | Returns:
431 | A tuple of:
432 | * entity_results: List of dictionaries of found entities.
433 | * G: Networkx graph of entities.
434 | """
435 | if isinstance(input_data, str):
436 | # Starting fresh. Make sure variables are fresh.
437 | entity_results = []
438 | G = nx.Graph()
439 | current_depth = 0
440 | if not validators.url(input_data):
441 | input_data = get_wikipedia_url(input_data)
442 | if not input_data:
443 | print("No Wikipedia URL Found.")
444 | return None, None
445 | else:
446 | print("Wikipedia URL: ", input_data)
447 | name = load_page_title(input_data).split("-")[0].strip()
448 | else:
449 | name = load_page_title(input_data)
450 | input_data = (
451 | [
452 | {
453 | "name": name.title(),
454 | "type": "START",
455 | "salience": 0.0,
456 | "wikipedia": input_data,
457 | }
458 | ]
459 | if input_data
460 | else []
461 | )
462 |
463 | # Regex for wikipedia terms to not bias entities returned
464 | subs = r"(wikipedia|wikimedia|wikitext|mediawiki|wikibase)"
465 |
466 | for d in input_data:
467 | url = d["wikipedia"]
468 | name = d["name"]
469 |
470 | print(
471 | " " * current_depth + "Level: {0} Name: {1}".format(current_depth, name)
472 | )
473 |
474 | html = load_text_from_url(url)
475 |
476 | # html_to_text will default to all text if < 4 `p` elements found.
477 | if "wikipedia.org" in url:
478 | html = html_to_text(html, target_elements="p")
479 | else:
480 | html = html_to_text(html)
481 |
482 | # Brutally strip Wikipedia-related terms so they don't bias the entity results.
483 | html = re.sub(subs, "", html, flags=re.IGNORECASE)
484 |
485 | results = [
486 | r
487 | for r in google_nlp_entities(
488 | html, input_type="text", limit=None, result_type="wikipedia"
489 | )
490 | if "wiki" not in r["name"].lower() and not G.has_node(r["name"])
491 | ][:limit]
492 | _ = [G.add_edge(name, r["name"]) for r in results]
493 | entity_results.extend(results)
494 |
495 | new_depth = int(current_depth + 1)
496 | if results and new_depth <= depth:
497 | recurse_entities(results, entity_results, G, new_depth, depth, limit)
498 |
499 | if current_depth == 0:
500 | return entity_results, G
501 |
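# Example usage (hypothetical seed topic): follow the top 3 entities two levels deep.
#   entity_results, G = recurse_entities("Machine learning", depth=2, limit=3)
#   entity_results is a list of entity dicts; G is a networkx Graph rooted at the seed page's title.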
502 |
503 | @st.cache(allow_output_mutation=True, show_spinner=False)
504 | def hierarchy_pos(G, root=None, width=1.0, vert_gap=0.2, vert_loc=0, xcenter=0.5):
505 |
506 | """
507 | From Joel's answer at https://stackoverflow.com/a/29597209/2966723.
508 | Licensed under Creative Commons Attribution-Share Alike
509 |
510 | If the graph is a tree this will return the positions to plot this in a
511 | hierarchical layout.
512 |
513 | G: the graph (must be a tree)
514 |
515 | root: the root node of current branch
516 | - if the tree is directed and this is not given,
517 | the root will be found and used
518 | - if the tree is directed and this is given, then
519 | the positions will be just for the descendants of this node.
520 | - if the tree is undirected and not given,
521 | then a random choice will be used.
522 |
523 | width: horizontal space allocated for this branch - avoids overlap with other branches
524 |
525 | vert_gap: gap between levels of hierarchy
526 |
527 | vert_loc: vertical location of root
528 |
529 | xcenter: horizontal location of root
530 | """
531 | if not nx.is_tree(G):
532 | raise TypeError("cannot use hierarchy_pos on a graph that is not a tree")
533 |
534 | if root is None:
535 | if isinstance(G, nx.DiGraph):
536 | root = next(
537 | iter(nx.topological_sort(G))
538 | ) # allows back compatibility with nx version 1.11
539 | else:
540 | root = random.choice(list(G.nodes))
541 |
542 | def _hierarchy_pos(
543 | G, root, width=1.0, vert_gap=0.2, vert_loc=0, xcenter=0.5, pos=None, parent=None
544 | ):
545 | """
546 | see hierarchy_pos docstring for most arguments
547 |
548 | pos: a dict saying where all nodes go if they have been assigned
549 | parent: parent of this branch. - only affects it if non-directed
550 |
551 | """
552 |
553 | if pos is None:
554 | pos = {root: (xcenter, vert_loc)}
555 | else:
556 | pos[root] = (xcenter, vert_loc)
557 | children = list(G.neighbors(root))
558 | if not isinstance(G, nx.DiGraph) and parent is not None:
559 | children.remove(parent)
560 | if len(children) != 0:
561 | dx = width / len(children)
562 | nextx = xcenter - width / 2 - dx / 2
563 | for child in children:
564 | nextx += dx
565 | pos = _hierarchy_pos(
566 | G,
567 | child,
568 | width=dx,
569 | vert_gap=vert_gap,
570 | vert_loc=vert_loc - vert_gap,
571 | xcenter=nextx,
572 | pos=pos,
573 | parent=root,
574 | )
575 | return pos
576 |
577 | return _hierarchy_pos(G, root, width, vert_gap, vert_loc, xcenter)
578 |
579 |
580 | def plot_entity_branches(G, w=10, h=10, c=1, font_size=14, filename=None):
581 | """
582 | Given a networkx graph, builds a recursive tree graph
583 |
584 | Parameters:
585 | * G: Networkx graph of entities.
586 | * w: Width of output plot
587 | * h: height of output plot
588 | * c: Circle percentage (float) 0.5 is a semi-circle. Range: 0.1-1.0
589 | * font_size: Font Size of labels (integer)
590 | * filename: Filename for the saved plot. Optional (string)
591 | Returns:
592 | Nothing. Plots a graph
593 |
594 | """
595 | start = list(G.nodes)[0]
596 | G = nx.bfs_tree(G, start)
597 | plt.figure(figsize=(w, h))
598 | pos = hierarchy_pos(G, start, width=float(2 * c) * math.pi, xcenter=0)
599 | new_pos = {
600 | u: (r * math.sin(theta), r * math.cos(theta)) for u, (theta, r) in pos.items()
601 | }
602 | nx.draw(
603 | G,
604 | pos=new_pos,
605 | alpha=0.8,
606 | node_size=25,
607 | with_labels=True,
608 | font_size=font_size,
609 | edge_color="gray",
610 | )
611 | nx.draw_networkx_nodes(
612 | G, pos=new_pos, nodelist=[start], node_color="blue", node_size=500
613 | )
614 |
615 | if filename:
616 | plt.savefig("{0}/{1}".format("images", filename))
617 |
618 |
619 | st.set_option("deprecation.showPyplotGlobalUse", False)
620 |
621 | st.markdown("## **② Choose a URL or a topic **")
622 |
623 | with st.beta_expander("ℹ️ - How Google Cloud pricing works ", expanded=False):
624 |
625 | st.write(
626 | """
627 | - Your usage of the Google Natural Language API is calculated in terms
628 | of "units"
629 | - Each document sent to the API for analysis is at least one unit
630 | - Documents that have more than 1,000 Unicode characters are considered as multiple units (1 unit per 1,000 characters)
631 | - More info about pricing on [Google's website](https://cloud.google.com/natural-language/pricing)
632 |
633 | """
634 | )
635 |
636 | st.markdown("---")
637 |
638 | st.text("")
639 |
640 | try:
641 |
642 | c10, c0, c8, c1, c2, c3, c4, c5, c6 = st.beta_columns(
643 | [0.10, 0.50, 0.10, 8, 0.10, 1.5, 0.10, 1.5, 0.10]
644 | )
645 |
646 | with c0:
647 | st.text("")
648 | toggle = st.select_slider("", options=("URL", "Topic"))
649 |
650 | with c1:
651 |
652 | from re import search
653 |
654 | substring = "http://|https://"
655 |
656 | if toggle == "Topic":
657 | keyword = st.text_input(
658 | "Enter a topic. (Returns the closest matching Wikipedia page for a given string)",
659 | key=1,
660 | )
661 | if keyword:
662 | if search(substring, keyword):
663 | st.warning(
664 | "⚠️ Seems like you're trying to paste a URL. Switch to 'URL' mode?"
665 | )
666 | st.stop()
667 | else:
668 | st.markdown('Keyword is "' + str(keyword) + '"')
669 |
670 | elif toggle == "URL":
671 |
672 | keyword = st.text_input(
673 | "Enter a Wikipedia URL",
674 | key=2,
675 | )
676 |
677 | if keyword:
678 | if search(substring, keyword):
679 | st.markdown('URL is "' + str(keyword) + '"')
680 | else:
681 | st.warning(
682 | "⚠️ Please check the URL format as it's invalid. It needs to start with http:// or https://. If you wanted to paste a keyword, switch to 'Topic' mode."
683 | )
684 | st.stop()
685 |
686 | with c3:
687 | depth = st.number_input(
688 | "Depth", step=1, value=1, min_value=1, max_value=3, key=1
689 | )
690 |
691 | with c5:
692 | limit = st.number_input(
693 | "Limit", step=1, value=1, min_value=1, max_value=3, key=2
694 | )
695 |
696 | c3, c4 = st.beta_columns(2)
697 |
698 | with c3:
699 | st.text("")
700 | st.text("")
701 | cButton = st.beta_container()
702 |
703 | with c4:
704 | st.text("")
705 | c30 = st.beta_container()
706 |
707 | button1 = cButton.button("✨ Happy with costs, get me the data!")
708 |
709 | if not button1 and not uploaded_file:
710 | st.stop()
711 | elif not button1 and uploaded_file:
712 | st.stop()
713 | elif button1 and not uploaded_file:
714 | c.warning("◀️ Add credentials 1st")
715 | st.stop()
716 | else:
717 | pass
718 |
719 | if button1:
720 |
721 | import time
722 |
723 | latest_iteration = st.empty()
724 | bar = st.progress(0)
725 |
726 | for i in range(100):  # purely cosmetic progress animation before the request is sent
727 | latest_iteration.markdown(f"Sending your request ({i+1} % Completed)")
728 | bar.progress(i + 1)
729 | time.sleep(0.05)
730 |
731 | data, G = recurse_entities(keyword, depth=depth, limit=limit)
732 |
733 | st.markdown("## **③ Check results! ✨**")
734 |
735 | st.text("")
736 |
737 | g4 = net.Network(
738 | directed=True,
739 | heading="",
740 | height="800px",
741 | width="800px",
742 | notebook=True,
743 | )
744 |
745 | c1, c2, c3 = st.beta_columns([1, 3, 2])
746 |
747 | with c2:
748 | g4.from_nx(G)
749 | g4.show("wikiOutput.html")
750 | HtmlFile = open("wikiOutput.html", "r")
751 | source_code = HtmlFile.read()
752 | components.html(source_code, height=1000, width=1000)
753 |
754 | c30, c31, c32 = st.beta_columns(3)
755 |
756 | with c30:
757 | c1 = st.beta_container()
758 | with c31:
759 | c2 = st.beta_container()
760 |
761 | cm = sns.light_palette("green", as_cmap=True)
762 | df = pd.DataFrame(data).sort_values(by="salience", ascending=False)
763 | df = df.reset_index()
764 | df.index += 1
765 | df = df.drop(["index"], axis=1)
766 | format_dictionary = {
767 | "salience": "{:.1%}",
768 | }
769 | dfStyled = df.style.background_gradient(cmap=cm)
770 | dfStyled2 = dfStyled.format(format_dictionary)
771 | st.table(dfStyled2)
772 |
773 | try:
774 | import base64
775 |
776 | csv = df.to_csv(index=False)
777 | b64 = base64.b64encode(csv.encode()).decode()
778 | href = f'<a href="data:file/csv;base64,{b64}" download="wiki_entities.csv">** - Download data to CSV 🎁 **</a>'  # "wiki_entities.csv" is an arbitrary filename
779 | c1.markdown(href, unsafe_allow_html=True)
780 | except NameError:
781 | pass  # nothing to export yet
782 |
783 | except Exception as e:
784 |
785 | st.warning(
786 | f"""
787 | 🤔 ** Snap! **
788 | Have you checked that:
789 | - The credentials JSON file you have added is valid?
790 | - Google Cloud's billing is enabled?
791 | - The URL you typed is a valid Wikipedia URL (that is, if you selected the "URL" option)?
792 |
793 | If this keeps happening -> [ping me on Gitter](https://gitter.im/DataChaz/WikiTopic)
794 |
795 | """
796 | )
797 |
--------------------------------------------------------------------------------