├── favicon.ico ├── wikilogo.png ├── user-config.py ├── requirements.txt ├── LICENSE ├── README.md ├── wikiOutput.html └── app.py /favicon.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CharlyWargnier/S4_wiki_topic_grapher/HEAD/favicon.ico -------------------------------------------------------------------------------- /wikilogo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CharlyWargnier/S4_wiki_topic_grapher/HEAD/wikilogo.png -------------------------------------------------------------------------------- /user-config.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | mylang = 'en' 3 | family = 'wikipedia' 4 | usernames['wikipedia']['en'] = 'test' -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | streamlit==0.75.0 2 | 3 | pyvis 4 | google-api-core==1.21.0 5 | google-auth==1.19.1 6 | google-cloud-language==1.3.0 7 | googleapis-common-protos==1.52.0 8 | 9 | beautifulsoup4==4.9.3 10 | matplotlib 11 | networkx==2.5 12 | pandas==1.2.1 13 | pywikibot==5.1.0 14 | requests==2.24.0 15 | seaborn==0.11.0 16 | validators==0.18.1 17 | 18 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Charly Wargnier 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | # Wiki Topic Grapher 3 | 4 | Leverage the power of Google [#NLP](https://threadreaderapp.com/hashtag/NLP) to retrieve entity relationships from Wikipedia URLs or topics! 5 | 6 | - Get interactive graphs of connected entities 7 | - Export results with entity types and salience to CSV! 8 | 9 | _________________ 10 | 11 | [![Open in Streamlit](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://share.streamlit.io/charlywargnier/s4_wiki_topic_grapher/main/app.py) 12 | 13 | 14 | ### Use cases 15 | 16 | Many cool use cases! 
17 | 
18 | - Research any topic, then get the entity associations that exist from that seed topic
19 | 
20 | - Map out related entities with your product, service or brand
21 | 
22 | - Find how well you've covered a specific topic on your website
23 | 
24 | - Differentiate your pages!
25 | 
26 | 
27 | ### Stack
28 | 
29 | 
30 | 
31 | About the stack: it's 100% [#Python](https://threadreaderapp.com/hashtag/Python)! 🐍🔥
32 | 
33 | - [@GCPcloud](https://twitter.com/GCPcloud) Natural Language API
34 | 
35 | - PyWikibot
36 | 
37 | - Networkx
38 | 
39 | - PyVis
40 | 
41 | - [@streamlit](https://twitter.com/streamlit)
42 | 
43 | - Streamlit Components -> [streamlit.io/components](https://www.streamlit.io/components)
44 | 
45 | 
46 | 
47 | 
48 | ### **⚒️ Still To-Do's**
49 | 
50 | - 💰 Add a budget estimator to get a sense of [@GCPcloud](https://twitter.com/GCPcloud) costs!
51 | 
52 | - 🌍 Add a multilingual option (currently English only)
53 | 
54 | - 📈 Add on-the-fly physics controls to the network graph
55 | 
56 | - 💯 Add the Google Knowledge Graph [#API](https://threadreaderapp.com/hashtag/API) to pull in more data (scores, etc.) (ht [@LoukilAymen](https://twitter.com/LoukilAymen))
57 | 
58 | 
59 | That code currently lives in a private repo. I should be able to make it public soon so you can reuse it in your own apps and creations! I just need to clean it up a tad, remove some sensitive bits, etc.
60 | 
61 | 
62 | 
63 | ### 🙌 Shout-outs
64 | 
65 | Kudos to [@jroakes](https://twitter.com/jroakes) for the original script. Buy that man a 🍺 for his sterling contributions! -> [paypal.com/paypalme/codes…](https://www.paypal.com/paypalme/codeseo)
66 | 
67 | Kudos also to fellow [@streamlit](https://twitter.com/streamlit) Creators:
68 | 
69 | - [@napoles3D](https://twitter.com/napoles3D), who told me about the PyVis lib! 🔥
70 | 
71 | - [@andfanilo](https://twitter.com/andfanilo)/[@ChristianKlose3](https://twitter.com/ChristianKlose3) for their precious advice! 🙏
72 | 
73 | 
74 | 
75 | 
76 | ### 💲 Beware of costs!
77 | 
78 | It can get expensive quickly with the Google Natural Language API!
79 | 
80 | 
81 | 
82 | Monitor your costs regularly via the GCP console and/or set quotas to tame that G beast! I'm planning to add a budget estimator that runs before any API calls are made. It should come in handy.
83 | 
84 | 
85 | 
86 | ### Feedback and support
87 | 
88 | 
89 | 
90 | Wiki Topic Grapher is still in beta, so expect some rough edges! Head over to my [Gitter page](https://gitter.im/DataChaz/WikiTopic) for bug reports, questions, or suggestions.
91 | 
92 | This app is free. If it's useful to you, you can buy me a ☕ to support my work! 🙏 ▶️ [buymeacoffee.com/cwar05](https://www.buymeacoffee.com/cwar05)
93 | 
94 | 
95 | 
96 | 
97 | That's all, folks. Enjoy!
98 | 
99 | 
--------------------------------------------------------------------------------
/wikiOutput.html:
--------------------------------------------------------------------------------
[Generated pyvis network graph output. The HTML markup was stripped in this dump; only scattered line numbers and the placeholder heading text "asdsad!" remain.]
28 | 29 | 30 | 101 | 102 | 103 | -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | import streamlit as st 2 | import streamlit.components.v1 as components 3 | import pandas as pd 4 | import numpy as np 5 | 6 | import networkx as nx 7 | from networkx.readwrite import json_graph 8 | from pyvis import network as net 9 | 10 | import matplotlib.pyplot as plt 11 | import seaborn as sns 12 | 13 | from bs4 import BeautifulSoup 14 | import pywikibot 15 | import math 16 | import os 17 | import re 18 | import requests 19 | import tempfile 20 | import validators 21 | 22 | from google.cloud import language 23 | from google.cloud.language import enums 24 | from google.cloud.language import types 25 | from google.cloud import language_v1 26 | from google.cloud.language_v1 import enums 27 | 28 | 29 | st.set_page_config( 30 | page_title="Wiki Topic Grapher", 31 | page_icon="favicon.ico", 32 | ) 33 | 34 | 35 | def _max_width_(): 36 | max_width_str = f"max-width: 1500px;" 37 | st.markdown( 38 | f""" 39 | 44 | """, 45 | unsafe_allow_html=True, 46 | ) 47 | 48 | 49 | _max_width_() 50 | 51 | 52 | c30, c31, c32 = st.beta_columns([1, 3.3, 3]) 53 | 54 | 55 | with c30: 56 | st.markdown("###") 57 | st.image("wikilogo.png", width=520) 58 | st.header("") 59 | 60 | with c32: 61 | st.markdown("#") 62 | st.text("") 63 | st.text("") 64 | st.markdown( 65 | "###### Original script by [JR Oakes](https://twitter.com/jroakes) - Ported to [![this is an image link](https://i.imgur.com/iIOA6kU.png)](https://www.streamlit.io/) , with :heart: by [DataChaz](https://twitter.com/DataChaz)   [![this is an image link](https://i.imgur.com/thJhzOO.png)](https://www.buymeacoffee.com/cwar05)" 66 | ) 67 | 68 | 69 | with st.beta_expander("ℹ️ - About this app ", expanded=True): 70 | st.write( 71 | """ 72 | 73 | - Wiki Topic Grapher leverages the power of [Google Natural Language API] (https://cloud.google.com/natural-language) to recursively retrieve entity relationships from any Wikipedia seed topic! 🔥 74 | - Get a network graph of these connected entities, save the graph as jpg or export the results ordered by salience to CSV! 75 | - The tool is still in Beta, with possible rough edges! [![Gitter](https://badges.gitter.im/gitterHQ/gitter.png)](https://gitter.im/DataChaz/WikiTopic) for bug report, questions, or suggestions. 76 | - Kudos to JR Oakes for the original script - [buy the man a 🍺 here!](https://www.paypal.com/paypalme/codeseo) 77 | - This app is free. If it's useful to you, you can [buy me a ☕](https://www.buymeacoffee.com/cwar05) to support my work! 🙏 78 | 79 | 80 | """ 81 | ) 82 | 83 | st.markdown("---") 84 | 85 | 86 | with st.beta_expander("🛠️ - How to use it ", expanded=False): 87 | 88 | st.markdown( 89 | """ 90 | - Wiki Topic Grapher takes the top entities for each Wikipedia URL and follows those entities according to the specified limit and depth parameters 91 | - Here's a [neat chart](https://i.imgur.com/wZOU1wh.png) explaining how it all works""" 92 | ) 93 | 94 | st.markdown("---") 95 | 96 | st.markdown( 97 | """ 98 | 99 | **URL:** 100 | 101 | - Paste a Wikipedia URL 102 | - Make sure the URL belongs to https://en.wikipedia.org/ 103 | - Only English is currently supported. More languages to come! 
:) 104 | 105 | _ 106 | 107 | **Topic:** 108 | 109 | - Select "Topic" via the left-hand toggle and type your keyword 110 | - It will return the closest matching Wikipedia page for that given string 111 | - Use that method with caution as currently there's no way to get the related page before calling the API 112 | - Can be costly if the page has lots of text! 113 | 114 | _ 115 | 116 | **Depth**: 117 | - The maximum number of entities to pull for each Wikipedia page 118 | - Depth 1 or 2 are the recommended settings 119 | - Depth 3 and above work yet it may not be usable nor legible! 120 | 121 | _ 122 | 123 | **Limit**: 124 | - The max number of entities to pull for each page 125 | 126 | """ 127 | ) 128 | 129 | st.markdown("---") 130 | 131 | with st.beta_expander("🔎- SEO use cases ", expanded=False): 132 | st.write( 133 | """ 134 | 135 | - Research any topic then get entity associations that exist from that seed topic 136 | - Map out these related entities & alternative lexical fields with your product, service or brand 137 | - Find how well you've covered a specific topic on your website 138 | - Differentiate pages on your website! 139 | 140 | """ 141 | ) 142 | 143 | st.markdown("---") 144 | 145 | 146 | with st.beta_expander("🧰 - Stack + To-Do's", expanded=False): 147 | 148 | st.markdown("") 149 | 150 | st.write( 151 | """ 152 | ** Stack ** 153 | 154 | - 100% Python! 🐍🔥 155 | - [Google Natural Language API](https://cloud.google.com/natural-language) 156 | - [PyWikibot](https://www.mediawiki.org/wiki/Manual:Pywikibot) 157 | - [Networkx](https://networkx.org/) 158 | - [Streamlit](https://www.streamlit.io/) 159 | - [Streamlit Components](https://www.streamlit.io/components)""" 160 | ) 161 | 162 | st.markdown("") 163 | 164 | st.write( 165 | """ 166 | 167 | ** To-Do's ** 168 | 169 | - Add a budget estimator to estimate Google Cloud Language API costs 170 | - Add a multilingual option (currently English only) 171 | - Add on-the-fly physics controls to the network graph 172 | - Exception handling is still pretty broad at the moment and could be improved 173 | 174 | """ 175 | ) 176 | 177 | st.markdown("---") 178 | 179 | st.markdown("## **① Upload your Google NLP key **") 180 | with st.beta_expander("ℹ️ - How to create your credentials?", expanded=False): 181 | 182 | st.write( 183 | """ 184 | 185 | - In the [Cloud Console](https://console.cloud.google.com/), go to the _'Create Service Account Key'_ page 186 | - From the *Service account list*, select _'New service account'_ 187 | - In the *Service account name* field, enter a name 188 | - From the *Role list*, select _'Project > Owner'_ 189 | - Click create, then download your JSON key 190 | - Upload it (or drag and drop it) in the grey box below 👇 191 | 192 | """ 193 | ) 194 | st.markdown("---") 195 | 196 | 197 | # Pywikibot needs a config file 198 | pywikibot_config = r"""# -*- coding: utf-8 -*- 199 | mylang = 'en' 200 | family = 'wikipedia' 201 | usernames['wikipedia']['en'] = 'test'""" 202 | 203 | with open("user-config.py", "w", encoding="utf-8") as f: 204 | f.write(pywikibot_config) 205 | 206 | c3, c4 = st.beta_columns(2) 207 | 208 | with c3: 209 | try: 210 | uploaded_file = st.file_uploader("", type="json") 211 | with tempfile.NamedTemporaryFile(delete=False) as fp: 212 | fp.write(uploaded_file.getvalue()) 213 | try: 214 | os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = fp.name 215 | with open(fp.name, "rb") as a: 216 | client = language.LanguageServiceClient.from_service_account_json( 217 | fp.name 218 | ) 219 | 220 | finally: 221 | if 
os.path.isfile(fp.name): 222 | os.unlink(fp.name) 223 | 224 | except AttributeError: 225 | 226 | print("wait") 227 | 228 | with c4: 229 | st.markdown("###") 230 | c = st.beta_container() 231 | if uploaded_file: 232 | st.success("✅ Nice! Your credentials are uploaded!") 233 | 234 | 235 | def google_nlp_entities( 236 | input, 237 | input_type="html", 238 | result_type="all", 239 | limit=10, 240 | invalid_types=["OTHER", "NUMBER", "DATE"], 241 | ): 242 | 243 | """ 244 | Loads HTML or text from a URL and passes to the Google NLP API 245 | Parameters: 246 | * input: HTML or Plain Text to send to the Google Language API 247 | * input_type: Either `html` or `text` (string) 248 | * result_type: Either `all`(pull all entities) or `wikipedia` (only pull entities with Wikipedia pages) 249 | * limit: Limits the number of results to this number sorted, decending, by salience. 250 | * invalid_types: A list of entity types to exclude. 251 | Returns: 252 | List of entities in format [{'name':,'type':,'salience':, 'wikipedia': }] 253 | """ 254 | 255 | def get_type(type): 256 | return client.enums.Entity.Type(d.type).name 257 | 258 | if not input: 259 | print("No input content found.") 260 | return None 261 | 262 | if input_type == "html": 263 | doc_type = language.enums.Document.Type.HTML 264 | else: 265 | doc_type = language.enums.Document.Type.PLAIN_TEXT 266 | 267 | document = types.Document(content=input, type=doc_type) 268 | 269 | features = {"extract_entities": True} 270 | 271 | try: 272 | response = client.annotate_text( 273 | document=document, features=features, timeout=20 274 | ) 275 | except Exception as e: 276 | print("Error with language API: ", re.sub(r"\(.*$", "", str(e))) 277 | return [] 278 | 279 | used = [] 280 | results = [] 281 | for d in response.entities: 282 | 283 | if limit and len(results) >= limit: 284 | break 285 | 286 | if get_type(d.type) not in invalid_types and d.name not in used: 287 | 288 | data = { 289 | "name": d.name, 290 | "type": client.enums.Entity.Type(d.type).name, 291 | "salience": d.salience, 292 | } 293 | if result_type is "wikipedia": 294 | if "wikipedia_url" in d.metadata: 295 | data["wikipedia"] = d.metadata["wikipedia_url"] 296 | results.append(data) 297 | else: 298 | results.append(data) 299 | 300 | used.append(d.name) 301 | 302 | return results 303 | 304 | 305 | def load_page_title(url): 306 | """ 307 | Returns the given a URL. 308 | Parameters: 309 | * url: URL (string) 310 | Returns: 311 | Inner text of <title> (string) 312 | """ 313 | soup = BeautifulSoup(requests.get(url).text) 314 | return soup.title.text 315 | 316 | 317 | @st.cache(allow_output_mutation=True, show_spinner=False) 318 | def html_to_text(html, target_elements=None): 319 | """ 320 | Transforms HTML to clean text 321 | Parameters: 322 | * html: HTML from a web page (str) 323 | * target_elements: Elements like `div` or `p` to target pulling text from. 
(optional) (string) 324 | Returns: 325 | Text (string) 326 | """ 327 | soup = BeautifulSoup(html) 328 | 329 | for script in soup( 330 | ["script", "style"] 331 | ): # remove all javascript and stylesheet code 332 | script.extract() 333 | 334 | targets = [] 335 | 336 | if target_elements: 337 | targets = soup.find_all(target_elements) 338 | 339 | if target_elements and len(targets) > 3: 340 | text = " ".join([t.text for t in targets]) 341 | else: 342 | text = soup.get_text() 343 | 344 | # break into lines and remove leading and trailing space on each 345 | lines = (line.strip() for line in text.splitlines()) 346 | # break multi-headlines into a line each 347 | chunks = (phrase.strip() for line in lines for phrase in line.split(" ")) 348 | # drop blank lines 349 | text = "\n".join(chunk for chunk in chunks if chunk) 350 | return text 351 | 352 | 353 | @st.cache(allow_output_mutation=True, show_spinner=False) 354 | def load_text_from_url(url, **data): 355 | 356 | """ 357 | Loads html from a URL 358 | Parameters: 359 | * url: url of page to load (str) 360 | * timeout: request timeout in seconds (int) default: 20 361 | Returns: 362 | HTML (str) 363 | """ 364 | 365 | timeout = data.get("timeout", 20) 366 | 367 | results = [] 368 | 369 | try: 370 | # print("Extracting HTML from: {}".format(url)) 371 | response = requests.get( 372 | url, 373 | headers={ 374 | "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0" 375 | }, 376 | timeout=timeout, 377 | ) 378 | 379 | text = response.text 380 | status = response.status_code 381 | 382 | if status == 200 and len(text) > 0: 383 | return text 384 | else: 385 | print("Incorrect status returned: ", status) 386 | 387 | return None 388 | 389 | except Exception as e: 390 | print("Problem with url: {0}.".format(url)) 391 | return None 392 | 393 | 394 | @st.cache(allow_output_mutation=True, show_spinner=False) 395 | def get_wikipedia_url(query): 396 | """ 397 | Finds the closest matching Wikipedia page for a given string. 398 | Parameters: 399 | * query: Query to search Wikipedia with. (string) 400 | Returns: 401 | The top matching URL for the query. Follows redirects (string) 402 | """ 403 | sitew = pywikibot.Site("en", "wikipedia") 404 | result = None 405 | print("looking up:", query) 406 | search = sitew.search( 407 | query, where="title", get_redirects=True, total=1, content=False, namespaces="0" 408 | ) 409 | for page in search: 410 | if page.isRedirectPage(): 411 | page = page.getRedirectTarget() 412 | result = page.full_url() 413 | break 414 | 415 | return result 416 | 417 | 418 | @st.cache(allow_output_mutation=True, show_spinner=False) 419 | def recurse_entities( 420 | input_data, entity_results=[], G=nx.Graph(), current_depth=0, depth=2, limit=3 421 | ): 422 | """ 423 | Recursively finds entities of connected Wikipedia topics by taking the top entities 424 | for each page and following those entities up to the specified depth 425 | Parameters: 426 | * input_data: A topic or URL. If topic, finds the closes matching Wikipedia start page. 427 | If URL, starts with the top enetities of that page. (string) 428 | * depth: Max recursion depth (integer) 429 | * limit: The max number of entities to pull for each page. (integer) 430 | Returns: 431 | A tuple of: 432 | * entity_results: List of dictionaries of found entities. 433 | * G: Networkx graph of entities. 434 | """ 435 | if isinstance(input_data, str): 436 | # Starting fresh. Make sure variables are fresh. 
437 | entity_results = [] 438 | G = nx.Graph() 439 | current_depth = 0 440 | if not validators.url(input_data): 441 | input_data = get_wikipedia_url(input_data) 442 | if not input_data: 443 | print("No Wikipedia URL Found.") 444 | return None, None 445 | else: 446 | print("Wikipedia URL: ", input_data) 447 | name = load_page_title(input_data).split("-")[0].strip() 448 | else: 449 | name = load_page_title(input_data) 450 | input_data = ( 451 | [ 452 | { 453 | "name": name.title(), 454 | "type": "START", 455 | "salience": 0.0, 456 | "wikipedia": input_data, 457 | } 458 | ] 459 | if input_data 460 | else [] 461 | ) 462 | 463 | # Regex for wikipedia terms to not bias entities returned 464 | subs = r"(wikipedia|wikimedia|wikitext|mediawiki|wikibase)" 465 | 466 | for d in input_data: 467 | url = d["wikipedia"] 468 | name = d["name"] 469 | 470 | print( 471 | " " * current_depth + "Level: {0} Name: {1}".format(current_depth, name) 472 | ) 473 | 474 | html = load_text_from_url(url) 475 | 476 | # html_to_text will default to all text if < 4 `p` elements found. 477 | if "wikipedia.org" in url: 478 | html = html_to_text(html, target_elements="p") 479 | else: 480 | html = html_to_text(html) 481 | 482 | # Kill brutally wikipedia terms. 483 | html = re.sub(subs, "", html, flags=re.IGNORECASE) 484 | 485 | results = [ 486 | r 487 | for r in google_nlp_entities( 488 | html, input_type="text", limit=None, result_type="wikipedia" 489 | ) 490 | if "wiki" not in r["name"].lower() and not G.has_node(r["name"]) 491 | ][:limit] 492 | _ = [G.add_edge(name, r["name"]) for r in results] 493 | entity_results.extend(results) 494 | 495 | new_depth = int(current_depth + 1) 496 | if results and new_depth <= depth: 497 | recurse_entities(results, entity_results, G, new_depth, depth, limit) 498 | 499 | if current_depth == 0: 500 | return entity_results, G 501 | 502 | 503 | @st.cache(allow_output_mutation=True, show_spinner=False) 504 | def hierarchy_pos(G, root=None, width=1.0, vert_gap=0.2, vert_loc=0, xcenter=0.5): 505 | 506 | """ 507 | From Joel's answer at https://stackoverflow.com/a/29597209/2966723. 508 | Licensed under Creative Commons Attribution-Share Alike 509 | 510 | If the graph is a tree this will return the positions to plot this in a 511 | hierarchical layout. 512 | 513 | G: the graph (must be a tree) 514 | 515 | root: the root node of current branch 516 | - if the tree is directed and this is not given, 517 | the root will be found and used 518 | - if the tree is directed and this is given, then 519 | the positions will be just for the descendants of this node. 520 | - if the tree is undirected and not given, 521 | then a random choice will be used. 
522 | 523 | width: horizontal space allocated for this branch - avoids overlap with other branches 524 | 525 | vert_gap: gap between levels of hierarchy 526 | 527 | vert_loc: vertical location of root 528 | 529 | xcenter: horizontal location of root 530 | """ 531 | if not nx.is_tree(G): 532 | raise TypeError("cannot use hierarchy_pos on a graph that is not a tree") 533 | 534 | if root is None: 535 | if isinstance(G, nx.DiGraph): 536 | root = next( 537 | iter(nx.topological_sort(G)) 538 | ) # allows back compatibility with nx version 1.11 539 | else: 540 | root = random.choice(list(G.nodes)) 541 | 542 | def _hierarchy_pos( 543 | G, root, width=1.0, vert_gap=0.2, vert_loc=0, xcenter=0.5, pos=None, parent=None 544 | ): 545 | """ 546 | see hierarchy_pos docstring for most arguments 547 | 548 | pos: a dict saying where all nodes go if they have been assigned 549 | parent: parent of this branch. - only affects it if non-directed 550 | 551 | """ 552 | 553 | if pos is None: 554 | pos = {root: (xcenter, vert_loc)} 555 | else: 556 | pos[root] = (xcenter, vert_loc) 557 | children = list(G.neighbors(root)) 558 | if not isinstance(G, nx.DiGraph) and parent is not None: 559 | children.remove(parent) 560 | if len(children) != 0: 561 | dx = width / len(children) 562 | nextx = xcenter - width / 2 - dx / 2 563 | for child in children: 564 | nextx += dx 565 | pos = _hierarchy_pos( 566 | G, 567 | child, 568 | width=dx, 569 | vert_gap=vert_gap, 570 | vert_loc=vert_loc - vert_gap, 571 | xcenter=nextx, 572 | pos=pos, 573 | parent=root, 574 | ) 575 | return pos 576 | 577 | return _hierarchy_pos(G, root, width, vert_gap, vert_loc, xcenter) 578 | 579 | 580 | def plot_entity_branches(G, w=10, h=10, c=1, font_size=14, filename=None): 581 | """ 582 | Given a networkx graph, builds a recursive tree graph 583 | 584 | Parameters: 585 | * G: Networkx graph of entities. 586 | * w: Width of output plot 587 | * h: height of output plot 588 | * c: Circle percentage (float) 0.5 is a semi-circle. Range: 0.1-1.0 589 | * font_size: Font Size of labels (integer) 590 | * filename: Filename for the saved plot. Optional (string) 591 | Returns: 592 | Nothing. 
Plots a graph 593 | 594 | """ 595 | start = list(G.nodes)[0] 596 | G = nx.bfs_tree(G, start) 597 | plt.figure(figsize=(w, h)) 598 | pos = hierarchy_pos(G, start, width=float(2 * c) * math.pi, xcenter=0) 599 | new_pos = { 600 | u: (r * math.sin(theta), r * math.cos(theta)) for u, (theta, r) in pos.items() 601 | } 602 | nx.draw( 603 | G, 604 | pos=new_pos, 605 | alpha=0.8, 606 | node_size=25, 607 | with_labels=True, 608 | font_size=font_size, 609 | edge_color="gray", 610 | ) 611 | nx.draw_networkx_nodes( 612 | G, pos=new_pos, nodelist=[start], node_color="blue", node_size=500 613 | ) 614 | 615 | if filename: 616 | plt.savefig("{0}/{1}".format("images", filename)) 617 | 618 | 619 | st.set_option("deprecation.showPyplotGlobalUse", False) 620 | 621 | st.markdown("## **② Choose a URL or a topic **") 622 | 623 | with st.beta_expander("ℹ️ - How Google Cloud pricing works ", expanded=False): 624 | 625 | st.write( 626 | """ 627 | - Your usage of the Google Natural Language API is calculated in terms 628 | of "units" 629 | - Each document sent to the API for analysis is at least one unit 630 | - Documents that have more than 1,000 Unicode characters are considered as multiple units (1 unit per 1,000 characters) 631 | - More info about pricing on [Google's website](https://cloud.google.com/natural-language/pricing) 632 | 633 | """ 634 | ) 635 | 636 | st.markdown("---") 637 | 638 | st.text("") 639 | 640 | try: 641 | 642 | c10, c0, c8, c1, c2, c3, c4, c5, c6 = st.beta_columns( 643 | [0.10, 0.50, 0.10, 8, 0.10, 1.5, 0.10, 1.5, 0.10] 644 | ) 645 | 646 | with c0: 647 | st.text("") 648 | toggle = st.select_slider("", options=("URL", "Tpc")) 649 | 650 | with c1: 651 | 652 | from re import search 653 | 654 | substring = "http://|https://" 655 | 656 | if toggle == "Tpc": 657 | keyword = st.text_input( 658 | "Enter a topic. (Returns the closest matching Wikipedia page for a given string)", 659 | key=1, 660 | ) 661 | if keyword: 662 | if search(substring, keyword): 663 | st.warning( 664 | "⚠️ Seems like you're trying to paste a URL. Switch to 'URL' mode?" 665 | ) 666 | st.stop() 667 | else: 668 | st.markdown('Keyword is "' + str(keyword) + '"') 669 | 670 | elif toggle == "URL": 671 | 672 | keyword = st.text_input( 673 | "Enter a Wikipedia URL", 674 | key=2, 675 | ) 676 | 677 | if keyword: 678 | if search(substring, keyword): 679 | st.markdown('URL is "' + str(keyword) + '"') 680 | else: 681 | st.warning( 682 | "⚠️ Please check the URL format as it's invalid. It needs to start with http:// or https://. If you wanted to paste a keyword, switch to 'Topic' mode." 
683 | ) 684 | st.stop() 685 | 686 | with c3: 687 | depth = st.number_input( 688 | "Depth", step=1, value=1, min_value=1, max_value=3, key=1 689 | ) 690 | 691 | with c5: 692 | limit = st.number_input( 693 | "Limit", step=1, value=1, min_value=1, max_value=3, key=2 694 | ) 695 | 696 | c3, c4 = st.beta_columns(2) 697 | 698 | with c3: 699 | st.text("") 700 | st.text("") 701 | cButton = st.beta_container() 702 | 703 | with c4: 704 | st.text("") 705 | c30 = st.beta_container() 706 | 707 | button1 = cButton.button("✨ Happy with costs, get me the data!") 708 | 709 | if not button1 and not uploaded_file: 710 | st.stop() 711 | elif not button1 and uploaded_file: 712 | st.stop() 713 | elif button1 and not uploaded_file: 714 | c.warning("◀️ Add credentials 1st") 715 | st.stop() 716 | else: 717 | pass 718 | 719 | if button1: 720 | 721 | import time 722 | 723 | latest_iteration = st.empty() 724 | bar = st.progress(0) 725 | 726 | for i in range(100): 727 | latest_iteration.markdown(f"Sending your request ({i+1} % Completed)") 728 | bar.progress(i + 1) 729 | time.sleep(0.05) 730 | 731 | data, G = recurse_entities(keyword, depth=depth, limit=limit) 732 | 733 | st.markdown("## **③ Check results! ✨**") 734 | 735 | st.text("") 736 | 737 | g4 = net.Network( 738 | directed=True, 739 | heading="", 740 | height="800px", 741 | width="800px", 742 | notebook=True, 743 | ) 744 | 745 | c1, c2, c3 = st.beta_columns([1, 3, 2]) 746 | 747 | with c2: 748 | g4.from_nx(G) 749 | g4.show("wikiOutput.html") 750 | HtmlFile = open("wikiOutput.html", "r") 751 | source_code = HtmlFile.read() 752 | components.html(source_code, height=1000, width=1000) 753 | 754 | c30, c31, c32 = st.beta_columns(3) 755 | 756 | with c30: 757 | c1 = st.beta_container() 758 | with c31: 759 | c2 = st.beta_container() 760 | 761 | cm = sns.light_palette("green", as_cmap=True) 762 | df = pd.DataFrame(data).sort_values(by="salience", ascending=False) 763 | df = df.reset_index() 764 | df.index += 1 765 | df = df.drop(["index"], axis=1) 766 | format_dictionary = { 767 | "salience": "{:.1%}", 768 | } 769 | dfStyled = df.style.background_gradient(cmap=cm) 770 | dfStyled2 = dfStyled.format(format_dictionary) 771 | st.table(dfStyled2) 772 | 773 | try: 774 | import base64 775 | 776 | csv = df.to_csv(index=False) 777 | b64 = base64.b64encode(csv.encode()).decode() 778 | href = f'<a href="data:file/csv;base64,{b64}" download="listViewExport.csv">** - Download data to CSV 🎁 **</a>' 779 | c1.markdown(href, unsafe_allow_html=True) 780 | except NameError: 781 | print("wait") 782 | 783 | except Exception as e: 784 | 785 | st.warning( 786 | f""" 787 | 🤔 ** Snap! ** 788 | have you checked that: 789 | - The credentials JSON file you have added is valid? 790 | - Google Cloud's billing is enabled? 791 | - The URL you typed is a valid Wikipedia URL (that is, if you selected the "URL" option)? 792 | 793 | If this keeps happening -> [![Gitter](https://badges.gitter.im/gitterHQ/gitter.png)](https://gitter.im/DataChaz/WikiTopic) 794 | 795 | """ 796 | ) 797 | --------------------------------------------------------------------------------
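
### 💰 Bonus: a starting point for the budget estimator

The pricing expander in `app.py` explains how usage is billed: every document sent to the Natural Language API counts as at least one unit, and documents longer than 1,000 Unicode characters count as one unit per 1,000 characters. The "budget estimator" listed in the to-dos could be built directly on that rule. The snippet below is only a minimal sketch of that idea, not code from the repo: the function names and the `price_per_1000_units` argument are assumptions, and the page count is the worst case implied by the `depth`/`limit` recursion in `recurse_entities`.

```python
# Hypothetical sketch (not part of app.py): rough Natural Language API budgeting,
# based on the unit rule described in the app's pricing expander.
import math


def estimate_units(text: str) -> int:
    """Billable units for one document: at least 1, then 1 per 1,000 characters."""
    return max(1, math.ceil(len(text) / 1000))


def max_pages(depth: int, limit: int) -> int:
    """Worst-case number of pages one run analyses (the seed page is level 0)."""
    return sum(limit ** level for level in range(depth + 1))


def estimate_cost(texts, price_per_1000_units: float) -> float:
    """Rough cost for the documents of one run.

    `price_per_1000_units` is a placeholder: take the current rate from
    https://cloud.google.com/natural-language/pricing.
    """
    total_units = sum(estimate_units(t) for t in texts)
    return total_units / 1000 * price_per_1000_units
```

For example, a run with depth 2 and limit 3 analyses at most 1 + 3 + 9 = 13 pages; if each page's extracted text is around 40,000 characters, that is roughly 13 × 40 = 520 units for the run.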