├── .readthedocs.yaml
├── LICENSE
├── README.md
├── docs
    ├── Top2Vec.md
    ├── api.rst
    ├── conf.py
    └── index.rst
├── images
    ├── doc_word_embedding.svg
    ├── hdbscan_docs.png
    ├── restful-top2vec.png
    ├── top2vec_logo.svg
    ├── topic21.png
    ├── topic29.png
    ├── topic48.png
    ├── topic61.png
    ├── topic9.png
    ├── topic_vector.svg
    ├── topic_words.svg
    └── umap_docs.png
├── notebooks
    └── CORD-19_top2vec.ipynb
├── requirements.txt
├── setup.py
└── top2vec
    ├── __init__.py
    ├── embedding.py
    ├── tests
        └── test_top2vec.py
    └── top2vec.py
/.readthedocs.yaml:
--------------------------------------------------------------------------------
1 | version: 2
2 | 
3 | sphinx:
4 |   configuration: docs/conf.py
5 | 
6 | build:
7 |   os: ubuntu-22.04
8 |   tools:
9 |     python: "3.10"
10 |   jobs:
11 |     post_create_environment:
12 |       - python -m pip install sphinx_rtd_theme
13 |       - python -m pip install recommonmark
14 | 
15 | formats: []
16 | 
17 | python:
18 |   install:
19 |     - requirements: requirements.txt
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | BSD 3-Clause License
2 | 
3 | Copyright (c) 2020, Dimo Angelov
4 | All rights reserved.
5 | 
6 | Redistribution and use in source and binary forms, with or without
7 | modification, are permitted provided that the following conditions are met:
8 | 
9 | 1. Redistributions of source code must retain the above copyright notice, this
10 |    list of conditions and the following disclaimer.
11 | 
12 | 2. Redistributions in binary form must reproduce the above copyright notice,
13 |    this list of conditions and the following disclaimer in the documentation
14 |    and/or other materials provided with the distribution.
15 | 
16 | 3. Neither the name of the copyright holder nor the names of its
17 |    contributors may be used to endorse or promote products derived from
18 |    this software without specific prior written permission.
19 | 
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | [](https://pypi.org/project/top2vec/)
2 | [](https://github.com/ddangelov/Top2Vec/blob/master/LICENSE)
3 | [](https://top2vec.readthedocs.io/en/latest/?badge=latest)
4 | [](http://arxiv.org/abs/2008.09470)
5 | 
6 | 
7 | 
8 | 
9 |
10 |
145 |
146 |
153 |
154 |
161 |
162 |
169 |
170 |
177 |
178 |
373 |
374 |
375 |
376 |
377 |
378 |
379 | ### Search Documents by Topic
380 |
381 | We are going to search by **topic 48**, a topic that appears to be about **science**.
382 | ```python
383 | documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=48, num_docs=5)
384 | ```
385 | Returns:
386 | * ``documents``: The returned documents in a list, ordered from most to least similar.
387 | 
388 | * ``document_scores``: Semantic similarity of each document to the topic, i.e. the cosine similarity
389 |     of the document vector and the topic vector.
390 | 
391 | * ``document_ids``: Unique ids of the documents. If ids were not provided, the index of the document
392 |     in the original corpus is used.
393 |
394 | For each of the returned documents we are going to print its id, score and content.
395 | ```python
396 | documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=48, num_docs=5)
397 | for doc, score, doc_id in zip(documents, document_scores, document_ids):
398 | print(f"Document: {doc_id}, Score: {score}")
399 | print("-----------")
400 | print(doc)
401 | print("-----------")
402 | print()
403 | ```
404 |
405 |
406 | Document: 15227, Score: 0.6322
407 | -----------
408 | Evolution is both fact and theory. The THEORY of evolution represents the
409 | scientific attempt to explain the FACT of evolution. The theory of evolution
410 | does not provide facts; it explains facts. It can be safely assumed that ALL
411 | scientific theories neither provide nor become facts but rather EXPLAIN facts.
412 | I recommend that you do some appropriate reading in general science. A good
413 | starting point with regard to evolution for the layman would be "Evolution as
414 | Fact and Theory" in "Hen's Teeth and Horse's Toes" [pp 253-262] by Stephen Jay
415 | Gould. There is a great deal of other useful information in this publication.
416 | -----------
417 |
418 | Document: 14515, Score: 0.6186
419 | -----------
420 | Just what are these "scientific facts"? I have never heard of such a thing.
421 | Science never proves or disproves any theory - history does.
422 |
423 | -Tim
424 | -----------
425 |
426 | Document: 9433, Score: 0.5997
427 | -----------
428 | The same way that any theory is proven false. You examine the predicitions
429 | that the theory makes, and try to observe them. If you don't, or if you
430 | observe things that the theory predicts wouldn't happen, then you have some
431 | evidence against the theory. If the theory can't be modified to
432 | incorporate the new observations, then you say that it is false.
433 |
434 | For example, people used to believe that the earth had been created
435 | 10,000 years ago. But, as evidence showed that predictions from this
436 | theory were not true, it was abandoned.
437 | -----------
438 |
439 | Document: 11917, Score: 0.5845
440 | -----------
441 | The point about its being real or not is that one does not waste time with
442 | what reality might be when one wants predictions. The questions if the
443 | atoms are there or if something else is there making measurements indicate
444 | atoms is not necessary in such a system.
445 |
446 | And one does not have to write a new theory of existence everytime new
447 | models are used in Physics.
448 | -----------
449 |
450 | ...
451 |
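Topic numbers like **48** are specific to the model trained above, so it helps to look at a topic's top words before searching it. A minimal sketch, assuming the same trained ``model``:

```python
# Inspect topic 48: get_topics returns the top words and their scores for every topic.
topic_words, word_scores, topic_nums = model.get_topics()
print(topic_words[48][:10])

# Or view the topic as a word cloud.
model.generate_topic_wordcloud(48)
```
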
452 | ### Semantic Search Documents by Keywords
453 |
454 | Search documents for content semantically similar to **cryptography** and **privacy**.
455 | ```python
456 | documents, document_scores, document_ids = model.search_documents_by_keywords(keywords=["cryptography", "privacy"], num_docs=5)
457 | for doc, score, doc_id in zip(documents, document_scores, document_ids):
458 | print(f"Document: {doc_id}, Score: {score}")
459 | print("-----------")
460 | print(doc)
461 | print("-----------")
462 | print()
463 | ```
464 | Document: 16837, Score: 0.6112
465 | -----------
466 | ...
467 | Email and account privacy, anonymity, file encryption, academic
468 | computer policies, relevant legislation and references, EFF, and
469 | other privacy and rights issues associated with use of the Internet
470 | and global networks in general.
471 | ...
472 |
473 | Document: 16254, Score: 0.5722
474 | -----------
475 | ...
476 | The President today announced a new initiative that will bring
477 | the Federal Government together with industry in a voluntary
478 | program to improve the security and privacy of telephone
479 | communications while meeting the legitimate needs of law
480 | enforcement.
481 | ...
482 | -----------
483 | ...
484 |
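The document search also accepts negative keywords to steer results away from certain terms. A hedged sketch, assuming ``search_documents_by_keywords`` takes a ``keywords_neg`` argument like ``similar_words`` below; the negative keyword here is purely illustrative:

```python
# Same query as above, but steering results away from an illustrative negative keyword.
documents, document_scores, document_ids = model.search_documents_by_keywords(
    keywords=["cryptography", "privacy"], keywords_neg=["government"], num_docs=5)
```
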
485 | ### Similar Keywords
486 |
487 | Search for similar words to **space**.
488 | ```python
489 | words, word_scores = model.similar_words(keywords=["space"], keywords_neg=[], num_words=20)
490 | for word, score in zip(words, word_scores):
491 | print(f"{word} {score}")
492 | ```
493 | space 1.0
494 | nasa 0.6589
495 | shuttle 0.5976
496 | exploration 0.5448
497 | planetary 0.5391
498 | missions 0.5069
499 | launch 0.4941
500 | telescope 0.4821
501 | astro 0.4696
502 | jsc 0.4549
503 | ames 0.4515
504 | satellite 0.446
505 | station 0.4445
506 | orbital 0.4438
507 | solar 0.4386
508 | astronomy 0.4378
509 | observatory 0.4355
510 | facility 0.4325
511 | propulsion 0.4251
512 | aerospace 0.4226
513 |
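The ``keywords_neg`` argument, left empty above, steers the word search in the same way. A short sketch with an illustrative negative keyword:

```python
# Words similar to "space" while steering away from "nasa" (illustrative choice of negative keyword).
words, word_scores = model.similar_words(keywords=["space"], keywords_neg=["nasa"], num_words=20)
for word, score in zip(words, word_scores):
    print(f"{word} {score}")
```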
--------------------------------------------------------------------------------
/docs/Top2Vec.md:
--------------------------------------------------------------------------------
1 | ../README.md
--------------------------------------------------------------------------------
/docs/api.rst:
--------------------------------------------------------------------------------
1 | Top2Vec API Guide
2 | =================
3 |
4 | .. automodule:: top2vec.top2vec
5 | :members:
6 |
--------------------------------------------------------------------------------
/docs/conf.py:
--------------------------------------------------------------------------------
1 | import sphinx_rtd_theme
2 | from recommonmark.parser import CommonMarkParser
3 |
4 | import os
5 | import sys
6 | sys.path.insert(0, os.path.abspath('..'))
7 |
8 | # Configuration file for the Sphinx documentation builder.
9 | #
10 | # This file only contains a selection of the most common options. For a full
11 | # list see the documentation:
12 | # https://www.sphinx-doc.org/en/master/usage/configuration.html
13 |
14 | # -- Path setup --------------------------------------------------------------
15 |
16 | # If extensions (or modules to document with autodoc) are in another directory,
17 | # add these directories to sys.path here. If the directory is relative to the
18 | # documentation root, use os.path.abspath to make it absolute, like shown here.
19 | #
20 | # import os
21 | # import sys
22 | # sys.path.insert(0, os.path.abspath('.'))
23 |
24 |
25 | # -- Project information -----------------------------------------------------
26 |
27 | project = 'Top2Vec'
28 | copyright = '2020, Dimo Angelov'
29 | author = 'Dimo Angelov'
30 |
31 | # The full version, including alpha/beta/rc tags
32 | release = '1.0.36'
33 |
34 |
35 | # -- General configuration ---------------------------------------------------
36 |
37 | # Add any Sphinx extension module names here, as strings. They can be
38 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
39 | # ones.
40 | extensions = ['recommonmark', 'sphinx_rtd_theme', 'sphinx.ext.autodoc', 'sphinx.ext.napoleon']
41 |
42 | # Add any paths that contain templates here, relative to this directory.
43 | templates_path = ['_templates']
44 |
45 | # List of patterns, relative to source directory, that match files and
46 | # directories to ignore when looking for source files.
47 | # This pattern also affects html_static_path and html_extra_path.
48 | exclude_patterns = []
49 |
50 |
51 | # -- Options for HTML output -------------------------------------------------
52 |
53 | # The theme to use for HTML and HTML Help pages. See the documentation for
54 | # a list of builtin themes.
55 | #
56 | #html_theme = 'alabaster'
57 |
58 |
59 | html_theme = "sphinx_rtd_theme"
60 | #html_theme_path = ["_themes", ]
61 |
62 | # Add any paths that contain custom static files (such as style sheets) here,
63 | # relative to this directory. They are copied after the builtin static files,
64 | # so a file named "default.css" will overwrite the builtin "default.css".
65 | html_static_path = ['_static']
66 |
67 | master_doc = 'index'
68 |
69 | # source_parsers = {
70 | # '.md': CommonMarkParser,
71 | # }
72 |
73 | #source_suffix = ['.rst', '.md']
74 |
75 |
--------------------------------------------------------------------------------
/docs/index.rst:
--------------------------------------------------------------------------------
1 | .. Top2Vec documentation master file, created by
2 | sphinx-quickstart on Mon Mar 23 19:00:08 2020.
3 | You can adapt this file completely to your liking, but it should at least
4 | contain the root `toctree` directive.
5 |
6 | Welcome to Top2Vec's documentation!
7 | ===================================
8 |
9 | .. toctree::
10 | :maxdepth: 2
11 | :caption: User Guide / Tutorial:
12 |
13 | Top2Vec
14 |
15 | .. toctree::
16 | :caption: API Reference:
17 |
18 | api
19 |
20 | Indices and tables
21 | ==================
22 |
23 | * :ref:`genindex`
24 | * :ref:`modindex`
25 | * :ref:`search`
26 |
--------------------------------------------------------------------------------
/images/hdbscan_docs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ddangelov/Top2Vec/2435731bc834f49aa22b38d46102bc37b960dffc/images/hdbscan_docs.png
--------------------------------------------------------------------------------
/images/restful-top2vec.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ddangelov/Top2Vec/2435731bc834f49aa22b38d46102bc37b960dffc/images/restful-top2vec.png
--------------------------------------------------------------------------------
/images/top2vec_logo.svg:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/images/topic21.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ddangelov/Top2Vec/2435731bc834f49aa22b38d46102bc37b960dffc/images/topic21.png
--------------------------------------------------------------------------------
/images/topic29.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ddangelov/Top2Vec/2435731bc834f49aa22b38d46102bc37b960dffc/images/topic29.png
--------------------------------------------------------------------------------
/images/topic48.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ddangelov/Top2Vec/2435731bc834f49aa22b38d46102bc37b960dffc/images/topic48.png
--------------------------------------------------------------------------------
/images/topic61.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ddangelov/Top2Vec/2435731bc834f49aa22b38d46102bc37b960dffc/images/topic61.png
--------------------------------------------------------------------------------
/images/topic9.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ddangelov/Top2Vec/2435731bc834f49aa22b38d46102bc37b960dffc/images/topic9.png
--------------------------------------------------------------------------------
/images/topic_vector.svg:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/images/umap_docs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ddangelov/Top2Vec/2435731bc834f49aa22b38d46102bc37b960dffc/images/umap_docs.png
--------------------------------------------------------------------------------
/notebooks/CORD-19_top2vec.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"metadata":{},"cell_type":"markdown","source":"# COVID-19: Topic Modelling and Search with Top2Vec\n\n[Top2Vec](https://github.com/ddangelov/Top2Vec) is an algorithm for **topic modelling** and **semantic search**. It **automatically** detects topics present in text and generates jointly embedded topic, document and word vectors. Once you train the Top2Vec model you can:\n* Get number of detected topics.\n* Get topics.\n* Search topics by keywords.\n* Search documents by topic.\n* Find similar words.\n* Find similar documents.\n\nThis notebook preprocesses the [Kaggle COVID-19 Dataset](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge), it treats each section of every paper as a distinct document. A Top2Vec model is trained on those documents. \n\nOnce the model is trained you can do **semantic** search for documents by topic, searching for documents with keywords, searching for topics with keywords, and for finding similar words. These methods all leverage the joint topic, document, word embeddings distances, which represent semantic similarity. \n\n### For an interactive version of this notebook with search widgets check out my [github](https://github.com/ddangelov/Top2Vec/blob/master/notebooks/CORD-19_top2vec.ipynb) or my [kaggle](https://www.kaggle.com/dangelov/covid-19-top2vec-interactive-search)!\n\n"},{"metadata":{},"cell_type":"markdown","source":"# Import and Setup "},{"metadata":{},"cell_type":"markdown","source":"### 1. Install the [Top2Vec](https://github.com/ddangelov/Top2Vec) library"},{"metadata":{"trusted":true,"_kg_hide-output":true},"cell_type":"code","source":"!pip install top2vec","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### 2. Import Libraries"},{"metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"cell_type":"code","source":"import numpy as np \nimport pandas as pd \nimport json\nimport os\nimport ipywidgets as widgets\nfrom IPython.display import clear_output, display\nfrom top2vec import Top2Vec","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Pre-process Data"},{"metadata":{},"cell_type":"markdown","source":"### 1. Import Metadata"},{"metadata":{"trusted":true},"cell_type":"code","source":"metadata_df = pd.read_csv(\"../input/CORD-19-research-challenge/metadata.csv\")\nmetadata_df.head()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### 2. Pre-process Papers\n\nA document will be created for each section of every paper. This document will contain the id, title, abstract, and setion of the paper. 
It will also contain the text of that section."},{"metadata":{"trusted":true},"cell_type":"code","source":"def preproccess_papers():\n\n dataset_dir = \"../input/CORD-19-research-challenge/\"\n comm_dir = dataset_dir+\"comm_use_subset/comm_use_subset/\"\n noncomm_dir = dataset_dir+\"noncomm_use_subset/noncomm_use_subset/\"\n custom_dir = dataset_dir+\"custom_license/custom_license/\"\n biorxiv_dir = dataset_dir+\"biorxiv_medrxiv/biorxiv_medrxiv/\"\n directories_to_process = [comm_dir,noncomm_dir, custom_dir, biorxiv_dir]\n\n papers_with_text = list(metadata_df[metadata_df.has_full_text==True].sha)\n\n paper_ids = []\n titles = []\n abstracts = []\n sections = []\n body_texts = []\n\n for directory in directories_to_process:\n\n filenames = os.listdir(directory)\n\n for filename in filenames:\n\n file = json.load(open(directory+filename, 'rb'))\n\n #check if file contains text\n if file[\"paper_id\"] in papers_with_text:\n\n section = []\n text = []\n\n for bod in file[\"body_text\"]:\n section.append(bod[\"section\"])\n text.append(bod[\"text\"])\n\n res_df = pd.DataFrame({\"section\":section, \"text\":text}).groupby(\"section\")[\"text\"].apply(' '.join).reset_index()\n\n for index, row in res_df.iterrows():\n\n # metadata\n paper_ids.append(file[\"paper_id\"])\n\n if(len(file[\"abstract\"])):\n abstracts.append(file[\"abstract\"][0][\"text\"])\n else:\n abstracts.append(\"\")\n\n titles.append(file[\"metadata\"][\"title\"])\n\n # add section and text\n sections.append(row.section)\n body_texts.append(row.text)\n\n return pd.DataFrame({\"id\":paper_ids, \"title\": titles, \"abstract\": abstracts, \"section\": sections, \"text\": body_texts})","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# papers_df = preproccess_papers()\n# papers_df.head()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### 3. Filter Short Sections"},{"metadata":{"trusted":true},"cell_type":"code","source":"def filter_short(papers_df):\n papers_df[\"token_counts\"] = papers_df[\"text\"].str.split().map(len)\n papers_df = papers_df[papers_df.token_counts>200].reset_index(drop=True)\n papers_df.drop('token_counts', axis=1, inplace=True)\n \n return papers_df\n ","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# papers_df = filter_short(papers_df)\n# papers_df.head()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Train Top2Vec Model\nParameters:\n * ``documents``: Input corpus, should be a list of strings.\n \n * ``speed``: This parameter will determine how fast the model takes to train. \n The 'fast-learn' option is the fastest and will generate the lowest quality\n vectors. The 'learn' option will learn better quality vectors but take a longer\n time to train. The 'deep-learn' option will learn the best quality vectors but \n will take significant time to train. \n \n * ``workers``: The amount of worker threads to be used in training the model. Larger\n amount will lead to faster training.\n \nSee [Documentation](https://top2vec.readthedocs.io/en/latest/README.html)."},{"metadata":{"trusted":true},"cell_type":"code","source":"# top2vec = Top2Vec(documents=papers_df.text, speed=\"learn\", workers=4)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## (Recommended) Load Pre-trained Model and Pre-processed Data :)\n\nThe Top2Vec model was trained with the 'deep-learn' speed parameter and took very long to train. 
It will give much better results than training with 'fast-learn' or 'learn'.\n\nData is available on my [kaggle](https://www.kaggle.com/dangelov/covid19top2vec)."},{"metadata":{},"cell_type":"markdown","source":"### 1. Load pre-trained Top2Vec model "},{"metadata":{"trusted":true},"cell_type":"code","source":"top2vec = Top2Vec.load(\"../input/covid19top2vec/covid19_deep_learn_top2vec\")","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### 2. Load pre-processed papers"},{"metadata":{"trusted":true},"cell_type":"code","source":"papers_df = pd.read_feather(\"../input/covid19top2vec/covid19_papers_processed.feather\")","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# Use Top2Vec for Semantic Search"},{"metadata":{},"cell_type":"markdown","source":"## 1. Search Topics "},{"metadata":{"trusted":true},"cell_type":"code","source":"keywords_select_st = widgets.Label('Enter keywords seperated by space: ')\ndisplay(keywords_select_st)\n\nkeywords_input_st = widgets.Text()\ndisplay(keywords_input_st)\n\nkeywords_neg_select_st = widgets.Label('Enter negative keywords seperated by space: ')\ndisplay(keywords_neg_select_st)\n\nkeywords_neg_input_st = widgets.Text()\ndisplay(keywords_neg_input_st)\n\ndoc_num_select_st = widgets.Label('Choose number of topics: ')\ndisplay(doc_num_select_st)\n\ndoc_num_input_st = widgets.Text(value='5')\ndisplay(doc_num_input_st)\n\ndef display_similar_topics(*args):\n \n clear_output()\n display(keywords_select_st)\n display(keywords_input_st)\n display(keywords_neg_select_st)\n display(keywords_neg_input_st)\n display(doc_num_select_st)\n display(doc_num_input_st)\n display(keyword_btn_st)\n \n try:\n topic_words, word_scores, topic_scores, topic_nums = top2vec.search_topics(keywords=keywords_input_st.value.split(),num_topics=int(doc_num_input_st.value), keywords_neg=keywords_neg_input_st.value.split())\n for topic in topic_nums:\n top2vec.generate_topic_wordcloud(topic, background_color=\"black\")\n \n except Exception as e:\n print(e)\n \nkeyword_btn_st = widgets.Button(description=\"show topics\")\ndisplay(keyword_btn_st)\nkeyword_btn_st.on_click(display_similar_topics)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## 2. 
Search Papers by Topic"},{"metadata":{"trusted":true},"cell_type":"code","source":"topic_num_select = widgets.Label('Select topic number: ')\ndisplay(topic_num_select)\n\ntopic_input = widgets.Text()\ndisplay(topic_input)\n\ndoc_num_select = widgets.Label('Choose number of documents: ')\ndisplay(doc_num_select)\n\ndoc_num_input = widgets.Text(value='10')\ndisplay(doc_num_input)\n\ndef display_topics(*args):\n \n clear_output()\n display(topic_num_select)\n display(topic_input)\n display(doc_num_select)\n display(doc_num_input)\n display(topic_btn)\n\n documents, document_scores, document_nums = top2vec.search_documents_by_topic(topic_num=int(topic_input.value), num_docs=int(doc_num_input.value))\n \n result_df = papers_df.loc[document_nums]\n result_df[\"document_scores\"] = document_scores\n \n for index,row in result_df.iterrows():\n print(f\"Document: {index}, Score: {row.document_scores}\")\n print(f\"Section: {row.section}\")\n print(f\"Title: {row.title}\")\n print(\"-----------\")\n print(row.text)\n print(\"-----------\")\n print()\n\ntopic_btn = widgets.Button(description=\"show documents\")\ndisplay(topic_btn)\ntopic_btn.on_click(display_topics)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## 3. Search Papers by Keywords"},{"metadata":{"trusted":true},"cell_type":"code","source":"keywords_select_kw = widgets.Label('Enter keywords seperated by space: ')\ndisplay(keywords_select_kw)\n\nkeywords_input_kw = widgets.Text()\ndisplay(keywords_input_kw)\n\nkeywords_neg_select_kw = widgets.Label('Enter negative keywords seperated by space: ')\ndisplay(keywords_neg_select_kw)\n\nkeywords_neg_input_kw = widgets.Text()\ndisplay(keywords_neg_input_kw)\n\ndoc_num_select_kw = widgets.Label('Choose number of documents: ')\ndisplay(doc_num_select_kw)\n\ndoc_num_input_kw = widgets.Text(value='10')\ndisplay(doc_num_input_kw)\n\ndef display_keywords(*args):\n \n clear_output()\n display(keywords_select_kw)\n display(keywords_input_kw)\n display(keywords_neg_select_kw)\n display(keywords_neg_input_kw)\n display(doc_num_select_kw)\n display(doc_num_input_kw)\n display(keyword_btn_kw)\n \n try:\n documents, document_scores, document_nums = top2vec.search_documents_by_keyword(keywords=keywords_input_kw.value.split(), num_docs=int(doc_num_input_kw.value), keywords_neg=keywords_neg_input_kw.value.split())\n result_df = papers_df.loc[document_nums]\n result_df[\"document_scores\"] = document_scores\n\n for index,row in result_df.iterrows():\n print(f\"Document: {index}, Score: {row.document_scores}\")\n print(f\"Section: {row.section}\")\n print(f\"Title: {row.title}\")\n print(\"-----------\")\n print(row.text)\n print(\"-----------\")\n print()\n \n except Exception as e:\n print(e)\n \n\nkeyword_btn_kw = widgets.Button(description=\"show documents\")\ndisplay(keyword_btn_kw)\nkeyword_btn_kw.on_click(display_keywords)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## 4. 
Find Similar Words"},{"metadata":{"trusted":true},"cell_type":"code","source":"keywords_select_sw = widgets.Label('Enter keywords seperated by space: ')\ndisplay(keywords_select_sw)\n\nkeywords_input_sw = widgets.Text()\ndisplay(keywords_input_sw)\n\nkeywords_neg_select_sw = widgets.Label('Enter negative keywords seperated by space: ')\ndisplay(keywords_neg_select_sw)\n\nkeywords_neg_input_sw = widgets.Text()\ndisplay(keywords_neg_input_sw)\n\n\ndoc_num_select_sw = widgets.Label('Choose number of words: ')\ndisplay(doc_num_select_sw)\n\ndoc_num_input_sw = widgets.Text(value='20')\ndisplay(doc_num_input_sw)\n\ndef display_similar_words(*args):\n \n clear_output()\n display(keywords_select_sw)\n display(keywords_input_sw)\n display(keywords_neg_select_sw)\n display(keywords_neg_input_sw)\n display(doc_num_select_sw)\n display(doc_num_input_sw)\n display(sim_word_btn_sw)\n \n try: \n words, word_scores = top2vec.similar_words(keywords=keywords_input_sw.value.split(), keywords_neg=keywords_neg_input_sw.value.split(), num_words=int(doc_num_input_sw.value))\n for word, score in zip(words, word_scores):\n print(f\"{word} {score}\")\n \n except Exception as e:\n print(e)\n \nsim_word_btn_sw = widgets.Button(description=\"show similar words\")\ndisplay(sim_word_btn_sw)\nsim_word_btn_sw.on_click(display_similar_words)","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"}},"nbformat":4,"nbformat_minor":4}
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy>=1.20.0
2 | scikit-learn>=1.2.0
3 | pandas
4 | gensim>=4.0.0
5 | umap-learn>=0.5.1
6 | hdbscan>=0.8.27
7 | wordcloud
8 | tensorflow
9 | tensorflow_hub
10 | tensorflow_text
11 | torch
12 | sentence_transformers
13 | hnswlib
14 | transformers
15 | tqdm
16 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | import setuptools
2 |
3 | with open("README.md", "r") as fh:
4 | long_description = fh.read()
5 |
6 | setuptools.setup(
7 | name="top2vec",
8 | packages=["top2vec"],
9 | version="1.0.36",
10 | author="Dimo Angelov",
11 | author_email="dimo.angelov@gmail.com",
12 | description="Top2Vec learns jointly embedded topic, document and word vectors.",
13 | long_description=long_description,
14 | long_description_content_type="text/markdown",
15 | url="https://github.com/ddangelov/Top2Vec",
16 | keywords="topic modeling semantic search word document embedding",
17 | license="BSD",
18 | classifiers=[
19 | "Development Status :: 3 - Alpha",
20 | "Programming Language :: Python :: 3",
21 | "Intended Audience :: Science/Research",
22 | "Intended Audience :: Developers",
23 | "Topic :: Scientific/Engineering :: Artificial Intelligence",
24 | "Topic :: Scientific/Engineering :: Information Analysis",
25 | "License :: OSI Approved :: BSD License",
26 | "Operating System :: OS Independent",
27 | ],
28 | install_requires=[
29 | 'numpy >= 1.20.0',
30 | 'pandas',
31 | 'scikit-learn >= 1.2.0',
32 | 'gensim >= 4.0.0',
33 | 'umap-learn >= 0.5.1',
34 | 'hdbscan >= 0.8.27',
35 | 'wordcloud',
36 | 'transformers',
37 | 'tqdm'
38 | ],
39 | extras_require={
40 | 'sentence_encoders': [
41 | 'tensorflow',
42 | 'tensorflow_hub',
43 | 'tensorflow_text',
44 | ],
45 | 'sentence_transformers': [
46 | 'torch',
47 | 'sentence_transformers',
48 | ],
49 | 'indexing': [
50 | 'hnswlib',
51 | ],
52 | },
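    # Note: the optional dependency groups above are standard pip extras, e.g.
    # `pip install top2vec[sentence_encoders]` or
    # `pip install top2vec[sentence_transformers,indexing]` (illustrative commands).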
53 | python_requires='>=3.10',
54 | )
55 |
--------------------------------------------------------------------------------
/top2vec/__init__.py:
--------------------------------------------------------------------------------
1 | from top2vec.top2vec import Top2Vec
2 |
3 | __version__ = '1.0.36'
4 |
--------------------------------------------------------------------------------
/top2vec/embedding.py:
--------------------------------------------------------------------------------
1 | from transformers import AutoTokenizer, AutoModel
2 | import torch
3 | from torch.utils.data import DataLoader
4 | from tqdm import tqdm
5 | import numpy as np
6 | from sklearn.preprocessing import normalize
7 |
8 |
9 | def average_embeddings(documents,
10 | batch_size=32,
11 | model_max_length=512,
12 | embedding_model='sentence-transformers/all-MiniLM-L6-v2'):
13 | tokenizer = AutoTokenizer.from_pretrained(embedding_model)
14 | model = AutoModel.from_pretrained(embedding_model)
15 |
16 | device = (
17 | "mps" if torch.backends.mps.is_available()
18 | else "cuda" if torch.cuda.is_available()
19 | else "cpu"
20 | )
21 | device = torch.device(device)
22 |
23 | data_loader = DataLoader(documents, batch_size=batch_size, shuffle=False)
24 |
25 | model.eval()
26 | model.to(device)
27 |
28 | average_embeddings = []
29 |
30 | with torch.no_grad():
31 | for batch in tqdm(data_loader, desc="Embedding vocabulary"):
32 | # Tokenize the batch with padding and truncation
33 | batch_inputs = tokenizer(
34 | batch,
35 | padding="max_length",
36 | max_length=model_max_length,
37 | truncation=True,
38 | return_tensors="pt"
39 | )
40 |
41 | batch_inputs = {k: v.to(device) for k, v in batch_inputs.items()}
42 | last_hidden_state = model(**batch_inputs).last_hidden_state
43 | avg_embedding = last_hidden_state.mean(dim=1)
44 | average_embeddings.append(avg_embedding.cpu().numpy())
45 |
46 | document_vectors = normalize(np.vstack(average_embeddings))
47 |
48 | return document_vectors
49 |
50 |
51 | def contextual_token_embeddings(documents,
52 | batch_size=32,
53 | model_max_length=512,
54 | embedding_model='sentence-transformers/all-MiniLM-L6-v2'):
55 | tokenizer = AutoTokenizer.from_pretrained(embedding_model)
56 | model = AutoModel.from_pretrained(embedding_model)
57 |
58 | device = (
59 | "mps" if torch.backends.mps.is_available()
60 | else "cuda" if torch.cuda.is_available()
61 | else "cpu"
62 | )
63 | device = torch.device(device)
64 |
65 | # DataLoader to process the documents in batches
66 | data_loader = DataLoader(documents, batch_size=batch_size, shuffle=False)
67 |
68 | model.eval()
69 | model.to(device)
70 |
71 | last_hidden_states = []
72 | all_attention_masks = []
73 | all_tokens = []
74 |
75 | # Embed documents batch-wise
76 | with torch.no_grad():
77 | for batch in tqdm(data_loader, desc="Embedding documents"):
78 | # Tokenize the batch with padding and truncation
79 | batch_inputs = tokenizer(
80 | batch,
81 | padding="max_length",
82 | max_length=model_max_length,
83 | truncation=True,
84 | return_tensors="pt"
85 | )
86 | all_attention_masks.extend(batch_inputs['attention_mask'])
87 | all_tokens.extend(batch_inputs['input_ids'])
88 | batch_inputs = {k: v.to(device) for k, v in batch_inputs.items()}
89 | last_hidden_state = model(**batch_inputs).last_hidden_state
90 | last_hidden_states.append(last_hidden_state.cpu())
91 |
92 | # Concatenate the embeddings from all batches
93 | all_hidden_states = torch.cat(last_hidden_states, dim=0)
94 |
95 | document_token_embeddings = []
96 | document_tokens = []
97 | document_labels = []
98 |
99 | for ind, (hidden_state, attention_mask, tokens) in enumerate(
100 | zip(all_hidden_states, all_attention_masks, all_tokens)):
101 | embeddings = hidden_state[attention_mask.nonzero(as_tuple=True)]
102 | tokens = tokens[attention_mask.nonzero(as_tuple=True)]
103 | tokens = [tokenizer.decode(token) for token in tokens]
104 |
105 | document_token_embeddings.append(embeddings.detach().numpy())
106 | document_tokens.append(tokens)
107 | document_labels.extend([ind] * len(tokens))
108 |
109 | return document_token_embeddings, document_tokens, document_labels
110 |
111 |
112 | def sliding_window_average(document_token_embeddings, document_tokens, window_size, stride):
113 | # Store the averaged embeddings
114 | averaged_embeddings = []
115 | chunk_tokens = []
116 |
117 | # Iterate over each document
118 | for doc, tokens in tqdm(zip(document_token_embeddings, document_tokens)):
119 | doc_averages = []
120 |
121 | # Slide the window over the document with the specified stride
122 | for i in range(0, len(doc), stride):
123 |
124 | start = i
125 | end = i + window_size
126 |
127 | if start != 0 and end > len(doc):
128 | start = len(doc) - window_size
129 | end = len(doc)
130 |
131 | window = doc[start:end]
132 |
133 | # Calculate the average embedding for the current window
134 | window_average = np.mean(window, axis=0)
135 |
136 | doc_averages.append(window_average)
137 | chunk_tokens.append(" ".join(tokens[start:end]))
138 |
139 | averaged_embeddings.append(doc_averages)
140 |
141 | averaged_embeddings = np.vstack(averaged_embeddings)
142 | averaged_embeddings = normalize(averaged_embeddings)
143 |
144 | return averaged_embeddings, chunk_tokens
145 |
146 |
147 | def average_adjacent_tokens(token_embeddings, window_size):
148 | num_tokens, embedding_size = token_embeddings.shape
149 | averaged_embeddings = np.zeros_like(token_embeddings)
150 |
151 | token_embeddings = normalize(token_embeddings)
152 |
153 | # Define the range to consider based on window_size
154 | for i in range(num_tokens):
155 | start_idx = max(0, i - window_size)
156 | end_idx = min(num_tokens, i + window_size + 1)
157 |
158 | # Compute the average for the current token within the specified window
159 | averaged_embeddings[i] = np.mean(token_embeddings[start_idx:end_idx], axis=0)
160 |
161 | return averaged_embeddings
162 |
163 |
164 | def smooth_document_token_embeddings(document_token_embeddings, window_size=2):
165 | smoothed_document_embeddings = []
166 |
167 | for doc in tqdm(document_token_embeddings, desc="Smoothing document token embeddings"):
168 | smoothed_doc = average_adjacent_tokens(doc, window_size=window_size)
169 | smoothed_document_embeddings.append(smoothed_doc)
170 |
171 | return smoothed_document_embeddings
172 |
--------------------------------------------------------------------------------
/top2vec/tests/test_top2vec.py:
--------------------------------------------------------------------------------
1 | import pytest
2 | from top2vec.top2vec import Top2Vec
3 | from sklearn.datasets import fetch_20newsgroups
4 | import numpy as np
5 | import tempfile
6 | import tensorflow_hub as hub
7 |
8 | # get 20 newsgroups data
9 | newsgroups_train = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
10 | newsgroups_documents = newsgroups_train.data[0:2000]
11 |
12 | # train top2vec model without doc_ids provided
13 | top2vec = Top2Vec(documents=newsgroups_documents, speed="fast-learn", workers=8)
14 |
15 | # train top2vec model with doc_ids provided
16 | doc_ids = [str(num) for num in range(0, len(newsgroups_documents))]
17 | top2vec_docids = Top2Vec(documents=newsgroups_documents, document_ids=doc_ids, speed="fast-learn", workers=8)
18 |
19 | # train top2vec model without saving documents
20 | top2vec_no_docs = Top2Vec(documents=newsgroups_documents, keep_documents=False, speed="fast-learn", workers=8)
21 |
22 | # train top2vec model with corpus_file
23 | top2vec_corpus_file = Top2Vec(documents=newsgroups_documents, use_corpus_file=True, speed="fast-learn", workers=8)
24 |
25 | # test USE
26 | top2vec_use = Top2Vec(documents=newsgroups_documents, embedding_model='universal-sentence-encoder')
27 |
28 | # test USE with model embedding
29 | top2vec_use_model_embedding = Top2Vec(documents=newsgroups_documents,
30 | embedding_model='universal-sentence-encoder',
31 | use_embedding_model_tokenizer=True)
32 |
33 | # test USE-multilang
34 | top2vec_use_multilang = Top2Vec(documents=newsgroups_documents,
35 | embedding_model='universal-sentence-encoder-multilingual')
36 |
37 | # test Sentence Transformer-multilang
38 | top2vec_transformer_multilang = Top2Vec(documents=newsgroups_documents,
39 | embedding_model='distiluse-base-multilingual-cased')
40 |
41 | # test Sentence Transformer with model embedding
42 | top2vec_transformer_model_embedding = Top2Vec(documents=newsgroups_documents,
43 | embedding_model='distiluse-base-multilingual-cased',
44 | use_embedding_model_tokenizer=True)
45 |
46 | top2vec_transformer_use_large = Top2Vec(documents=newsgroups_documents,
47 | embedding_model='universal-sentence-encoder-large',
48 | use_embedding_model_tokenizer=True,
49 | split_documents=True)
50 |
51 | top2vec_transformer_use_multilang_large = Top2Vec(documents=newsgroups_documents,
52 | embedding_model='universal-sentence-encoder-multilingual-large',
53 | use_embedding_model_tokenizer=True,
54 | split_documents=True,
55 | document_chunker='random')
56 |
57 | top2vec_transformer_sbert_l6 = Top2Vec(documents=newsgroups_documents,
58 | embedding_model='all-MiniLM-L6-v2',
59 | use_embedding_model_tokenizer=True,
60 | split_documents=True,
61 | document_chunker='sequential')
62 |
63 | top2vec_transformer_sbert_l12 = Top2Vec(documents=newsgroups_documents,
64 | embedding_model='paraphrase-multilingual-MiniLM-L12-v2',
65 | use_embedding_model_tokenizer=True,
66 | split_documents=True,
67 | document_chunker='random'
68 | )
69 |
70 | model = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')
71 | top2vec_model_callable = Top2Vec(documents=newsgroups_documents,
72 | embedding_model=model,
73 | use_embedding_model_tokenizer=True,
74 | split_documents=True,
75 | document_chunker='random'
76 | )
77 |
78 | top2vec_ngrams = Top2Vec(documents=newsgroups_documents,
79 | speed="fast-learn",
80 | ngram_vocab=True,
81 | workers=8)
82 |
83 | top2vec_use_ngrams = Top2Vec(documents=newsgroups_documents,
84 | embedding_model='universal-sentence-encoder',
85 | ngram_vocab=True)
86 |
87 | models = [top2vec, top2vec_docids, top2vec_no_docs, top2vec_corpus_file,
88 | top2vec_use, top2vec_use_multilang, top2vec_transformer_multilang,
89 | top2vec_use_model_embedding, top2vec_transformer_model_embedding,
90 | top2vec_transformer_use_large,
91 | top2vec_transformer_use_multilang_large,
92 | top2vec_transformer_sbert_l6,
93 | top2vec_transformer_sbert_l12,
94 | top2vec_model_callable,
95 | top2vec_ngrams,
96 | top2vec_use_ngrams]
97 |
98 |
99 | @pytest.mark.parametrize('top2vec_model', models)
100 | def test_add_documents_original(top2vec_model):
101 | num_docs = top2vec_model.document_vectors.shape[0]
102 |
103 | docs_to_add = newsgroups_train.data[0:100]
104 |
105 | topic_count_sum = sum(top2vec_model.get_topic_sizes()[0])
106 |
107 | if top2vec_model.document_ids_provided is False:
108 | top2vec_model.add_documents(docs_to_add)
109 | else:
110 | doc_ids_new = [str(num) for num in range(2000, 2000 + len(docs_to_add))]
111 | top2vec_model.add_documents(docs_to_add, doc_ids_new)
112 |
113 | topic_count_sum_new = sum(top2vec_model.get_topic_sizes()[0])
114 | num_docs_new = top2vec_model.document_vectors.shape[0]
115 |
116 | assert topic_count_sum + len(docs_to_add) == topic_count_sum_new == num_docs + len(docs_to_add) \
117 | == num_docs_new == len(top2vec_model.doc_top)
118 |
119 | if top2vec_model.documents is not None:
120 | assert num_docs_new == len(top2vec_model.documents)
121 |
122 |
123 | @pytest.mark.parametrize('top2vec_model', models)
124 | def test_compute_topics(top2vec_model):
125 | top2vec_model.compute_topics()
126 |
127 | num_topics = top2vec_model.get_num_topics()
128 | words, word_scores, topic_nums = top2vec_model.get_topics()
129 |
130 | # check that for each topic there are words, word_scores and topic_nums
131 | assert len(words) == len(word_scores) == len(topic_nums) == num_topics
132 |
133 | # check that for each word there is a score
134 | assert len(words[0]) == len(word_scores[0])
135 |
136 | # check that topics words are returned in decreasing order
137 | topic_words_scores = word_scores[0]
138 | assert all(topic_words_scores[i] >= topic_words_scores[i + 1] for i in range(len(topic_words_scores) - 1))
139 |
140 |
141 | @pytest.mark.parametrize('top2vec_model', models)
142 | def test_hierarchical_topic_reduction(top2vec_model):
143 | num_topics = top2vec_model.get_num_topics()
144 |
145 | if num_topics > 10:
146 | reduced_num = 10
147 | elif num_topics - 1 > 0:
148 | reduced_num = num_topics - 1
149 |
150 | hierarchy = top2vec_model.hierarchical_topic_reduction(reduced_num)
151 |
152 | assert len(hierarchy) == reduced_num == len(top2vec_model.topic_vectors_reduced)
153 |
154 |
155 | @pytest.mark.parametrize('top2vec_model', models)
156 | def test_add_documents_post_reduce(top2vec_model):
157 | docs_to_add = newsgroups_train.data[500:600]
158 |
159 | num_docs = top2vec_model.document_vectors.shape[0]
160 | topic_count_sum = sum(top2vec_model.get_topic_sizes()[0])
161 | topic_count_reduced_sum = sum(top2vec_model.get_topic_sizes(reduced=True)[0])
162 |
163 | if top2vec_model.document_ids_provided is False:
164 | top2vec_model.add_documents(docs_to_add)
165 | else:
166 | doc_ids_new = [str(num) for num in range(2100, 2100 + len(docs_to_add))]
167 | top2vec_model.add_documents(docs_to_add, doc_ids_new)
168 |
169 | topic_count_sum_new = sum(top2vec_model.get_topic_sizes()[0])
170 | topic_count_reduced_sum_new = sum(top2vec_model.get_topic_sizes(reduced=True)[0])
171 |
172 | num_docs_new = top2vec_model.document_vectors.shape[0]
173 |
174 | assert topic_count_sum + len(docs_to_add) == topic_count_sum_new == topic_count_reduced_sum + len(docs_to_add) \
175 | == topic_count_reduced_sum_new == num_docs + len(docs_to_add) == num_docs_new == len(top2vec_model.doc_top) \
176 | == len(top2vec_model.doc_top_reduced)
177 |
178 | if top2vec_model.documents is not None:
179 | assert num_docs_new == len(top2vec_model.documents)
180 |
181 |
182 | @pytest.mark.parametrize('top2vec_model', models)
183 | def test_delete_documents(top2vec_model):
184 | doc_ids_to_delete = list(range(500, 550))
185 |
186 | num_docs = top2vec_model.document_vectors.shape[0]
187 | topic_count_sum = sum(top2vec_model.get_topic_sizes()[0])
188 | topic_count_reduced_sum = sum(top2vec_model.get_topic_sizes(reduced=True)[0])
189 |
190 | if top2vec_model.document_ids_provided is False:
191 | top2vec_model.delete_documents(doc_ids=doc_ids_to_delete)
192 | else:
193 | doc_ids_to_delete = [str(doc_id) for doc_id in doc_ids_to_delete]
194 | top2vec_model.delete_documents(doc_ids=doc_ids_to_delete)
195 |
196 | topic_count_sum_new = sum(top2vec_model.get_topic_sizes()[0])
197 | topic_count_reduced_sum_new = sum(top2vec_model.get_topic_sizes(reduced=True)[0])
198 | num_docs_new = top2vec_model.document_vectors.shape[0]
199 |
200 | assert topic_count_sum - len(doc_ids_to_delete) == topic_count_sum_new == topic_count_reduced_sum - \
201 | len(doc_ids_to_delete) == topic_count_reduced_sum_new == num_docs - len(doc_ids_to_delete) \
202 | == num_docs_new == len(top2vec_model.doc_top) == len(top2vec_model.doc_top_reduced)
203 |
204 | if top2vec_model.documents is not None:
205 | assert num_docs_new == len(top2vec_model.documents)
206 |
207 |
208 | @pytest.mark.parametrize('top2vec_model', models)
209 | def test_get_topic_hierarchy(top2vec_model):
210 | hierarchy = top2vec_model.get_topic_hierarchy()
211 |
212 | assert len(hierarchy) == len(top2vec_model.topic_vectors_reduced)
213 |
214 |
215 | @pytest.mark.parametrize('top2vec_model', models)
216 | @pytest.mark.parametrize('reduced', [False, True])
217 | def test_get_num_topics(top2vec_model, reduced):
218 | # check that there are more than 0 topics
219 | assert top2vec_model.get_num_topics(reduced=reduced) > 0
220 |
221 |
222 | @pytest.mark.parametrize('top2vec_model', models)
223 | @pytest.mark.parametrize('reduced', [False, True])
224 | def test_get_topics(top2vec_model, reduced):
225 | num_topics = top2vec_model.get_num_topics(reduced=reduced)
226 | words, word_scores, topic_nums = top2vec_model.get_topics(reduced=reduced)
227 |
228 | # check that for each topic there are words, word_scores and topic_nums
229 | assert len(words) == len(word_scores) == len(topic_nums) == num_topics
230 |
231 | # check that for each word there is a score
232 | assert len(words[0]) == len(word_scores[0])
233 |
234 | # check that topics words are returned in decreasing order
235 | topic_words_scores = word_scores[0]
236 | assert all(topic_words_scores[i] >= topic_words_scores[i + 1] for i in range(len(topic_words_scores) - 1))
237 |
238 |
239 | @pytest.mark.parametrize('top2vec_model', models)
240 | @pytest.mark.parametrize('reduced', [False, True])
241 | def test_get_topic_size(top2vec_model, reduced):
242 | topic_sizes, topic_nums = top2vec_model.get_topic_sizes(reduced=reduced)
243 |
244 | # check that topic sizes add up to number of documents
245 | assert sum(topic_sizes) == top2vec_model.document_vectors.shape[0]
246 |
247 | # check that topics are ordered decreasingly
248 | assert all(topic_sizes[i] >= topic_sizes[i + 1] for i in range(len(topic_sizes) - 1))
249 |
250 |
251 | @pytest.mark.parametrize('top2vec_model', models)
252 | @pytest.mark.parametrize('reduced', [False, True])
253 | def test_generate_topic_wordcloud(top2vec_model, reduced):
254 | # generate word cloud
255 | num_topics = top2vec_model.get_num_topics(reduced=reduced)
256 | top2vec_model.generate_topic_wordcloud(num_topics - 1, reduced=reduced)
257 |
258 |
259 | @pytest.mark.parametrize('top2vec_model', models)
260 | @pytest.mark.parametrize('reduced', [False, True])
261 | def test_search_documents_by_topic(top2vec_model, reduced):
262 | # get topic sizes
263 | topic_sizes, topic_nums = top2vec_model.get_topic_sizes(reduced=reduced)
264 | topic = topic_nums[0]
265 | num_docs = topic_sizes[0]
266 |
267 | # search documents by topic
268 | if top2vec_model.documents is not None:
269 | documents, document_scores, document_ids = top2vec_model.search_documents_by_topic(topic, num_docs,
270 | reduced=reduced)
271 | else:
272 | document_scores, document_ids = top2vec_model.search_documents_by_topic(topic, num_docs, reduced=reduced)
273 |
274 | # check that for each document there is a score and number
275 | if top2vec_model.documents is not None:
276 | assert len(documents) == len(document_scores) == len(document_ids) == num_docs
277 | else:
278 | assert len(document_scores) == len(document_ids) == num_docs
279 |
280 | # check that documents are returned in decreasing order
281 | assert all(document_scores[i] >= document_scores[i + 1] for i in range(len(document_scores) - 1))
282 |
283 | # check that all documents returned are most similar to topic being searched
284 | document_indexes = [top2vec_model.doc_id2index[doc_id] for doc_id in document_ids]
285 |
286 | if reduced:
287 | doc_topics = set(np.argmax(
288 | np.inner(top2vec_model.document_vectors[document_indexes],
289 | top2vec_model.topic_vectors_reduced), axis=1))
290 | else:
291 | doc_topics = set(np.argmax(
292 | np.inner(top2vec_model.document_vectors[document_indexes],
293 | top2vec_model.topic_vectors), axis=1))
294 | assert len(doc_topics) == 1 and topic in doc_topics
295 |
296 |
297 | @pytest.mark.parametrize('top2vec_model', models)
298 | def test_search_documents_by_keywords(top2vec_model):
299 | keywords = top2vec_model.vocab
300 | keyword = keywords[-1]
301 | num_docs = 10
302 |
303 | if top2vec_model.documents is not None:
304 | documents, document_scores, document_ids = top2vec_model.search_documents_by_keywords(keywords=[keyword],
305 | num_docs=num_docs)
306 | else:
307 | document_scores, document_ids = top2vec_model.search_documents_by_keywords(keywords=[keyword],
308 | num_docs=num_docs)
309 |
310 | # check that for each document there is a score and number
311 | if top2vec_model.documents is not None:
312 | assert len(documents) == len(document_scores) == len(document_ids) == num_docs
313 | else:
314 | assert len(document_scores) == len(document_ids) == num_docs
315 |
316 | # check that documents are returned in decreasing order
317 | assert all(document_scores[i] >= document_scores[i + 1] for i in range(len(document_scores) - 1))
318 |
319 |
320 | @pytest.mark.parametrize('top2vec_model', models)
321 | def test_similar_words(top2vec_model):
322 | keywords = top2vec_model.vocab
323 | keyword = keywords[-1]
324 | num_words = 20
325 |
326 | words, word_scores = top2vec_model.similar_words(keywords=[keyword], num_words=num_words)
327 |
328 | # check that there is a score for each word
329 | assert len(words) == len(word_scores) == num_words
330 |
331 | # check that words are returned in decreasing order
332 | assert all(word_scores[i] >= word_scores[i + 1] for i in range(len(word_scores) - 1))
333 |
334 |
335 | @pytest.mark.parametrize('top2vec_model', models)
336 | @pytest.mark.parametrize('reduced', [False, True])
337 | def test_search_topics(top2vec_model, reduced):
338 | num_topics = top2vec_model.get_num_topics(reduced=reduced)
339 | keywords = top2vec_model.vocab
340 | keyword = keywords[-1]
341 | topic_words, word_scores, topic_scores, topic_nums = top2vec_model.search_topics(keywords=[keyword],
342 | num_topics=num_topics,
343 | reduced=reduced)
344 | # check that for each topic there are topic words, word scores, topic scores and score of topic
345 | assert len(topic_words) == len(word_scores) == len(topic_scores) == len(topic_nums) == num_topics
346 |
347 | # check that for each topic words have scores
348 | assert len(topic_words[0]) == len(word_scores[0])
349 |
350 | # check that topics are returned in decreasing order
351 | assert all(topic_scores[i] >= topic_scores[i + 1] for i in range(len(topic_scores) - 1))
352 |
353 | # check that topics words are returned in decreasing order
354 | topic_words_scores = word_scores[0]
355 | assert all(topic_words_scores[i] >= topic_words_scores[i + 1] for i in range(len(topic_words_scores) - 1))
356 |
357 |
358 | @pytest.mark.parametrize('top2vec_model', models)
359 | def test_search_document_by_documents(top2vec_model):
360 | doc_id = top2vec_model.document_ids[0]
361 |
362 | num_docs = 10
363 |
364 | if top2vec_model.documents is not None:
365 | documents, document_scores, document_ids = top2vec_model.search_documents_by_documents(doc_ids=[doc_id],
366 | num_docs=num_docs)
367 | else:
368 | document_scores, document_ids = top2vec_model.search_documents_by_documents(doc_ids=[doc_id],
369 | num_docs=num_docs)
370 |
371 | # check that for each document there is a score and number
372 | if top2vec_model.documents is not None:
373 | assert len(documents) == len(document_scores) == len(document_ids) == num_docs
374 | else:
375 | assert len(document_scores) == len(document_ids) == num_docs
376 |
377 | # check that documents are returned in decreasing order
378 | assert all(document_scores[i] >= document_scores[i + 1] for i in range(len(document_scores) - 1))
379 |
380 |
381 | @pytest.mark.parametrize('top2vec_model', models)
382 | def test_get_documents_topics(top2vec_model):
383 | doc_ids_get = top2vec_model.document_ids[[0, 5]]
384 |
385 | if top2vec_model.hierarchy is not None:
386 | doc_topics, doc_dist, topic_words, topic_word_scores = top2vec_model.get_documents_topics(doc_ids=doc_ids_get,
387 | reduced=True)
388 | else:
389 | doc_topics, doc_dist, topic_words, topic_word_scores = top2vec_model.get_documents_topics(doc_ids=doc_ids_get)
390 |
391 | assert len(doc_topics) == len(doc_dist) == len(topic_words) == len(topic_word_scores) == len(doc_ids_get)
392 |
393 |
394 | @pytest.mark.parametrize('top2vec_model', models)
395 | def test_get_documents_topics_multiple(top2vec_model):
396 | doc_ids_get = top2vec_model.document_ids[[0, 1, 5]]
397 | num_topics = 2
398 |
399 | if top2vec_model.hierarchy is not None:
400 | doc_topics, doc_dist, topic_words, topic_word_scores = top2vec_model.get_documents_topics(doc_ids=doc_ids_get,
401 | reduced=True,
402 | num_topics=num_topics)
403 |
404 | actual_number_topics = top2vec_model.get_num_topics(reduced=True)
405 |
406 | else:
407 | doc_topics, doc_dist, topic_words, topic_word_scores = top2vec_model.get_documents_topics(doc_ids=doc_ids_get,
408 | num_topics=num_topics)
409 |
410 | actual_number_topics = top2vec_model.get_num_topics(reduced=False)
411 |
412 | assert len(doc_topics) == len(doc_dist) == len(topic_words) == len(topic_word_scores) == len(doc_ids_get)
413 |
414 | if num_topics <= actual_number_topics:
415 | assert doc_topics.shape[1] == num_topics
416 | assert doc_dist.shape[1] == num_topics
417 | assert topic_words.shape[1] == num_topics
418 | assert topic_word_scores.shape[1] == num_topics
419 |
420 |
421 | @pytest.mark.parametrize('top2vec_model', models)
422 | def test_search_documents_by_vector(top2vec_model):
423 | document_vectors = top2vec_model.document_vectors
424 | top2vec_model.search_documents_by_vector(vector=document_vectors[0], num_docs=10)
425 |
426 | num_docs = 10
427 |
428 | if top2vec_model.documents is not None:
429 | documents, document_scores, document_ids = top2vec_model.search_documents_by_vector(vector=document_vectors[0],
430 | num_docs=num_docs)
431 | else:
432 | document_scores, document_ids = top2vec_model.search_documents_by_vector(vector=document_vectors[0],
433 | num_docs=num_docs)
434 | if top2vec_model.documents is not None:
435 | assert len(documents) == len(document_scores) == len(document_ids) == num_docs
436 | else:
437 | assert len(document_scores) == len(document_ids) == num_docs
438 |
439 | # check that documents are returned in decreasing order
440 | assert all(document_scores[i] >= document_scores[i + 1] for i in range(len(document_scores) - 1))
441 |
442 |
443 | @pytest.mark.parametrize('top2vec_model', models)
444 | def test_index_documents(top2vec_model):
445 | top2vec_model.index_document_vectors()
446 |     assert top2vec_model.document_vectors.shape[0] <= top2vec_model.document_index.get_max_elements()
447 |
448 |
449 | @pytest.mark.parametrize('top2vec_model', models)
450 | def test_search_documents_by_vector_index(top2vec_model):
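    # same checks as test_search_documents_by_vector, but searching through the
    # prebuilt document index (use_index=True)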
451 | document_vectors = top2vec_model.document_vectors
452 | top2vec_model.search_documents_by_vector(vector=document_vectors[0], num_docs=10)
453 |
454 | num_docs = 10
455 |
456 | if top2vec_model.documents is not None:
457 | documents, document_scores, document_ids = top2vec_model.search_documents_by_vector(vector=document_vectors[0],
458 | num_docs=num_docs,
459 | use_index=True)
460 | else:
461 | document_scores, document_ids = top2vec_model.search_documents_by_vector(vector=document_vectors[0],
462 | num_docs=num_docs,
463 | use_index=True)
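    # check that for each document there is a score and an id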
464 | if top2vec_model.documents is not None:
465 | assert len(documents) == len(document_scores) == len(document_ids) == num_docs
466 | else:
467 | assert len(document_scores) == len(document_ids) == num_docs
468 |
469 | # check that documents are returned in decreasing order
470 | assert all(document_scores[i] >= document_scores[i + 1] for i in range(len(document_scores) - 1))
471 |
472 |
473 | @pytest.mark.parametrize('top2vec_model', models)
474 | def test_search_documents_by_keywords_index(top2vec_model):
475 | keywords = top2vec_model.vocab
476 | keyword = keywords[-1]
477 | num_docs = 10
478 |
479 | if top2vec_model.documents is not None:
480 | documents, document_scores, document_ids = top2vec_model.search_documents_by_keywords(keywords=[keyword],
481 | num_docs=num_docs,
482 | use_index=True)
483 | else:
484 | document_scores, document_ids = top2vec_model.search_documents_by_keywords(keywords=[keyword],
485 | num_docs=num_docs,
486 | use_index=True)
487 |
488 |     # check that for each document there is a score and an id
489 | if top2vec_model.documents is not None:
490 | assert len(documents) == len(document_scores) == len(document_ids) == num_docs
491 | else:
492 | assert len(document_scores) == len(document_ids) == num_docs
493 |
494 | # check that documents are returned in decreasing order
495 | assert all(document_scores[i] >= document_scores[i + 1] for i in range(len(document_scores) - 1))
496 |
497 |
498 | @pytest.mark.parametrize('top2vec_model', models)
499 | def test_search_document_by_documents_index(top2vec_model):
500 | doc_id = top2vec_model.document_ids[0]
501 |
502 | num_docs = 10
503 |
504 | if top2vec_model.documents is not None:
505 | documents, document_scores, document_ids = top2vec_model.search_documents_by_documents(doc_ids=[doc_id],
506 | num_docs=num_docs,
507 | use_index=True)
508 | else:
509 | document_scores, document_ids = top2vec_model.search_documents_by_documents(doc_ids=[doc_id],
510 | num_docs=num_docs,
511 | use_index=True)
512 |
513 |     # check that for each document there is a score and an id
514 | if top2vec_model.documents is not None:
515 | assert len(documents) == len(document_scores) == len(document_ids) == num_docs
516 | else:
517 | assert len(document_scores) == len(document_ids) == num_docs
518 |
519 | # check that documents are returned in decreasing order
520 | assert all(document_scores[i] >= document_scores[i + 1] for i in range(len(document_scores) - 1))
521 |
522 |
523 | @pytest.mark.parametrize('top2vec_model', models)
524 | def test_search_words_by_vector(top2vec_model):
525 | word_vectors = top2vec_model.word_vectors
526 | top2vec_model.search_words_by_vector(vector=word_vectors[0], num_words=10)
527 |
528 | num_words = 10
529 |
530 | words, word_scores = top2vec_model.search_words_by_vector(vector=word_vectors[0],
531 | num_words=num_words)
532 |
533 | # check that there is a score for each word
534 | assert len(words) == len(word_scores) == num_words
535 |
536 | # check that words are returned in decreasing order
537 | assert all(word_scores[i] >= word_scores[i + 1] for i in range(len(word_scores) - 1))
538 |
539 |
540 | @pytest.mark.parametrize('top2vec_model', models)
541 | def test_index_words(top2vec_model):
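    # check that the word index capacity covers every word vector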
542 | top2vec_model.index_word_vectors()
543 |     assert top2vec_model.word_vectors.shape[0] <= top2vec_model.word_index.get_max_elements()
544 |
545 |
546 | @pytest.mark.parametrize('top2vec_model', models)
547 | def test_similar_words_index(top2vec_model):
548 | keywords = top2vec_model.vocab
549 | keyword = keywords[-1]
550 | num_words = 20
551 |
552 | words, word_scores = top2vec_model.similar_words(keywords=[keyword], num_words=num_words, use_index=True)
553 |
554 | # check that there is a score for each word
555 | assert len(words) == len(word_scores) == num_words
556 |
557 | # check that words are returned in decreasing order
558 | assert all(word_scores[i] >= word_scores[i + 1] for i in range(len(word_scores) - 1))
559 |
560 |
561 | @pytest.mark.parametrize('top2vec_model', models)
562 | def test_save_load(top2vec_model):
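    # check that the model survives a save/load round trip through a temporary file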
563 | if top2vec_model.embedding_model == "custom":
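        # a custom embedding callable is presumably not saved with the model, so keep
        # a reference and re-attach it with set_embedding_model after the round trip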
564 | model = top2vec_model.embed
565 | temp = tempfile.NamedTemporaryFile(mode='w+b')
566 | top2vec_model.save(temp.name)
567 | Top2Vec.load(temp.name)
568 | temp.close()
569 | top2vec_model.set_embedding_model(model)
570 |
571 | else:
572 | temp = tempfile.NamedTemporaryFile(mode='w+b')
573 | top2vec_model.save(temp.name)
574 | Top2Vec.load(temp.name)
575 | temp.close()
576 |
577 |
578 | @pytest.mark.parametrize('top2vec_model', models)
579 | def test_query_documents(top2vec_model):
580 | num_docs = 10
581 |
582 | if top2vec_model.documents is not None:
583 | documents, document_scores, document_ids = top2vec_model.query_documents(query="what is the meaning of life?",
584 | num_docs=num_docs)
585 | else:
586 | document_scores, document_ids = top2vec_model.query_documents(query="what is the meaning of life?",
587 | num_docs=num_docs)
588 |
589 |     # check that for each document there is a score and an id
590 | if top2vec_model.documents is not None:
591 | assert len(documents) == len(document_scores) == len(document_ids) == num_docs
592 | else:
593 | assert len(document_scores) == len(document_ids) == num_docs
594 |
595 | # check that documents are returned in decreasing order
596 | assert all(document_scores[i] >= document_scores[i + 1] for i in range(len(document_scores) - 1))
597 |
598 |
599 | @pytest.mark.parametrize('top2vec_model', models)
600 | def test_query_topics(top2vec_model):
601 | num_topics = top2vec_model.get_num_topics()
602 | topic_words, word_scores, topic_scores, topic_nums = top2vec_model.query_topics(query="what is the "
603 | "meaning of life?",
604 | num_topics=num_topics)
605 |
606 |     # check that for each topic there are topic words, word scores, a topic score and a topic number
607 | assert len(topic_words) == len(word_scores) == len(topic_scores) == len(topic_nums) == num_topics
608 |
609 | # check that for each topic words have scores
610 | assert len(topic_words[0]) == len(word_scores[0])
611 |
612 | # check that topics are returned in decreasing order
613 | assert all(topic_scores[i] >= topic_scores[i + 1] for i in range(len(topic_scores) - 1))
614 |
615 | # check that topics words are returned in decreasing order
616 | topic_words_scores = word_scores[0]
617 | assert all(topic_words_scores[i] >= topic_words_scores[i + 1] for i in range(len(topic_words_scores) - 1))
618 |
--------------------------------------------------------------------------------