├── .readthedocs.yaml ├── LICENSE ├── README.md ├── docs ├── Top2Vec.md ├── api.rst ├── conf.py └── index.rst ├── images ├── doc_word_embedding.svg ├── hdbscan_docs.png ├── restful-top2vec.png ├── top2vec_logo.svg ├── topic21.png ├── topic29.png ├── topic48.png ├── topic61.png ├── topic9.png ├── topic_vector.svg ├── topic_words.svg └── umap_docs.png ├── notebooks └── CORD-19_top2vec.ipynb ├── requirements.txt ├── setup.py └── top2vec ├── __init__.py ├── embedding.py ├── tests └── test_top2vec.py └── top2vec.py /.readthedocs.yaml: -------------------------------------------------------------------------------- 1 | version: 2 2 | 3 | sphinx: 4 | configuration: docs/conf.py 5 | 6 | build: 7 | os: ubuntu-22.04 8 | tools: 9 | python: "3.10" 10 | jobs: 11 | post_create_environment: 12 | - python -m pip install sphinx_rtd_theme 13 | - python -m pip install recommonmark 14 | 15 | formats: [] 16 | 17 | python: 18 | install: 19 | - requirements: requirements.txt -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2020, Dimo Angelov 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | 1. Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | 2. Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | 3. Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 30 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![](https://img.shields.io/pypi/v/top2vec.svg)](https://pypi.org/project/top2vec/) 2 | [![](https://img.shields.io/pypi/l/top2vec.svg)](https://github.com/ddangelov/Top2Vec/blob/master/LICENSE) 3 | [![](https://readthedocs.org/projects/top2vec/badge/?version=latest)](https://top2vec.readthedocs.io/en/latest/?badge=latest) 4 | [![](https://img.shields.io/badge/arXiv-2008.09470-00ff00.svg)](http://arxiv.org/abs/2008.09470) 5 | 6 | 7 | 8 |

9 | 10 |

11 | 12 | # Contextual Top2Vec Overview 13 | Paper: [Topic Modeling: Contextual Token Embeddings Are All You Need](https://aclanthology.org/2024.findings-emnlp.790.pdf) 14 | 15 | The Top2Vec library now supports a new contextual version, allowing for deeper topic modeling capabilities. **Contextual Top2Vec**, enables the model to generate **contextual token embeddings** for each document, identifying multiple topics per document and even detecting topic segments within a document. This enhancement is useful for capturing a nuanced understanding of topics, especially in documents that cover multiple themes. 16 | 17 | ### Key Features of Contextual Top2Vec 18 | 19 | - **`contextual_top2vec` flag**: A new parameter, `contextual_top2vec`, is added to the Top2Vec class. When set to `True`, the model uses contextual token embeddings. Only the following embedding models are supported: 20 | - `all-MiniLM-L6-v2` 21 | - `all-mpnet-base-v2` 22 | - **Topic Spans**: C-Top2Vec automatically determines the number of topics and finds topic segments within documents, allowing for a more granular topic discovery. 23 | 24 | ### Simple Usage Example 25 | 26 | Here is a simple example of how to use Contextual Top2Vec: 27 | 28 | ```python 29 | from top2vec import Top2Vec 30 | 31 | # Create a Contextual Top2Vec model 32 | top2vec_model = Top2Vec(documents=documents, 33 | ngram_vocab=True, 34 | contextual_top2vec=True) 35 | ``` 36 | 37 | ### New Methods for Contextual Top2Vec 38 | 39 | #### `get_document_topic_distribution()` 40 | 41 | ```python 42 | get_document_topic_distribution() -> np.ndarray 43 | ``` 44 | - **Description**: Retrieves the topic distribution for each document. 45 | - **Returns**: A `numpy.ndarray` of shape `(num_documents, num_topics)`. Each row represents the **probability distribution of topics** for a document. 46 | 47 | #### `get_document_topic_relevance()` 48 | 49 | ```python 50 | get_document_topic_relevance() -> np.ndarray 51 | ``` 52 | - **Description**: Provides the relevance of each topic for each document. 53 | - **Returns**: A `numpy.ndarray` of shape `(num_documents, num_topics)`. Each row indicates the **relevance scores of topics** for a document. 54 | 55 | #### `get_document_token_topic_assignment()` 56 | 57 | ```python 58 | get_document_token_topic_assignment() -> List[Document] 59 | ``` 60 | - **Description**: Retrieves token-level topic assignments for each document. 61 | - **Returns**: A list of `Document` objects, each containing topics with **token assignments and scores** for each token. 62 | 63 | #### `get_document_tokens()` 64 | 65 | ```python 66 | get_document_tokens() -> List[List[str]] 67 | ``` 68 | - **Description**: Returns the tokens for each document. 69 | - **Returns**: A list of lists where each sublist contains the **tokens for a given document**. 70 | 71 | ### Usage Note 72 | 73 | The **contextual version** of Top2Vec requires specific embedding models, and the new methods provide insights into the distribution, relevance, and assignment of topics at both the document and token levels, allowing for a richer understanding of the data. 74 | 75 | > Warning: Contextual Top2Vec is still in **beta**. You may encounter issues or unexpected behavior, and the functionality may change in future updates. 
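To make the outputs above concrete, here is a minimal sketch of how the new methods might be used together. It assumes `documents` is a list of strings and that the model was trained with `contextual_top2vec=True`; the printed shapes depend entirely on your corpus.

```python
from top2vec import Top2Vec

# documents: assumed to be a list of strings
model = Top2Vec(documents=documents, contextual_top2vec=True)

# Per-document views: both arrays have shape (num_documents, num_topics).
topic_distribution = model.get_document_topic_distribution()
topic_relevance = model.get_document_topic_relevance()

# Token-level views: the tokens of each document and the topic
# assignments/scores for those tokens.
document_tokens = model.get_document_tokens()
token_assignments = model.get_document_token_topic_assignment()

print(topic_distribution.shape)   # (num_documents, num_topics)
print(document_tokens[0][:10])    # first ten tokens of the first document
```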
76 | 77 | 78 | 79 | Citation 80 | -------- 81 | ``` 82 | @inproceedings{angelov-inkpen-2024-topic, 83 | title = "Topic Modeling: Contextual Token Embeddings Are All You Need", 84 | author = "Angelov, Dimo and 85 | Inkpen, Diana", 86 | editor = "Al-Onaizan, Yaser and 87 | Bansal, Mohit and 88 | Chen, Yun-Nung", 89 | booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024", 90 | month = nov, 91 | year = "2024", 92 | address = "Miami, Florida, USA", 93 | publisher = "Association for Computational Linguistics", 94 | url = "https://aclanthology.org/2024.findings-emnlp.790", 95 | pages = "13528--13539", 96 | abstract = "The goal of topic modeling is to find meaningful topics that capture the information present in a collection of documents. The main challenges of topic modeling are finding the optimal number of topics, labeling the topics, segmenting documents by topic, and evaluating topic model performance. Current neural approaches have tackled some of these problems but none have been able to solve all of them. We introduce a novel topic modeling approach, Contextual-Top2Vec, which uses document contextual token embeddings, it creates hierarchical topics, finds topic spans within documents and labels topics with phrases rather than just words. We propose the use of BERTScore to evaluate topic coherence and to evaluate how informative topics are of the underlying documents. Our model outperforms the current state-of-the-art models on a comprehensive set of topic model evaluation metrics.", 97 | } 98 | ``` 99 | 100 | Classic Top2Vec 101 | =============== 102 | 103 | Top2Vec is an algorithm for **topic modeling** and **semantic search**. It automatically detects topics present in text 104 | and generates jointly embedded topic, document and word vectors. Once you train the Top2Vec model 105 | you can: 106 | * Get number of detected topics. 107 | * Get topics. 108 | * Get topic sizes. 109 | * Get hierarchichal topics. 110 | * Search topics by keywords. 111 | * Search documents by topic. 112 | * Search documents by keywords. 113 | * Find similar words. 114 | * Find similar documents. 115 | * Expose model with [RESTful-Top2Vec](https://github.com/ddangelov/RESTful-Top2Vec) 116 | 117 | See the [paper](http://arxiv.org/abs/2008.09470) for more details on how it works. 118 | 119 | Benefits 120 | -------- 121 | 1. Automatically finds number of topics. 122 | 2. No stop word lists required. 123 | 3. No need for stemming/lemmatization. 124 | 4. Works on short text. 125 | 5. Creates jointly embedded topic, document, and word vectors. 126 | 6. Has search functions built in. 127 | 128 | How does it work? 129 | ----------------- 130 | 131 | The assumption the algorithm makes is that many semantically similar documents 132 | are indicative of an underlying topic. The first step is to create a joint embedding of 133 | document and word vectors. Once documents and words are embedded in a vector 134 | space the goal of the algorithm is to find dense clusters of documents, then identify which 135 | words attracted those documents together. Each dense area is a topic and the words that 136 | attracted the documents to the dense area are the topic words. 137 | 138 | ### The Algorithm: 139 | 140 | #### 1. Create jointly embedded document and word vectors using [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html) or [Universal Sentence Encoder](https://tfhub.dev/google/collections/universal-sentence-encoder/1) or [BERT Sentence Transformer](https://www.sbert.net/). 
141 | >Documents will be placed close to other similar documents and close to the most distinguishing words. 142 | 143 | 144 |

145 | 146 |
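For intuition, a minimal gensim Doc2Vec sketch of this joint embedding follows. The hyperparameters are illustrative rather than the exact configuration Top2Vec uses, and `tokenized_docs` is a hypothetical list of token lists, one per document.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tokenized_docs: hypothetical list of token lists, one per document
tagged_docs = [TaggedDocument(words=tokens, tags=[i])
               for i, tokens in enumerate(tokenized_docs)]

# DBOW (dm=0) with dbow_words=1 trains word vectors in the same space
# as the document vectors, giving the joint embedding described above.
doc2vec = Doc2Vec(tagged_docs, dm=0, dbow_words=1, vector_size=300,
                  window=15, min_count=50, epochs=40)

doc_vector = doc2vec.dv[0]         # vector for document 0
word_vector = doc2vec.wv["space"]  # vector for a word, if it is in the vocabulary
```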

147 | 148 | #### 2. Create lower dimensional embedding of document vectors using [UMAP](https://github.com/lmcinnes/umap). 149 | >Document vectors in high-dimensional space are very sparse; dimension reduction helps find dense areas. Each point is a document vector. 150 | 151 | 152 |

153 | 154 |
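A minimal sketch of this reduction with the umap-learn package is shown below; the parameter values are illustrative, and `document_vectors` is assumed to be the (num_documents, embedding_dim) array produced in step 1.

```python
import umap

# document_vectors: assumed (num_documents, embedding_dim) array from step 1
reducer = umap.UMAP(n_neighbors=15, n_components=5, metric='cosine')
umap_embeddings = reducer.fit_transform(document_vectors)  # (num_documents, 5)
```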

155 | 156 | #### 3. Find dense areas of documents using [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan). 157 | >The colored areas are the dense areas of documents. Red points are outliers that do not belong to a specific cluster. 158 | 159 | 160 |

161 | 162 |
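Continuing the sketch, the dense areas can be found by clustering the reduced vectors with HDBSCAN; the parameters shown are illustrative.

```python
import hdbscan

# umap_embeddings: the reduced document vectors from step 2
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, metric='euclidean',
                            cluster_selection_method='eom')
cluster_labels = clusterer.fit_predict(umap_embeddings)

# Label -1 marks outlier documents that belong to no dense area;
# every other label corresponds to one discovered topic.
num_topics = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
```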

163 | 164 | #### 4. For each dense area calculate the centroid of document vectors in the original dimension; this is the topic vector. 165 | >The red points are outlier documents and do not get used for calculating the topic vector. The purple points are the document vectors that belong to a dense area, from which the topic vector is calculated. 166 | 167 | 168 |

169 | 170 |
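In terms of the arrays from the previous steps, a sketch of this centroid calculation looks like the following; note that the mean is taken over the original high-dimensional `document_vectors`, not the UMAP embeddings.

```python
import numpy as np
from sklearn.preprocessing import normalize

topic_vectors = []
for label in sorted(set(cluster_labels)):
    if label == -1:
        continue  # outlier documents are not used for topic vectors
    members = document_vectors[cluster_labels == label]  # original-dimension vectors
    topic_vectors.append(members.mean(axis=0))           # centroid = topic vector

# unit-normalize so inner products behave like cosine similarities
topic_vectors = normalize(np.array(topic_vectors))
```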

171 | 172 | #### 5. Find the n closest word vectors to the resulting topic vector. 173 | >The closest word vectors, in order of proximity, become the topic words. 174 | 175 | 176 |

177 | 178 |
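Finishing the sketch: with unit-normalized vectors, the topic words are simply the word vectors with the highest cosine similarity (inner product) to each topic vector. Here `word_vectors` and `vocab` are assumed to be the word embedding matrix and word list from step 1.

```python
import numpy as np

# word_vectors: (vocab_size, dim) unit-normalized word embeddings; vocab: list of words
similarities = np.inner(topic_vectors, word_vectors)         # (num_topics, vocab_size)
top_indexes = np.flip(np.argsort(similarities, axis=1), axis=1)[:, :50]

topic_words = [[vocab[i] for i in row] for row in top_indexes]
topic_word_scores = np.take_along_axis(similarities, top_indexes, axis=1)
```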

179 | 180 | Installation 181 | ------------ 182 | 183 | The easy way to install Top2Vec is: 184 | 185 |     pip install top2vec 186 | 187 | To install pre-trained universal sentence encoder options: 188 | 189 |     pip install top2vec[sentence_encoders] 190 | 191 | To install pre-trained BERT sentence transformer options: 192 | 193 |     pip install top2vec[sentence_transformers] 194 | 195 | To install indexing options: 196 | 197 |     pip install top2vec[indexing] 198 | 199 | 200 | Usage 201 | ----- 202 | 203 | ```python 204 | 205 | from top2vec import Top2Vec 206 | 207 | model = Top2Vec(documents) 208 | ``` 209 | Important parameters: 210 | 211 |   * ``documents``: Input corpus, should be a list of strings. 212 | 213 |   * ``speed``: This parameter determines how long the model takes to train. 214 |     The 'fast-learn' option is the fastest and will generate the lowest quality 215 |     vectors. The 'learn' option will learn better quality vectors but take a longer 216 |     time to train. The 'deep-learn' option will learn the best quality vectors but 217 |     will take significant time to train. 218 | 219 |   * ``workers``: The number of worker threads used to train the model. A larger 220 |     number will lead to faster training. 221 | 222 | > Trained models can be saved and loaded. 223 | ```python 224 | 225 | model.save("filename") 226 | model = Top2Vec.load("filename") 227 | ``` 228 | 229 | For more information, view the [API guide](https://top2vec.readthedocs.io/en/latest/api.html). 230 | 231 | Pretrained Embedding Models 232 | ----------------- 233 | Doc2Vec will be used by default to generate the joint word and document embeddings. However, there are also pretrained `embedding_model` options for generating joint word and document embeddings: 234 | 235 | * `universal-sentence-encoder` 236 | * `universal-sentence-encoder-multilingual` 237 | * `distiluse-base-multilingual-cased` 238 | 239 | ```python 240 | from top2vec import Top2Vec 241 | 242 | model = Top2Vec(documents, embedding_model='universal-sentence-encoder') 243 | ``` 244 | 245 | For large data sets and data sets with very unique vocabulary, doc2vec could 246 | produce better results. This will train a doc2vec model from scratch. This method 247 | is language agnostic. However, multiple languages will not be aligned. 248 | 249 | Using the universal sentence encoder options will be much faster since those are 250 | pre-trained and efficient models. The universal sentence encoder options are 251 | suggested for smaller data sets. They are also good options for large data sets 252 | that are in English or in languages covered by the multilingual model, as well as 253 | for data sets that are multilingual. 254 | 255 | The distiluse-base-multilingual-cased pre-trained sentence transformer is suggested 256 | for multilingual data sets and for languages that are not covered by the multilingual 257 | universal sentence encoder. The transformer is significantly slower than 258 | the universal sentence encoder options. 259 | 260 | More information on [universal-sentence-encoder](https://tfhub.dev/google/universal-sentence-encoder/4), [universal-sentence-encoder-multilingual](https://tfhub.dev/google/universal-sentence-encoder-multilingual/3), and [distiluse-base-multilingual-cased](https://www.sbert.net/docs/pretrained_models.html).
261 | 262 | 263 | Citation 264 | ----------------- 265 | 266 | If you would like to cite Top2Vec in your work, this is the current reference: 267 | 268 | ```bibtex 269 | @article{angelov2020top2vec, 270 |       title={Top2Vec: Distributed Representations of Topics}, 271 |       author={Dimo Angelov}, 272 |       year={2020}, 273 |       eprint={2008.09470}, 274 |       archivePrefix={arXiv}, 275 |       primaryClass={cs.CL} 276 | } 277 | ``` 278 | 279 | Example 280 | ------- 281 | 282 | ### Train Model 283 | Train a Top2Vec model on the 20newsgroups dataset. 284 | 285 | ```python 286 | 287 | from top2vec import Top2Vec 288 | from sklearn.datasets import fetch_20newsgroups 289 | 290 | newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes')) 291 | 292 | model = Top2Vec(documents=newsgroups.data, speed="learn", workers=8) 293 | 294 | ``` 295 | ### Get Number of Topics 296 | This will return the number of topics that Top2Vec has found in the data. 297 | ```python 298 | 299 | >>> model.get_num_topics() 300 | 77 301 | 302 | ``` 303 | ### Get Topic Sizes 304 | This will return the number of documents most similar to each topic. Topics are 305 | in decreasing order of size. 306 | ```python 307 | topic_sizes, topic_nums = model.get_topic_sizes() 308 | ``` 309 | Returns: 310 | 311 | * ``topic_sizes``: The number of documents most similar to each topic. 312 | 313 | * ``topic_nums``: The unique index of every topic will be returned. 314 | 315 | ### Get Topics 316 | This will return the topics in decreasing order of size. 317 | ```python 318 | topic_words, word_scores, topic_nums = model.get_topics(77) 319 | 320 | ``` 321 | Returns: 322 | 323 | * ``topic_words``: For each topic the top 50 words are returned, in order 324 | of semantic similarity to the topic. 325 | 326 | * ``word_scores``: For each topic the cosine similarity scores of the 327 | top 50 words to the topic are returned. 328 | 329 | * ``topic_nums``: The unique index of every topic will be returned. 330 | 331 | ### Search Topics 332 | We are going to search for topics most similar to **medicine**. 333 | ```python 334 | 335 | topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["medicine"], num_topics=5) 336 | ``` 337 | Returns: 338 | * ``topic_words``: For each topic the top 50 words are returned, in order 339 | of semantic similarity to the topic. 340 | 341 | * ``word_scores``: For each topic the cosine similarity scores of the 342 | top 50 words to the topic are returned. 343 | 344 | * ``topic_scores``: For each topic the cosine similarity to the search keywords will be returned. 345 | 346 | * ``topic_nums``: The unique index of every topic will be returned. 347 | 348 | ```python 349 | 350 | >>> topic_nums 351 | [21, 29, 9, 61, 48] 352 | 353 | >>> topic_scores 354 | [0.4468, 0.381, 0.2779, 0.2566, 0.2515] 355 | ``` 356 | > Topic 21 was the most similar topic to "medicine" with a cosine similarity of 0.4468. (Values range from 0, least similar, to 1, most similar.) 357 | 358 | ### Generate Word Clouds 359 | 360 | Using a topic number you can generate a word cloud. We are going to generate word clouds for the top 5 most similar topics to our **medicine** topic search from above.
361 | ```python 362 | topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["medicine"], num_topics=5) 363 | for topic in topic_nums: 364 | model.generate_topic_wordcloud(topic) 365 | ``` 366 | 371 | 372 | 373 | 374 | 375 | 376 | 377 | 378 | 379 | ### Search Documents by Topic 380 | 381 | We are going to search by **topic 48**, a topic that appears to be about **science**. 382 | ```python 383 | documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=48, num_docs=5) 384 | ``` 385 | Returns: 386 | * ``documents``: The documents in a list, the most similar are first. 387 | 388 | * ``doc_scores``: Semantic similarity of document to topic. The cosine similarity of the 389 | document and topic vector. 390 | 391 | * ``doc_ids``: Unique ids of documents. If ids were not given, the index of document 392 | in the original corpus. 393 | 394 | For each of the returned documents we are going to print its content, score and document number. 395 | ```python 396 | documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=48, num_docs=5) 397 | for doc, score, doc_id in zip(documents, document_scores, document_ids): 398 | print(f"Document: {doc_id}, Score: {score}") 399 | print("-----------") 400 | print(doc) 401 | print("-----------") 402 | print() 403 | ``` 404 | 405 | 406 | Document: 15227, Score: 0.6322 407 | ----------- 408 | Evolution is both fact and theory. The THEORY of evolution represents the 409 | scientific attempt to explain the FACT of evolution. The theory of evolution 410 | does not provide facts; it explains facts. It can be safely assumed that ALL 411 | scientific theories neither provide nor become facts but rather EXPLAIN facts. 412 | I recommend that you do some appropriate reading in general science. A good 413 | starting point with regard to evolution for the layman would be "Evolution as 414 | Fact and Theory" in "Hen's Teeth and Horse's Toes" [pp 253-262] by Stephen Jay 415 | Gould. There is a great deal of other useful information in this publication. 416 | ----------- 417 | 418 | Document: 14515, Score: 0.6186 419 | ----------- 420 | Just what are these "scientific facts"? I have never heard of such a thing. 421 | Science never proves or disproves any theory - history does. 422 | 423 | -Tim 424 | ----------- 425 | 426 | Document: 9433, Score: 0.5997 427 | ----------- 428 | The same way that any theory is proven false. You examine the predicitions 429 | that the theory makes, and try to observe them. If you don't, or if you 430 | observe things that the theory predicts wouldn't happen, then you have some 431 | evidence against the theory. If the theory can't be modified to 432 | incorporate the new observations, then you say that it is false. 433 | 434 | For example, people used to believe that the earth had been created 435 | 10,000 years ago. But, as evidence showed that predictions from this 436 | theory were not true, it was abandoned. 437 | ----------- 438 | 439 | Document: 11917, Score: 0.5845 440 | ----------- 441 | The point about its being real or not is that one does not waste time with 442 | what reality might be when one wants predictions. The questions if the 443 | atoms are there or if something else is there making measurements indicate 444 | atoms is not necessary in such a system. 445 | 446 | And one does not have to write a new theory of existence everytime new 447 | models are used in Physics. 448 | ----------- 449 | 450 | ... 
451 | 452 | ### Semantic Search Documents by Keywords 453 | 454 | Search documents for content semantically similar to **cryptography** and **privacy**. 455 | ```python 456 | documents, document_scores, document_ids = model.search_documents_by_keywords(keywords=["cryptography", "privacy"], num_docs=5) 457 | for doc, score, doc_id in zip(documents, document_scores, document_ids): 458 | print(f"Document: {doc_id}, Score: {score}") 459 | print("-----------") 460 | print(doc) 461 | print("-----------") 462 | print() 463 | ``` 464 | Document: 16837, Score: 0.6112 465 | ----------- 466 | ... 467 | Email and account privacy, anonymity, file encryption, academic 468 | computer policies, relevant legislation and references, EFF, and 469 | other privacy and rights issues associated with use of the Internet 470 | and global networks in general. 471 | ... 472 | 473 | Document: 16254, Score: 0.5722 474 | ----------- 475 | ... 476 | The President today announced a new initiative that will bring 477 | the Federal Government together with industry in a voluntary 478 | program to improve the security and privacy of telephone 479 | communications while meeting the legitimate needs of law 480 | enforcement. 481 | ... 482 | ----------- 483 | ... 484 | 485 | ### Similar Keywords 486 | 487 | Search for similar words to **space**. 488 | ```python 489 | words, word_scores = model.similar_words(keywords=["space"], keywords_neg=[], num_words=20) 490 | for word, score in zip(words, word_scores): 491 | print(f"{word} {score}") 492 | ``` 493 | space 1.0 494 | nasa 0.6589 495 | shuttle 0.5976 496 | exploration 0.5448 497 | planetary 0.5391 498 | missions 0.5069 499 | launch 0.4941 500 | telescope 0.4821 501 | astro 0.4696 502 | jsc 0.4549 503 | ames 0.4515 504 | satellite 0.446 505 | station 0.4445 506 | orbital 0.4438 507 | solar 0.4386 508 | astronomy 0.4378 509 | observatory 0.4355 510 | facility 0.4325 511 | propulsion 0.4251 512 | aerospace 0.4226 513 | -------------------------------------------------------------------------------- /docs/Top2Vec.md: -------------------------------------------------------------------------------- 1 | ../README.md -------------------------------------------------------------------------------- /docs/api.rst: -------------------------------------------------------------------------------- 1 | Top2Vec API Guide 2 | ================= 3 | 4 | .. automodule:: top2vec.top2vec 5 | :members: 6 | -------------------------------------------------------------------------------- /docs/conf.py: -------------------------------------------------------------------------------- 1 | import sphinx_rtd_theme 2 | from recommonmark.parser import CommonMarkParser 3 | 4 | import os 5 | import sys 6 | sys.path.insert(0, os.path.abspath('..')) 7 | 8 | # Configuration file for the Sphinx documentation builder. 9 | # 10 | # This file only contains a selection of the most common options. For a full 11 | # list see the documentation: 12 | # https://www.sphinx-doc.org/en/master/usage/configuration.html 13 | 14 | # -- Path setup -------------------------------------------------------------- 15 | 16 | # If extensions (or modules to document with autodoc) are in another directory, 17 | # add these directories to sys.path here. If the directory is relative to the 18 | # documentation root, use os.path.abspath to make it absolute, like shown here. 
19 | # 20 | # import os 21 | # import sys 22 | # sys.path.insert(0, os.path.abspath('.')) 23 | 24 | 25 | # -- Project information ----------------------------------------------------- 26 | 27 | project = 'Top2Vec' 28 | copyright = '2020, Dimo Angelov' 29 | author = 'Dimo Angelov' 30 | 31 | # The full version, including alpha/beta/rc tags 32 | release = '1.0.36' 33 | 34 | 35 | # -- General configuration --------------------------------------------------- 36 | 37 | # Add any Sphinx extension module names here, as strings. They can be 38 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 39 | # ones. 40 | extensions = ['recommonmark', 'sphinx_rtd_theme', 'sphinx.ext.autodoc', 'sphinx.ext.napoleon'] 41 | 42 | # Add any paths that contain templates here, relative to this directory. 43 | templates_path = ['_templates'] 44 | 45 | # List of patterns, relative to source directory, that match files and 46 | # directories to ignore when looking for source files. 47 | # This pattern also affects html_static_path and html_extra_path. 48 | exclude_patterns = [] 49 | 50 | 51 | # -- Options for HTML output ------------------------------------------------- 52 | 53 | # The theme to use for HTML and HTML Help pages. See the documentation for 54 | # a list of builtin themes. 55 | # 56 | #html_theme = 'alabaster' 57 | 58 | 59 | html_theme = "sphinx_rtd_theme" 60 | #html_theme_path = ["_themes", ] 61 | 62 | # Add any paths that contain custom static files (such as style sheets) here, 63 | # relative to this directory. They are copied after the builtin static files, 64 | # so a file named "default.css" will overwrite the builtin "default.css". 65 | html_static_path = ['_static'] 66 | 67 | master_doc = 'index' 68 | 69 | # source_parsers = { 70 | # '.md': CommonMarkParser, 71 | # } 72 | 73 | #source_suffix = ['.rst', '.md'] 74 | 75 | -------------------------------------------------------------------------------- /docs/index.rst: -------------------------------------------------------------------------------- 1 | .. Top2Vec documentation master file, created by 2 | sphinx-quickstart on Mon Mar 23 19:00:08 2020. 3 | You can adapt this file completely to your liking, but it should at least 4 | contain the root `toctree` directive. 5 | 6 | Welcome to Top2Vec's documentation! 7 | =================================== 8 | 9 | .. toctree:: 10 | :maxdepth: 2 11 | :caption: User Guide / Tutorial: 12 | 13 | Top2Vec 14 | 15 | .. 
toctree:: 16 | :caption: API Reference: 17 | 18 | api 19 | 20 | Indices and tables 21 | ================== 22 | 23 | * :ref:`genindex` 24 | * :ref:`modindex` 25 | * :ref:`search` 26 | -------------------------------------------------------------------------------- /images/hdbscan_docs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ddangelov/Top2Vec/2435731bc834f49aa22b38d46102bc37b960dffc/images/hdbscan_docs.png -------------------------------------------------------------------------------- /images/restful-top2vec.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ddangelov/Top2Vec/2435731bc834f49aa22b38d46102bc37b960dffc/images/restful-top2vec.png -------------------------------------------------------------------------------- /images/top2vec_logo.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /images/topic21.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ddangelov/Top2Vec/2435731bc834f49aa22b38d46102bc37b960dffc/images/topic21.png -------------------------------------------------------------------------------- /images/topic29.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ddangelov/Top2Vec/2435731bc834f49aa22b38d46102bc37b960dffc/images/topic29.png -------------------------------------------------------------------------------- /images/topic48.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ddangelov/Top2Vec/2435731bc834f49aa22b38d46102bc37b960dffc/images/topic48.png -------------------------------------------------------------------------------- /images/topic61.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ddangelov/Top2Vec/2435731bc834f49aa22b38d46102bc37b960dffc/images/topic61.png -------------------------------------------------------------------------------- /images/topic9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ddangelov/Top2Vec/2435731bc834f49aa22b38d46102bc37b960dffc/images/topic9.png -------------------------------------------------------------------------------- /images/topic_vector.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /images/umap_docs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ddangelov/Top2Vec/2435731bc834f49aa22b38d46102bc37b960dffc/images/umap_docs.png -------------------------------------------------------------------------------- /notebooks/CORD-19_top2vec.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"metadata":{},"cell_type":"markdown","source":"# COVID-19: Topic Modelling and Search with Top2Vec\n\n[Top2Vec](https://github.com/ddangelov/Top2Vec) is an algorithm for **topic modelling** and **semantic search**. It **automatically** detects topics present in text and generates jointly embedded topic, document and word vectors. 
Once you train the Top2Vec model you can:\n* Get number of detected topics.\n* Get topics.\n* Search topics by keywords.\n* Search documents by topic.\n* Find similar words.\n* Find similar documents.\n\nThis notebook preprocesses the [Kaggle COVID-19 Dataset](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge), it treats each section of every paper as a distinct document. A Top2Vec model is trained on those documents. \n\nOnce the model is trained you can do **semantic** search for documents by topic, searching for documents with keywords, searching for topics with keywords, and for finding similar words. These methods all leverage the joint topic, document, word embeddings distances, which represent semantic similarity. \n\n### For an interactive version of this notebook with search widgets check out my [github](https://github.com/ddangelov/Top2Vec/blob/master/notebooks/CORD-19_top2vec.ipynb) or my [kaggle](https://www.kaggle.com/dangelov/covid-19-top2vec-interactive-search)!\n\n"},{"metadata":{},"cell_type":"markdown","source":"# Import and Setup "},{"metadata":{},"cell_type":"markdown","source":"### 1. Install the [Top2Vec](https://github.com/ddangelov/Top2Vec) library"},{"metadata":{"trusted":true,"_kg_hide-output":true},"cell_type":"code","source":"!pip install top2vec","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### 2. Import Libraries"},{"metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"cell_type":"code","source":"import numpy as np \nimport pandas as pd \nimport json\nimport os\nimport ipywidgets as widgets\nfrom IPython.display import clear_output, display\nfrom top2vec import Top2Vec","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Pre-process Data"},{"metadata":{},"cell_type":"markdown","source":"### 1. Import Metadata"},{"metadata":{"trusted":true},"cell_type":"code","source":"metadata_df = pd.read_csv(\"../input/CORD-19-research-challenge/metadata.csv\")\nmetadata_df.head()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### 2. Pre-process Papers\n\nA document will be created for each section of every paper. This document will contain the id, title, abstract, and setion of the paper. 
It will also contain the text of that section."},{"metadata":{"trusted":true},"cell_type":"code","source":"def preproccess_papers():\n\n dataset_dir = \"../input/CORD-19-research-challenge/\"\n comm_dir = dataset_dir+\"comm_use_subset/comm_use_subset/\"\n noncomm_dir = dataset_dir+\"noncomm_use_subset/noncomm_use_subset/\"\n custom_dir = dataset_dir+\"custom_license/custom_license/\"\n biorxiv_dir = dataset_dir+\"biorxiv_medrxiv/biorxiv_medrxiv/\"\n directories_to_process = [comm_dir,noncomm_dir, custom_dir, biorxiv_dir]\n\n papers_with_text = list(metadata_df[metadata_df.has_full_text==True].sha)\n\n paper_ids = []\n titles = []\n abstracts = []\n sections = []\n body_texts = []\n\n for directory in directories_to_process:\n\n filenames = os.listdir(directory)\n\n for filename in filenames:\n\n file = json.load(open(directory+filename, 'rb'))\n\n #check if file contains text\n if file[\"paper_id\"] in papers_with_text:\n\n section = []\n text = []\n\n for bod in file[\"body_text\"]:\n section.append(bod[\"section\"])\n text.append(bod[\"text\"])\n\n res_df = pd.DataFrame({\"section\":section, \"text\":text}).groupby(\"section\")[\"text\"].apply(' '.join).reset_index()\n\n for index, row in res_df.iterrows():\n\n # metadata\n paper_ids.append(file[\"paper_id\"])\n\n if(len(file[\"abstract\"])):\n abstracts.append(file[\"abstract\"][0][\"text\"])\n else:\n abstracts.append(\"\")\n\n titles.append(file[\"metadata\"][\"title\"])\n\n # add section and text\n sections.append(row.section)\n body_texts.append(row.text)\n\n return pd.DataFrame({\"id\":paper_ids, \"title\": titles, \"abstract\": abstracts, \"section\": sections, \"text\": body_texts})","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# papers_df = preproccess_papers()\n# papers_df.head()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### 3. Filter Short Sections"},{"metadata":{"trusted":true},"cell_type":"code","source":"def filter_short(papers_df):\n papers_df[\"token_counts\"] = papers_df[\"text\"].str.split().map(len)\n papers_df = papers_df[papers_df.token_counts>200].reset_index(drop=True)\n papers_df.drop('token_counts', axis=1, inplace=True)\n \n return papers_df\n ","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# papers_df = filter_short(papers_df)\n# papers_df.head()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Train Top2Vec Model\nParameters:\n * ``documents``: Input corpus, should be a list of strings.\n \n * ``speed``: This parameter will determine how fast the model takes to train. \n The 'fast-learn' option is the fastest and will generate the lowest quality\n vectors. The 'learn' option will learn better quality vectors but take a longer\n time to train. The 'deep-learn' option will learn the best quality vectors but \n will take significant time to train. \n \n * ``workers``: The amount of worker threads to be used in training the model. Larger\n amount will lead to faster training.\n \nSee [Documentation](https://top2vec.readthedocs.io/en/latest/README.html)."},{"metadata":{"trusted":true},"cell_type":"code","source":"# top2vec = Top2Vec(documents=papers_df.text, speed=\"learn\", workers=4)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## (Recommended) Load Pre-trained Model and Pre-processed Data :)\n\nThe Top2Vec model was trained with the 'deep-learn' speed parameter and took very long to train. 
It will give much better results than training with 'fast-learn' or 'learn'.\n\nData is available on my [kaggle](https://www.kaggle.com/dangelov/covid19top2vec)."},{"metadata":{},"cell_type":"markdown","source":"### 1. Load pre-trained Top2Vec model "},{"metadata":{"trusted":true},"cell_type":"code","source":"top2vec = Top2Vec.load(\"../input/covid19top2vec/covid19_deep_learn_top2vec\")","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### 2. Load pre-processed papers"},{"metadata":{"trusted":true},"cell_type":"code","source":"papers_df = pd.read_feather(\"../input/covid19top2vec/covid19_papers_processed.feather\")","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# Use Top2Vec for Semantic Search"},{"metadata":{},"cell_type":"markdown","source":"## 1. Search Topics "},{"metadata":{"trusted":true},"cell_type":"code","source":"keywords_select_st = widgets.Label('Enter keywords seperated by space: ')\ndisplay(keywords_select_st)\n\nkeywords_input_st = widgets.Text()\ndisplay(keywords_input_st)\n\nkeywords_neg_select_st = widgets.Label('Enter negative keywords seperated by space: ')\ndisplay(keywords_neg_select_st)\n\nkeywords_neg_input_st = widgets.Text()\ndisplay(keywords_neg_input_st)\n\ndoc_num_select_st = widgets.Label('Choose number of topics: ')\ndisplay(doc_num_select_st)\n\ndoc_num_input_st = widgets.Text(value='5')\ndisplay(doc_num_input_st)\n\ndef display_similar_topics(*args):\n \n clear_output()\n display(keywords_select_st)\n display(keywords_input_st)\n display(keywords_neg_select_st)\n display(keywords_neg_input_st)\n display(doc_num_select_st)\n display(doc_num_input_st)\n display(keyword_btn_st)\n \n try:\n topic_words, word_scores, topic_scores, topic_nums = top2vec.search_topics(keywords=keywords_input_st.value.split(),num_topics=int(doc_num_input_st.value), keywords_neg=keywords_neg_input_st.value.split())\n for topic in topic_nums:\n top2vec.generate_topic_wordcloud(topic, background_color=\"black\")\n \n except Exception as e:\n print(e)\n \nkeyword_btn_st = widgets.Button(description=\"show topics\")\ndisplay(keyword_btn_st)\nkeyword_btn_st.on_click(display_similar_topics)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## 2. 
Search Papers by Topic"},{"metadata":{"trusted":true},"cell_type":"code","source":"topic_num_select = widgets.Label('Select topic number: ')\ndisplay(topic_num_select)\n\ntopic_input = widgets.Text()\ndisplay(topic_input)\n\ndoc_num_select = widgets.Label('Choose number of documents: ')\ndisplay(doc_num_select)\n\ndoc_num_input = widgets.Text(value='10')\ndisplay(doc_num_input)\n\ndef display_topics(*args):\n \n clear_output()\n display(topic_num_select)\n display(topic_input)\n display(doc_num_select)\n display(doc_num_input)\n display(topic_btn)\n\n documents, document_scores, document_nums = top2vec.search_documents_by_topic(topic_num=int(topic_input.value), num_docs=int(doc_num_input.value))\n \n result_df = papers_df.loc[document_nums]\n result_df[\"document_scores\"] = document_scores\n \n for index,row in result_df.iterrows():\n print(f\"Document: {index}, Score: {row.document_scores}\")\n print(f\"Section: {row.section}\")\n print(f\"Title: {row.title}\")\n print(\"-----------\")\n print(row.text)\n print(\"-----------\")\n print()\n\ntopic_btn = widgets.Button(description=\"show documents\")\ndisplay(topic_btn)\ntopic_btn.on_click(display_topics)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## 3. Search Papers by Keywords"},{"metadata":{"trusted":true},"cell_type":"code","source":"keywords_select_kw = widgets.Label('Enter keywords seperated by space: ')\ndisplay(keywords_select_kw)\n\nkeywords_input_kw = widgets.Text()\ndisplay(keywords_input_kw)\n\nkeywords_neg_select_kw = widgets.Label('Enter negative keywords seperated by space: ')\ndisplay(keywords_neg_select_kw)\n\nkeywords_neg_input_kw = widgets.Text()\ndisplay(keywords_neg_input_kw)\n\ndoc_num_select_kw = widgets.Label('Choose number of documents: ')\ndisplay(doc_num_select_kw)\n\ndoc_num_input_kw = widgets.Text(value='10')\ndisplay(doc_num_input_kw)\n\ndef display_keywords(*args):\n \n clear_output()\n display(keywords_select_kw)\n display(keywords_input_kw)\n display(keywords_neg_select_kw)\n display(keywords_neg_input_kw)\n display(doc_num_select_kw)\n display(doc_num_input_kw)\n display(keyword_btn_kw)\n \n try:\n documents, document_scores, document_nums = top2vec.search_documents_by_keyword(keywords=keywords_input_kw.value.split(), num_docs=int(doc_num_input_kw.value), keywords_neg=keywords_neg_input_kw.value.split())\n result_df = papers_df.loc[document_nums]\n result_df[\"document_scores\"] = document_scores\n\n for index,row in result_df.iterrows():\n print(f\"Document: {index}, Score: {row.document_scores}\")\n print(f\"Section: {row.section}\")\n print(f\"Title: {row.title}\")\n print(\"-----------\")\n print(row.text)\n print(\"-----------\")\n print()\n \n except Exception as e:\n print(e)\n \n\nkeyword_btn_kw = widgets.Button(description=\"show documents\")\ndisplay(keyword_btn_kw)\nkeyword_btn_kw.on_click(display_keywords)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## 4. 
Find Similar Words"},{"metadata":{"trusted":true},"cell_type":"code","source":"keywords_select_sw = widgets.Label('Enter keywords seperated by space: ')\ndisplay(keywords_select_sw)\n\nkeywords_input_sw = widgets.Text()\ndisplay(keywords_input_sw)\n\nkeywords_neg_select_sw = widgets.Label('Enter negative keywords seperated by space: ')\ndisplay(keywords_neg_select_sw)\n\nkeywords_neg_input_sw = widgets.Text()\ndisplay(keywords_neg_input_sw)\n\n\ndoc_num_select_sw = widgets.Label('Choose number of words: ')\ndisplay(doc_num_select_sw)\n\ndoc_num_input_sw = widgets.Text(value='20')\ndisplay(doc_num_input_sw)\n\ndef display_similar_words(*args):\n \n clear_output()\n display(keywords_select_sw)\n display(keywords_input_sw)\n display(keywords_neg_select_sw)\n display(keywords_neg_input_sw)\n display(doc_num_select_sw)\n display(doc_num_input_sw)\n display(sim_word_btn_sw)\n \n try: \n words, word_scores = top2vec.similar_words(keywords=keywords_input_sw.value.split(), keywords_neg=keywords_neg_input_sw.value.split(), num_words=int(doc_num_input_sw.value))\n for word, score in zip(words, word_scores):\n print(f\"{word} {score}\")\n \n except Exception as e:\n print(e)\n \nsim_word_btn_sw = widgets.Button(description=\"show similar words\")\ndisplay(sim_word_btn_sw)\nsim_word_btn_sw.on_click(display_similar_words)","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"}},"nbformat":4,"nbformat_minor":4} -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy>=1.20.0 2 | scikit-learn>=1.2.0 3 | pandas 4 | gensim>=4.0.0 5 | umap-learn>=0.5.1 6 | hdbscan>=0.8.27 7 | wordcloud 8 | tensorflow 9 | tensorflow_hub 10 | tensorflow_text 11 | torch 12 | sentence_transformers 13 | hnswlib 14 | transformers 15 | tqdm 16 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import setuptools 2 | 3 | with open("README.md", "r") as fh: 4 | long_description = fh.read() 5 | 6 | setuptools.setup( 7 | name="top2vec", 8 | packages=["top2vec"], 9 | version="1.0.36", 10 | author="Dimo Angelov", 11 | author_email="dimo.angelov@gmail.com", 12 | description="Top2Vec learns jointly embedded topic, document and word vectors.", 13 | long_description=long_description, 14 | long_description_content_type="text/markdown", 15 | url="https://github.com/ddangelov/Top2Vec", 16 | keywords="topic modeling semantic search word document embedding", 17 | license="BSD", 18 | classifiers=[ 19 | "Development Status :: 3 - Alpha", 20 | "Programming Language :: Python :: 3", 21 | "Intended Audience :: Science/Research", 22 | "Intended Audience :: Developers", 23 | "Topic :: Scientific/Engineering :: Artificial Intelligence", 24 | "Topic :: Scientific/Engineering :: Information Analysis", 25 | "License :: OSI Approved :: BSD License", 26 | "Operating System :: OS Independent", 27 | ], 28 | install_requires=[ 29 | 'numpy >= 1.20.0', 30 | 'pandas', 31 | 'scikit-learn >= 1.2.0', 32 | 'gensim >= 4.0.0', 33 | 'umap-learn >= 0.5.1', 34 | 'hdbscan >= 0.8.27', 35 | 'wordcloud', 36 | 'transformers', 37 | 'tqdm' 38 | ], 
39 | extras_require={ 40 | 'sentence_encoders': [ 41 | 'tensorflow', 42 | 'tensorflow_hub', 43 | 'tensorflow_text', 44 | ], 45 | 'sentence_transformers': [ 46 | 'torch', 47 | 'sentence_transformers', 48 | ], 49 | 'indexing': [ 50 | 'hnswlib', 51 | ], 52 | }, 53 | python_requires='>=3.10', 54 | ) 55 | -------------------------------------------------------------------------------- /top2vec/__init__.py: -------------------------------------------------------------------------------- 1 | from top2vec.top2vec import Top2Vec 2 | 3 | __version__ = '1.0.36' 4 | -------------------------------------------------------------------------------- /top2vec/embedding.py: -------------------------------------------------------------------------------- 1 | from transformers import AutoTokenizer, AutoModel 2 | import torch 3 | from torch.utils.data import DataLoader 4 | from tqdm import tqdm 5 | import numpy as np 6 | from sklearn.preprocessing import normalize 7 | 8 | 9 | def average_embeddings(documents, 10 | batch_size=32, 11 | model_max_length=512, 12 | embedding_model='sentence-transformers/all-MiniLM-L6-v2'): 13 | tokenizer = AutoTokenizer.from_pretrained(embedding_model) 14 | model = AutoModel.from_pretrained(embedding_model) 15 | 16 | device = ( 17 | "mps" if torch.backends.mps.is_available() 18 | else "cuda" if torch.cuda.is_available() 19 | else "cpu" 20 | ) 21 | device = torch.device(device) 22 | 23 | data_loader = DataLoader(documents, batch_size=batch_size, shuffle=False) 24 | 25 | model.eval() 26 | model.to(device) 27 | 28 | average_embeddings = [] 29 | 30 | with torch.no_grad(): 31 | for batch in tqdm(data_loader, desc="Embedding vocabulary"): 32 | # Tokenize the batch with padding and truncation 33 | batch_inputs = tokenizer( 34 | batch, 35 | padding="max_length", 36 | max_length=model_max_length, 37 | truncation=True, 38 | return_tensors="pt" 39 | ) 40 | 41 | batch_inputs = {k: v.to(device) for k, v in batch_inputs.items()} 42 | last_hidden_state = model(**batch_inputs).last_hidden_state 43 | avg_embedding = last_hidden_state.mean(dim=1) 44 | average_embeddings.append(avg_embedding.cpu().numpy()) 45 | 46 | document_vectors = normalize(np.vstack(average_embeddings)) 47 | 48 | return document_vectors 49 | 50 | 51 | def contextual_token_embeddings(documents, 52 | batch_size=32, 53 | model_max_length=512, 54 | embedding_model='sentence-transformers/all-MiniLM-L6-v2'): 55 | tokenizer = AutoTokenizer.from_pretrained(embedding_model) 56 | model = AutoModel.from_pretrained(embedding_model) 57 | 58 | device = ( 59 | "mps" if torch.backends.mps.is_available() 60 | else "cuda" if torch.cuda.is_available() 61 | else "cpu" 62 | ) 63 | device = torch.device(device) 64 | 65 | # DataLoader to process the documents in batches 66 | data_loader = DataLoader(documents, batch_size=batch_size, shuffle=False) 67 | 68 | model.eval() 69 | model.to(device) 70 | 71 | last_hidden_states = [] 72 | all_attention_masks = [] 73 | all_tokens = [] 74 | 75 | # Embed documents batch-wise 76 | with torch.no_grad(): 77 | for batch in tqdm(data_loader, desc="Embedding documents"): 78 | # Tokenize the batch with padding and truncation 79 | batch_inputs = tokenizer( 80 | batch, 81 | padding="max_length", 82 | max_length=model_max_length, 83 | truncation=True, 84 | return_tensors="pt" 85 | ) 86 | all_attention_masks.extend(batch_inputs['attention_mask']) 87 | all_tokens.extend(batch_inputs['input_ids']) 88 | batch_inputs = {k: v.to(device) for k, v in batch_inputs.items()} 89 | last_hidden_state = 
model(**batch_inputs).last_hidden_state 90 | last_hidden_states.append(last_hidden_state.cpu()) 91 | 92 | # Concatenate the embeddings from all batches 93 | all_hidden_states = torch.cat(last_hidden_states, dim=0) 94 | 95 | document_token_embeddings = [] 96 | document_tokens = [] 97 | document_labels = [] 98 | 99 | for ind, (hidden_state, attention_mask, tokens) in enumerate( 100 | zip(all_hidden_states, all_attention_masks, all_tokens)): 101 | embeddings = hidden_state[attention_mask.nonzero(as_tuple=True)] 102 | tokens = tokens[attention_mask.nonzero(as_tuple=True)] 103 | tokens = [tokenizer.decode(token) for token in tokens] 104 | 105 | document_token_embeddings.append(embeddings.detach().numpy()) 106 | document_tokens.append(tokens) 107 | document_labels.extend([ind] * len(tokens)) 108 | 109 | return document_token_embeddings, document_tokens, document_labels 110 | 111 | 112 | def sliding_window_average(document_token_embeddings, document_tokens, window_size, stride): 113 | # Store the averaged embeddings 114 | averaged_embeddings = [] 115 | chunk_tokens = [] 116 | 117 | # Iterate over each document 118 | for doc, tokens in tqdm(zip(document_token_embeddings, document_tokens)): 119 | doc_averages = [] 120 | 121 | # Slide the window over the document with the specified stride 122 | for i in range(0, len(doc), stride): 123 | 124 | start = i 125 | end = i + window_size 126 | 127 | if start != 0 and end > len(doc): 128 | start = len(doc) - window_size 129 | end = len(doc) 130 | 131 | window = doc[start:end] 132 | 133 | # Calculate the average embedding for the current window 134 | window_average = np.mean(window, axis=0) 135 | 136 | doc_averages.append(window_average) 137 | chunk_tokens.append(" ".join(tokens[start:end])) 138 | 139 | averaged_embeddings.append(doc_averages) 140 | 141 | averaged_embeddings = np.vstack(averaged_embeddings) 142 | averaged_embeddings = normalize(averaged_embeddings) 143 | 144 | return averaged_embeddings, chunk_tokens 145 | 146 | 147 | def average_adjacent_tokens(token_embeddings, window_size): 148 | num_tokens, embedding_size = token_embeddings.shape 149 | averaged_embeddings = np.zeros_like(token_embeddings) 150 | 151 | token_embeddings = normalize(token_embeddings) 152 | 153 | # Define the range to consider based on window_size 154 | for i in range(num_tokens): 155 | start_idx = max(0, i - window_size) 156 | end_idx = min(num_tokens, i + window_size + 1) 157 | 158 | # Compute the average for the current token within the specified window 159 | averaged_embeddings[i] = np.mean(token_embeddings[start_idx:end_idx], axis=0) 160 | 161 | return averaged_embeddings 162 | 163 | 164 | def smooth_document_token_embeddings(document_token_embeddings, window_size=2): 165 | smoothed_document_embeddings = [] 166 | 167 | for doc in tqdm(document_token_embeddings, desc="Smoothing document token embeddings"): 168 | smoothed_doc = average_adjacent_tokens(doc, window_size=window_size) 169 | smoothed_document_embeddings.append(smoothed_doc) 170 | 171 | return smoothed_document_embeddings 172 | -------------------------------------------------------------------------------- /top2vec/tests/test_top2vec.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | from top2vec.top2vec import Top2Vec 3 | from sklearn.datasets import fetch_20newsgroups 4 | import numpy as np 5 | import tempfile 6 | import tensorflow_hub as hub 7 | 8 | # get 20 newsgroups data 9 | newsgroups_train = fetch_20newsgroups(subset='all', remove=('headers', 
'footers', 'quotes')) 10 | newsgroups_documents = newsgroups_train.data[0:2000] 11 | 12 | # train top2vec model without doc_ids provided 13 | top2vec = Top2Vec(documents=newsgroups_documents, speed="fast-learn", workers=8) 14 | 15 | # train top2vec model with doc_ids provided 16 | doc_ids = [str(num) for num in range(0, len(newsgroups_documents))] 17 | top2vec_docids = Top2Vec(documents=newsgroups_documents, document_ids=doc_ids, speed="fast-learn", workers=8) 18 | 19 | # train top2vec model without saving documents 20 | top2vec_no_docs = Top2Vec(documents=newsgroups_documents, keep_documents=False, speed="fast-learn", workers=8) 21 | 22 | # train top2vec model with corpus_file 23 | top2vec_corpus_file = Top2Vec(documents=newsgroups_documents, use_corpus_file=True, speed="fast-learn", workers=8) 24 | 25 | # test USE 26 | top2vec_use = Top2Vec(documents=newsgroups_documents, embedding_model='universal-sentence-encoder') 27 | 28 | # test USE with model embedding 29 | top2vec_use_model_embedding = Top2Vec(documents=newsgroups_documents, 30 | embedding_model='universal-sentence-encoder', 31 | use_embedding_model_tokenizer=True) 32 | 33 | # test USE-multilang 34 | top2vec_use_multilang = Top2Vec(documents=newsgroups_documents, 35 | embedding_model='universal-sentence-encoder-multilingual') 36 | 37 | # test Sentence Transformer-multilang 38 | top2vec_transformer_multilang = Top2Vec(documents=newsgroups_documents, 39 | embedding_model='distiluse-base-multilingual-cased') 40 | 41 | # test Sentence Transformer with model embedding 42 | top2vec_transformer_model_embedding = Top2Vec(documents=newsgroups_documents, 43 | embedding_model='distiluse-base-multilingual-cased', 44 | use_embedding_model_tokenizer=True) 45 | 46 | top2vec_transformer_use_large = Top2Vec(documents=newsgroups_documents, 47 | embedding_model='universal-sentence-encoder-large', 48 | use_embedding_model_tokenizer=True, 49 | split_documents=True) 50 | 51 | top2vec_transformer_use_multilang_large = Top2Vec(documents=newsgroups_documents, 52 | embedding_model='universal-sentence-encoder-multilingual-large', 53 | use_embedding_model_tokenizer=True, 54 | split_documents=True, 55 | document_chunker='random') 56 | 57 | top2vec_transformer_sbert_l6 = Top2Vec(documents=newsgroups_documents, 58 | embedding_model='all-MiniLM-L6-v2', 59 | use_embedding_model_tokenizer=True, 60 | split_documents=True, 61 | document_chunker='sequential') 62 | 63 | top2vec_transformer_sbert_l12 = Top2Vec(documents=newsgroups_documents, 64 | embedding_model='paraphrase-multilingual-MiniLM-L12-v2', 65 | use_embedding_model_tokenizer=True, 66 | split_documents=True, 67 | document_chunker='random' 68 | ) 69 | 70 | model = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4') 71 | top2vec_model_callable = Top2Vec(documents=newsgroups_documents, 72 | embedding_model=model, 73 | use_embedding_model_tokenizer=True, 74 | split_documents=True, 75 | document_chunker='random' 76 | ) 77 | 78 | top2vec_ngrams = Top2Vec(documents=newsgroups_documents, 79 | speed="fast-learn", 80 | ngram_vocab=True, 81 | workers=8) 82 | 83 | top2vec_use_ngrams = Top2Vec(documents=newsgroups_documents, 84 | embedding_model='universal-sentence-encoder', 85 | ngram_vocab=True) 86 | 87 | models = [top2vec, top2vec_docids, top2vec_no_docs, top2vec_corpus_file, 88 | top2vec_use, top2vec_use_multilang, top2vec_transformer_multilang, 89 | top2vec_use_model_embedding, top2vec_transformer_model_embedding, 90 | top2vec_transformer_use_large, 91 | top2vec_transformer_use_multilang_large, 92 | 
top2vec_transformer_sbert_l6, 93 | top2vec_transformer_sbert_l12, 94 | top2vec_model_callable, 95 | top2vec_ngrams, 96 | top2vec_use_ngrams] 97 | 98 | 99 | @pytest.mark.parametrize('top2vec_model', models) 100 | def test_add_documents_original(top2vec_model): 101 | num_docs = top2vec_model.document_vectors.shape[0] 102 | 103 | docs_to_add = newsgroups_train.data[0:100] 104 | 105 | topic_count_sum = sum(top2vec_model.get_topic_sizes()[0]) 106 | 107 | if top2vec_model.document_ids_provided is False: 108 | top2vec_model.add_documents(docs_to_add) 109 | else: 110 | doc_ids_new = [str(num) for num in range(2000, 2000 + len(docs_to_add))] 111 | top2vec_model.add_documents(docs_to_add, doc_ids_new) 112 | 113 | topic_count_sum_new = sum(top2vec_model.get_topic_sizes()[0]) 114 | num_docs_new = top2vec_model.document_vectors.shape[0] 115 | 116 | assert topic_count_sum + len(docs_to_add) == topic_count_sum_new == num_docs + len(docs_to_add) \ 117 | == num_docs_new == len(top2vec_model.doc_top) 118 | 119 | if top2vec_model.documents is not None: 120 | assert num_docs_new == len(top2vec_model.documents) 121 | 122 | 123 | @pytest.mark.parametrize('top2vec_model', models) 124 | def test_compute_topics(top2vec_model): 125 | top2vec_model.compute_topics() 126 | 127 | num_topics = top2vec_model.get_num_topics() 128 | words, word_scores, topic_nums = top2vec_model.get_topics() 129 | 130 | # check that for each topic there are words, word_scores and topic_nums 131 | assert len(words) == len(word_scores) == len(topic_nums) == num_topics 132 | 133 | # check that for each word there is a score 134 | assert len(words[0]) == len(word_scores[0]) 135 | 136 | # check that topics words are returned in decreasing order 137 | topic_words_scores = word_scores[0] 138 | assert all(topic_words_scores[i] >= topic_words_scores[i + 1] for i in range(len(topic_words_scores) - 1)) 139 | 140 | 141 | @pytest.mark.parametrize('top2vec_model', models) 142 | def test_hierarchical_topic_reduction(top2vec_model): 143 | num_topics = top2vec_model.get_num_topics() 144 | 145 | if num_topics > 10: 146 | reduced_num = 10 147 | elif num_topics - 1 > 0: 148 | reduced_num = num_topics - 1 149 | 150 | hierarchy = top2vec_model.hierarchical_topic_reduction(reduced_num) 151 | 152 | assert len(hierarchy) == reduced_num == len(top2vec_model.topic_vectors_reduced) 153 | 154 | 155 | @pytest.mark.parametrize('top2vec_model', models) 156 | def test_add_documents_post_reduce(top2vec_model): 157 | docs_to_add = newsgroups_train.data[500:600] 158 | 159 | num_docs = top2vec_model.document_vectors.shape[0] 160 | topic_count_sum = sum(top2vec_model.get_topic_sizes()[0]) 161 | topic_count_reduced_sum = sum(top2vec_model.get_topic_sizes(reduced=True)[0]) 162 | 163 | if top2vec_model.document_ids_provided is False: 164 | top2vec_model.add_documents(docs_to_add) 165 | else: 166 | doc_ids_new = [str(num) for num in range(2100, 2100 + len(docs_to_add))] 167 | top2vec_model.add_documents(docs_to_add, doc_ids_new) 168 | 169 | topic_count_sum_new = sum(top2vec_model.get_topic_sizes()[0]) 170 | topic_count_reduced_sum_new = sum(top2vec_model.get_topic_sizes(reduced=True)[0]) 171 | 172 | num_docs_new = top2vec_model.document_vectors.shape[0] 173 | 174 | assert topic_count_sum + len(docs_to_add) == topic_count_sum_new == topic_count_reduced_sum + len(docs_to_add) \ 175 | == topic_count_reduced_sum_new == num_docs + len(docs_to_add) == num_docs_new == len(top2vec_model.doc_top) \ 176 | == len(top2vec_model.doc_top_reduced) 177 | 178 | if top2vec_model.documents is not 
None: 179 | assert num_docs_new == len(top2vec_model.documents) 180 | 181 | 182 | @pytest.mark.parametrize('top2vec_model', models) 183 | def test_delete_documents(top2vec_model): 184 | doc_ids_to_delete = list(range(500, 550)) 185 | 186 | num_docs = top2vec_model.document_vectors.shape[0] 187 | topic_count_sum = sum(top2vec_model.get_topic_sizes()[0]) 188 | topic_count_reduced_sum = sum(top2vec_model.get_topic_sizes(reduced=True)[0]) 189 | 190 | if top2vec_model.document_ids_provided is False: 191 | top2vec_model.delete_documents(doc_ids=doc_ids_to_delete) 192 | else: 193 | doc_ids_to_delete = [str(doc_id) for doc_id in doc_ids_to_delete] 194 | top2vec_model.delete_documents(doc_ids=doc_ids_to_delete) 195 | 196 | topic_count_sum_new = sum(top2vec_model.get_topic_sizes()[0]) 197 | topic_count_reduced_sum_new = sum(top2vec_model.get_topic_sizes(reduced=True)[0]) 198 | num_docs_new = top2vec_model.document_vectors.shape[0] 199 | 200 | assert topic_count_sum - len(doc_ids_to_delete) == topic_count_sum_new == topic_count_reduced_sum - \ 201 | len(doc_ids_to_delete) == topic_count_reduced_sum_new == num_docs - len(doc_ids_to_delete) \ 202 | == num_docs_new == len(top2vec_model.doc_top) == len(top2vec_model.doc_top_reduced) 203 | 204 | if top2vec_model.documents is not None: 205 | assert num_docs_new == len(top2vec_model.documents) 206 | 207 | 208 | @pytest.mark.parametrize('top2vec_model', models) 209 | def test_get_topic_hierarchy(top2vec_model): 210 | hierarchy = top2vec_model.get_topic_hierarchy() 211 | 212 | assert len(hierarchy) == len(top2vec_model.topic_vectors_reduced) 213 | 214 | 215 | @pytest.mark.parametrize('top2vec_model', models) 216 | @pytest.mark.parametrize('reduced', [False, True]) 217 | def test_get_num_topics(top2vec_model, reduced): 218 | # check that there are more than 0 topics 219 | assert top2vec_model.get_num_topics(reduced=reduced) > 0 220 | 221 | 222 | @pytest.mark.parametrize('top2vec_model', models) 223 | @pytest.mark.parametrize('reduced', [False, True]) 224 | def test_get_topics(top2vec_model, reduced): 225 | num_topics = top2vec_model.get_num_topics(reduced=reduced) 226 | words, word_scores, topic_nums = top2vec_model.get_topics(reduced=reduced) 227 | 228 | # check that for each topic there are words, word_scores and topic_nums 229 | assert len(words) == len(word_scores) == len(topic_nums) == num_topics 230 | 231 | # check that for each word there is a score 232 | assert len(words[0]) == len(word_scores[0]) 233 | 234 | # check that topics words are returned in decreasing order 235 | topic_words_scores = word_scores[0] 236 | assert all(topic_words_scores[i] >= topic_words_scores[i + 1] for i in range(len(topic_words_scores) - 1)) 237 | 238 | 239 | @pytest.mark.parametrize('top2vec_model', models) 240 | @pytest.mark.parametrize('reduced', [False, True]) 241 | def test_get_topic_size(top2vec_model, reduced): 242 | topic_sizes, topic_nums = top2vec_model.get_topic_sizes(reduced=reduced) 243 | 244 | # check that topic sizes add up to number of documents 245 | assert sum(topic_sizes) == top2vec_model.document_vectors.shape[0] 246 | 247 | # check that topics are ordered decreasingly 248 | assert all(topic_sizes[i] >= topic_sizes[i + 1] for i in range(len(topic_sizes) - 1)) 249 | 250 | 251 | @pytest.mark.parametrize('top2vec_model', models) 252 | @pytest.mark.parametrize('reduced', [False, True]) 253 | def test_generate_topic_wordcloud(top2vec_model, reduced): 254 | # generate word cloud 255 | num_topics = top2vec_model.get_num_topics(reduced=reduced) 256 | 
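    # generate_topic_wordcloud renders a word cloud (a matplotlib figure) for the given
    # topic number and is not expected to return a value, so the call below only checks
    # that drawing the last topic (num_topics - 1) completes without raising.
    # A minimal standalone sketch, assuming an already trained model bound to the
    # placeholder name `model`:
    #     model.generate_topic_wordcloud(0)                # word cloud for topic 0
    #     model.generate_topic_wordcloud(0, reduced=True)  # assumes hierarchical_topic_reduction() was called first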
top2vec_model.generate_topic_wordcloud(num_topics - 1, reduced=reduced) 257 | 258 | 259 | @pytest.mark.parametrize('top2vec_model', models) 260 | @pytest.mark.parametrize('reduced', [False, True]) 261 | def test_search_documents_by_topic(top2vec_model, reduced): 262 | # get topic sizes 263 | topic_sizes, topic_nums = top2vec_model.get_topic_sizes(reduced=reduced) 264 | topic = topic_nums[0] 265 | num_docs = topic_sizes[0] 266 | 267 | # search documents by topic 268 | if top2vec_model.documents is not None: 269 | documents, document_scores, document_ids = top2vec_model.search_documents_by_topic(topic, num_docs, 270 | reduced=reduced) 271 | else: 272 | document_scores, document_ids = top2vec_model.search_documents_by_topic(topic, num_docs, reduced=reduced) 273 | 274 | # check that for each document there is a score and number 275 | if top2vec_model.documents is not None: 276 | assert len(documents) == len(document_scores) == len(document_ids) == num_docs 277 | else: 278 | assert len(document_scores) == len(document_ids) == num_docs 279 | 280 | # check that documents are returned in decreasing order 281 | assert all(document_scores[i] >= document_scores[i + 1] for i in range(len(document_scores) - 1)) 282 | 283 | # check that all documents returned are most similar to topic being searched 284 | document_indexes = [top2vec_model.doc_id2index[doc_id] for doc_id in document_ids] 285 | 286 | if reduced: 287 | doc_topics = set(np.argmax( 288 | np.inner(top2vec_model.document_vectors[document_indexes], 289 | top2vec_model.topic_vectors_reduced), axis=1)) 290 | else: 291 | doc_topics = set(np.argmax( 292 | np.inner(top2vec_model.document_vectors[document_indexes], 293 | top2vec_model.topic_vectors), axis=1)) 294 | assert len(doc_topics) == 1 and topic in doc_topics 295 | 296 | 297 | @pytest.mark.parametrize('top2vec_model', models) 298 | def test_search_documents_by_keywords(top2vec_model): 299 | keywords = top2vec_model.vocab 300 | keyword = keywords[-1] 301 | num_docs = 10 302 | 303 | if top2vec_model.documents is not None: 304 | documents, document_scores, document_ids = top2vec_model.search_documents_by_keywords(keywords=[keyword], 305 | num_docs=num_docs) 306 | else: 307 | document_scores, document_ids = top2vec_model.search_documents_by_keywords(keywords=[keyword], 308 | num_docs=num_docs) 309 | 310 | # check that for each document there is a score and number 311 | if top2vec_model.documents is not None: 312 | assert len(documents) == len(document_scores) == len(document_ids) == num_docs 313 | else: 314 | assert len(document_scores) == len(document_ids) == num_docs 315 | 316 | # check that documents are returned in decreasing order 317 | assert all(document_scores[i] >= document_scores[i + 1] for i in range(len(document_scores) - 1)) 318 | 319 | 320 | @pytest.mark.parametrize('top2vec_model', models) 321 | def test_similar_words(top2vec_model): 322 | keywords = top2vec_model.vocab 323 | keyword = keywords[-1] 324 | num_words = 20 325 | 326 | words, word_scores = top2vec_model.similar_words(keywords=[keyword], num_words=num_words) 327 | 328 | # check that there is a score for each word 329 | assert len(words) == len(word_scores) == num_words 330 | 331 | # check that words are returned in decreasing order 332 | assert all(word_scores[i] >= word_scores[i + 1] for i in range(len(word_scores) - 1)) 333 | 334 | 335 | @pytest.mark.parametrize('top2vec_model', models) 336 | @pytest.mark.parametrize('reduced', [False, True]) 337 | def test_search_topics(top2vec_model, reduced): 338 | num_topics = 
top2vec_model.get_num_topics(reduced=reduced) 339 | keywords = top2vec_model.vocab 340 | keyword = keywords[-1] 341 | topic_words, word_scores, topic_scores, topic_nums = top2vec_model.search_topics(keywords=[keyword], 342 | num_topics=num_topics, 343 | reduced=reduced) 344 | # check that for each topic there are topic words, word scores, topic scores and score of topic 345 | assert len(topic_words) == len(word_scores) == len(topic_scores) == len(topic_nums) == num_topics 346 | 347 | # check that for each topic words have scores 348 | assert len(topic_words[0]) == len(word_scores[0]) 349 | 350 | # check that topics are returned in decreasing order 351 | assert all(topic_scores[i] >= topic_scores[i + 1] for i in range(len(topic_scores) - 1)) 352 | 353 | # check that topics words are returned in decreasing order 354 | topic_words_scores = word_scores[0] 355 | assert all(topic_words_scores[i] >= topic_words_scores[i + 1] for i in range(len(topic_words_scores) - 1)) 356 | 357 | 358 | @pytest.mark.parametrize('top2vec_model', models) 359 | def test_search_document_by_documents(top2vec_model): 360 | doc_id = top2vec_model.document_ids[0] 361 | 362 | num_docs = 10 363 | 364 | if top2vec_model.documents is not None: 365 | documents, document_scores, document_ids = top2vec_model.search_documents_by_documents(doc_ids=[doc_id], 366 | num_docs=num_docs) 367 | else: 368 | document_scores, document_ids = top2vec_model.search_documents_by_documents(doc_ids=[doc_id], 369 | num_docs=num_docs) 370 | 371 | # check that for each document there is a score and number 372 | if top2vec_model.documents is not None: 373 | assert len(documents) == len(document_scores) == len(document_ids) == num_docs 374 | else: 375 | assert len(document_scores) == len(document_ids) == num_docs 376 | 377 | # check that documents are returned in decreasing order 378 | assert all(document_scores[i] >= document_scores[i + 1] for i in range(len(document_scores) - 1)) 379 | 380 | 381 | @pytest.mark.parametrize('top2vec_model', models) 382 | def test_get_documents_topics(top2vec_model): 383 | doc_ids_get = top2vec_model.document_ids[[0, 5]] 384 | 385 | if top2vec_model.hierarchy is not None: 386 | doc_topics, doc_dist, topic_words, topic_word_scores = top2vec_model.get_documents_topics(doc_ids=doc_ids_get, 387 | reduced=True) 388 | else: 389 | doc_topics, doc_dist, topic_words, topic_word_scores = top2vec_model.get_documents_topics(doc_ids=doc_ids_get) 390 | 391 | assert len(doc_topics) == len(doc_dist) == len(topic_words) == len(topic_word_scores) == len(doc_ids_get) 392 | 393 | 394 | @pytest.mark.parametrize('top2vec_model', models) 395 | def test_get_documents_topics_multiple(top2vec_model): 396 | doc_ids_get = top2vec_model.document_ids[[0, 1, 5]] 397 | num_topics = 2 398 | 399 | if top2vec_model.hierarchy is not None: 400 | doc_topics, doc_dist, topic_words, topic_word_scores = top2vec_model.get_documents_topics(doc_ids=doc_ids_get, 401 | reduced=True, 402 | num_topics=num_topics) 403 | 404 | actual_number_topics = top2vec_model.get_num_topics(reduced=True) 405 | 406 | else: 407 | doc_topics, doc_dist, topic_words, topic_word_scores = top2vec_model.get_documents_topics(doc_ids=doc_ids_get, 408 | num_topics=num_topics) 409 | 410 | actual_number_topics = top2vec_model.get_num_topics(reduced=False) 411 | 412 | assert len(doc_topics) == len(doc_dist) == len(topic_words) == len(topic_word_scores) == len(doc_ids_get) 413 | 414 | if num_topics <= actual_number_topics: 415 | assert doc_topics.shape[1] == num_topics 416 | assert 
doc_dist.shape[1] == num_topics 417 | assert topic_words.shape[1] == num_topics 418 | assert topic_word_scores.shape[1] == num_topics 419 | 420 | 421 | @pytest.mark.parametrize('top2vec_model', models) 422 | def test_search_documents_by_vector(top2vec_model): 423 | document_vectors = top2vec_model.document_vectors 424 | top2vec_model.search_documents_by_vector(vector=document_vectors[0], num_docs=10) 425 | 426 | num_docs = 10 427 | 428 | if top2vec_model.documents is not None: 429 | documents, document_scores, document_ids = top2vec_model.search_documents_by_vector(vector=document_vectors[0], 430 | num_docs=num_docs) 431 | else: 432 | document_scores, document_ids = top2vec_model.search_documents_by_vector(vector=document_vectors[0], 433 | num_docs=num_docs) 434 | if top2vec_model.documents is not None: 435 | assert len(documents) == len(document_scores) == len(document_ids) == num_docs 436 | else: 437 | assert len(document_scores) == len(document_ids) == num_docs 438 | 439 | # check that documents are returned in decreasing order 440 | assert all(document_scores[i] >= document_scores[i + 1] for i in range(len(document_scores) - 1)) 441 | 442 | 443 | @pytest.mark.parametrize('top2vec_model', models) 444 | def test_index_documents(top2vec_model): 445 | top2vec_model.index_document_vectors() 446 | assert top2vec_model.document_vectors.shape[1] <= top2vec_model.document_index.get_max_elements() 447 | 448 | 449 | @pytest.mark.parametrize('top2vec_model', models) 450 | def test_search_documents_by_vector_index(top2vec_model): 451 | document_vectors = top2vec_model.document_vectors 452 | top2vec_model.search_documents_by_vector(vector=document_vectors[0], num_docs=10) 453 | 454 | num_docs = 10 455 | 456 | if top2vec_model.documents is not None: 457 | documents, document_scores, document_ids = top2vec_model.search_documents_by_vector(vector=document_vectors[0], 458 | num_docs=num_docs, 459 | use_index=True) 460 | else: 461 | document_scores, document_ids = top2vec_model.search_documents_by_vector(vector=document_vectors[0], 462 | num_docs=num_docs, 463 | use_index=True) 464 | if top2vec_model.documents is not None: 465 | assert len(documents) == len(document_scores) == len(document_ids) == num_docs 466 | else: 467 | assert len(document_scores) == len(document_ids) == num_docs 468 | 469 | # check that documents are returned in decreasing order 470 | assert all(document_scores[i] >= document_scores[i + 1] for i in range(len(document_scores) - 1)) 471 | 472 | 473 | @pytest.mark.parametrize('top2vec_model', models) 474 | def test_search_documents_by_keywords_index(top2vec_model): 475 | keywords = top2vec_model.vocab 476 | keyword = keywords[-1] 477 | num_docs = 10 478 | 479 | if top2vec_model.documents is not None: 480 | documents, document_scores, document_ids = top2vec_model.search_documents_by_keywords(keywords=[keyword], 481 | num_docs=num_docs, 482 | use_index=True) 483 | else: 484 | document_scores, document_ids = top2vec_model.search_documents_by_keywords(keywords=[keyword], 485 | num_docs=num_docs, 486 | use_index=True) 487 | 488 | # check that for each document there is a score and number 489 | if top2vec_model.documents is not None: 490 | assert len(documents) == len(document_scores) == len(document_ids) == num_docs 491 | else: 492 | assert len(document_scores) == len(document_ids) == num_docs 493 | 494 | # check that documents are returned in decreasing order 495 | assert all(document_scores[i] >= document_scores[i + 1] for i in range(len(document_scores) - 1)) 496 | 497 | 498 | 
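# The *_index variants of the search tests repeat the corresponding queries with
# use_index=True, which routes them through the approximate nearest-neighbour (hnswlib)
# index built by index_document_vectors() / index_word_vectors() instead of an exact
# inner-product scan over all vectors. A minimal usage sketch, assuming an already
# trained model bound to the placeholder name `model`, an existing document id
# `some_id`, and keep_documents=True (the default) so documents are returned:
#     model.index_document_vectors()
#     docs, scores, ids = model.search_documents_by_documents(doc_ids=[some_id],
#                                                             num_docs=10,
#                                                             use_index=True)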
@pytest.mark.parametrize('top2vec_model', models) 499 | def test_search_document_by_documents_index(top2vec_model): 500 | doc_id = top2vec_model.document_ids[0] 501 | 502 | num_docs = 10 503 | 504 | if top2vec_model.documents is not None: 505 | documents, document_scores, document_ids = top2vec_model.search_documents_by_documents(doc_ids=[doc_id], 506 | num_docs=num_docs, 507 | use_index=True) 508 | else: 509 | document_scores, document_ids = top2vec_model.search_documents_by_documents(doc_ids=[doc_id], 510 | num_docs=num_docs, 511 | use_index=True) 512 | 513 | # check that for each document there is a score and number 514 | if top2vec_model.documents is not None: 515 | assert len(documents) == len(document_scores) == len(document_ids) == num_docs 516 | else: 517 | assert len(document_scores) == len(document_ids) == num_docs 518 | 519 | # check that documents are returned in decreasing order 520 | assert all(document_scores[i] >= document_scores[i + 1] for i in range(len(document_scores) - 1)) 521 | 522 | 523 | @pytest.mark.parametrize('top2vec_model', models) 524 | def test_search_words_by_vector(top2vec_model): 525 | word_vectors = top2vec_model.word_vectors 526 | top2vec_model.search_words_by_vector(vector=word_vectors[0], num_words=10) 527 | 528 | num_words = 10 529 | 530 | words, word_scores = top2vec_model.search_words_by_vector(vector=word_vectors[0], 531 | num_words=num_words) 532 | 533 | # check that there is a score for each word 534 | assert len(words) == len(word_scores) == num_words 535 | 536 | # check that words are returned in decreasing order 537 | assert all(word_scores[i] >= word_scores[i + 1] for i in range(len(word_scores) - 1)) 538 | 539 | 540 | @pytest.mark.parametrize('top2vec_model', models) 541 | def test_index_words(top2vec_model): 542 | top2vec_model.index_word_vectors() 543 | assert top2vec_model.word_vectors.shape[1] <= top2vec_model.word_index.get_max_elements() 544 | 545 | 546 | @pytest.mark.parametrize('top2vec_model', models) 547 | def test_similar_words_index(top2vec_model): 548 | keywords = top2vec_model.vocab 549 | keyword = keywords[-1] 550 | num_words = 20 551 | 552 | words, word_scores = top2vec_model.similar_words(keywords=[keyword], num_words=num_words, use_index=True) 553 | 554 | # check that there is a score for each word 555 | assert len(words) == len(word_scores) == num_words 556 | 557 | # check that words are returned in decreasing order 558 | assert all(word_scores[i] >= word_scores[i + 1] for i in range(len(word_scores) - 1)) 559 | 560 | 561 | @pytest.mark.parametrize('top2vec_model', models) 562 | def test_save_load(top2vec_model): 563 | if top2vec_model.embedding_model == "custom": 564 | model = top2vec_model.embed 565 | temp = tempfile.NamedTemporaryFile(mode='w+b') 566 | top2vec_model.save(temp.name) 567 | Top2Vec.load(temp.name) 568 | temp.close() 569 | top2vec_model.set_embedding_model(model) 570 | 571 | else: 572 | temp = tempfile.NamedTemporaryFile(mode='w+b') 573 | top2vec_model.save(temp.name) 574 | Top2Vec.load(temp.name) 575 | temp.close() 576 | 577 | 578 | @pytest.mark.parametrize('top2vec_model', models) 579 | def test_query_documents(top2vec_model): 580 | num_docs = 10 581 | 582 | if top2vec_model.documents is not None: 583 | documents, document_scores, document_ids = top2vec_model.query_documents(query="what is the meaning of life?", 584 | num_docs=num_docs) 585 | else: 586 | document_scores, document_ids = top2vec_model.query_documents(query="what is the meaning of life?", 587 | num_docs=num_docs) 588 | 589 | # check that for 
each document there is a score and number 590 | if top2vec_model.documents is not None: 591 | assert len(documents) == len(document_scores) == len(document_ids) == num_docs 592 | else: 593 | assert len(document_scores) == len(document_ids) == num_docs 594 | 595 | # check that documents are returned in decreasing order 596 | assert all(document_scores[i] >= document_scores[i + 1] for i in range(len(document_scores) - 1)) 597 | 598 | 599 | @pytest.mark.parametrize('top2vec_model', models) 600 | def test_query_topics(top2vec_model): 601 | num_topics = top2vec_model.get_num_topics() 602 | topic_words, word_scores, topic_scores, topic_nums = top2vec_model.query_topics(query="what is the " 603 | "meaning of life?", 604 | num_topics=num_topics) 605 | 606 | # check that for each topic there are topic words, word scores, topic scores and score of topic 607 | assert len(topic_words) == len(word_scores) == len(topic_scores) == len(topic_nums) == num_topics 608 | 609 | # check that for each topic words have scores 610 | assert len(topic_words[0]) == len(word_scores[0]) 611 | 612 | # check that topics are returned in decreasing order 613 | assert all(topic_scores[i] >= topic_scores[i + 1] for i in range(len(topic_scores) - 1)) 614 | 615 | # check that topics words are returned in decreasing order 616 | topic_words_scores = word_scores[0] 617 | assert all(topic_words_scores[i] >= topic_words_scores[i + 1] for i in range(len(topic_words_scores) - 1)) 618 | --------------------------------------------------------------------------------