├── .gitattributes
├── .gitignore
├── README.md
├── data
│   ├── demo.csv
│   └── vol7.json
├── images
│   ├── LeeTopic.png
│   ├── demo-new.JPG
│   ├── demo-search.jpg
│   ├── demo.png
│   ├── leet-demo.png
│   └── leettopic-logo.png
├── leet_topic
│   ├── __init__.py
│   ├── __pycache__
│   │   └── leet_topic.cpython-38.pyc
│   ├── bokeh_app.py
│   └── leet_topic.py
└── setup.py

/.gitattributes:
--------------------------------------------------------------------------------
1 | # Auto detect text files and perform LF normalization
2 | * text=auto
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | 
2 | annoy_index.ann
3 | demo.html
4 | embeddings.npy
5 | leet_topic.egg-info/dependency_links.txt
6 | leet_topic.egg-info/PKG-INFO
7 | leet_topic.egg-info/requires.txt
8 | leet_topic.egg-info/SOURCES.txt
9 | leet_topic.egg-info/top_level.txt
10 | leet_topic/__pycache__/__init__.cpython-310.pyc
11 | leet_topic/__pycache__/bokeh_app.cpython-310.pyc
12 | testing.ipynb
13 | leet_topic/__pycache__/leet_topic.cpython-310.pyc
14 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | [![PyPI - PyPi](https://img.shields.io/pypi/v/leet-topic)](https://pypi.org/project/leet-topic/)
2 | 
3 | ![Leet Topic Logo](https://github.com/wjbmattingly/LeetTopic/raw/main/images/LeeTopic.png)
4 | 
5 | LeetTopic builds upon [Top2Vec](https://github.com/ddangelov/Top2Vec), [BERTopic](https://github.com/MaartenGr/BERTopic), and other transformer-based topic modeling Python libraries. Unlike BERTopic and Top2Vec, LeetTopic lets users control the degree to which outliers are resolved into neighboring topics.
6 | 
7 | It also lets you turn any DataFrame into a [Bokeh](https://bokeh.org/) application for exploring your documents and topics. As of 0.0.10, LeetTopic can also generate an [Annoy](https://github.com/spotify/annoy) index as part of the LeetTopic pipeline, which allows users to query their data.
8 | 
9 | # Installation
10 | 
11 | ```bash
12 | pip install leet-topic
13 | ```
14 | 
15 | # Parameters
16 | - df => a Pandas DataFrame that contains the documents that you want to model
17 | - document_field => the DataFrame column name where your documents sit
18 | - html_filename => the filename used to generate the Bokeh application
19 | - extra_fields => a list of extra columns to include in the Bokeh application
20 | - max_distance => the maximum distance an outlier document can be from the nearest topic centroid and still be assigned to that topic
21 | 
22 | # Usage
23 | 
24 | ```python
25 | import pandas as pd
26 | from leet_topic import leet_topic
27 | 
28 | df = pd.read_json("data/vol7.json")
29 | leet_df, topic_data = leet_topic.LeetTopic(df,
30 |                                            document_field="descriptions",
31 |                                            html_filename="demo.html",
32 |                                            extra_fields=["names", "hdbscan_labels"],
33 |                                            max_distance=.5)
34 | ```
35 | 
36 | ## Multilingual Support
37 | With LeetTopic, you can work with texts in any language supported by spaCy for lemmatization and embed them with any HuggingFace model available through Sentence Transformers.
38 | 
39 | Here is an example working with Croatian:
40 | 
41 | ```python
42 | import pandas as pd
43 | from leet_topic import leet_topic
44 | 
45 | df = pd.DataFrame(["Bok. Kako ste?", "Drago mi je"]*20, columns=["text"])
46 | leet_df, topic_data = leet_topic.LeetTopic(df,
47 |                                            document_field="text",
48 |                                            html_filename="demo.html",
49 |                                            extra_fields=["hdbscan_labels"],
50 |                                            spacy_model="hr_core_news_sm",
51 |                                            max_distance=.5)
52 | ```
53 | 
54 | ## Custom UMAP and HDBSCAN Parameters
55 | It is often necessary to control how your embeddings are flattened with UMAP and clustered with HDBSCAN. As of 0.0.9, you can control these parameters with dictionaries.
56 | 
57 | ```python
58 | import pandas as pd
59 | from leet_topic import leet_topic
60 | 
61 | df = pd.read_json("data/vol7.json")
62 | leet_df, topic_data = leet_topic.LeetTopic(df,
63 |                                            document_field="descriptions",
64 |                                            html_filename="demo.html",
65 |                                            extra_fields=["names", "hdbscan_labels"],
66 |                                            umap_params={"n_neighbors": 15, "min_dist": 0.01, "metric": 'correlation'},
67 |                                            hdbscan_params={"min_samples": 10, "min_cluster_size": 5},
68 |                                            max_distance=.5)
69 | ```
70 | 
71 | ## Create an Annoy Index
72 | As of 0.0.10, users can also have the pipeline return an Annoy index.
73 | 
74 | ```python
75 | import pandas as pd
76 | from leet_topic import leet_topic
77 | 
78 | df = pd.read_json("data/vol7.json")
79 | leet_df, topic_data, annoy_index = leet_topic.LeetTopic(df, "descriptions",
80 |                                                         "demo.html",
81 |                                                         build_annoy=True)
82 | ```
83 | 
84 | With the Annoy index, one can easily build a semantic search engine. Query the index, for example, by encoding a new text with the same model used to embed the documents.
85 | 
86 | ```python
87 | import pandas as pd
88 | from leet_topic import leet_topic
89 | from sentence_transformers import SentenceTransformer
90 | 
91 | 
92 | model = SentenceTransformer('all-MiniLM-L6-v2')  # same encoding model used by the pipeline above
93 | 
94 | emb = model.encode("An individual who was arrested.")
95 | 
96 | res = annoy_index.get_nns_by_vector(emb, 10)  # annoy_index and df come from the earlier LeetTopic call
97 | 
98 | print(df.iloc[res].descriptions.tolist())
99 | 
100 | ```
101 | 
102 | 
103 | # Outputs
104 | The code above generates a new DataFrame with the UMAP projection (x, y), hdbscan_labels, leet_labels, and the top-n words for each document. It also outputs data about each topic, including the centroid of each topic, the documents assigned to it, and the top-n words associated with it.
105 | 
106 | Finally, the pipeline creates an HTML file that is a self-contained Bokeh application, like the image below.
107 | 
108 | ![demo](https://github.com/wjbmattingly/LeetTopic/raw/main/images/demo-new.JPG)
109 | 
110 | # Steps
111 | 
112 | LeetTopic takes an input DataFrame and converts the document field (the texts to model) into embeddings via a transformer model. Next, UMAP reduces the embeddings to 2 dimensions, and HDBSCAN assigns documents to topics. Like BERTopic and Top2Vec, at this stage there are many outliers (documents assigned to topic -1).
113 | 
114 | LeetTopic, like Top2Vec, then calculates the centroid of each topic from the HDBSCAN labels, ignoring topic -1 (the outliers). Next, each outlier document is assigned to the nearest topic centroid. Unlike Top2Vec, LeetTopic lets the user set a maximum distance, so outliers that are too far from every topic centroid are left unassigned. At the same time, the output DataFrame retains the original HDBSCAN labels, so users know whether a document was originally an outlier.
115 | 
116 | 
117 | 
118 | # Future Roadmap
119 | ## 0.0.9
120 | - Control UMAP parameters
121 | - Control HDBSCAN parameters
122 | - Multilingual support for lemmatization
123 | - Multilingual support for embedding
124 | - Add support for custom app titles
125 | 
126 | ## 0.0.10
127 | - Output an Annoy index so that the data can be queried
128 | 
129 | ## 0.0.11
130 | - Support for embedding text, images, or both via CLIP and displaying the results in the same Bokeh application
131 | 
--------------------------------------------------------------------------------
/images/LeeTopic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wjbmattingly/LeetTopic/fd0d71713c8570c53e3e3a0327b14a5183e20112/images/LeeTopic.png
--------------------------------------------------------------------------------
/images/demo-new.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wjbmattingly/LeetTopic/fd0d71713c8570c53e3e3a0327b14a5183e20112/images/demo-new.JPG
--------------------------------------------------------------------------------
/images/demo-search.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wjbmattingly/LeetTopic/fd0d71713c8570c53e3e3a0327b14a5183e20112/images/demo-search.jpg
--------------------------------------------------------------------------------
/images/demo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wjbmattingly/LeetTopic/fd0d71713c8570c53e3e3a0327b14a5183e20112/images/demo.png
--------------------------------------------------------------------------------
/images/leet-demo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wjbmattingly/LeetTopic/fd0d71713c8570c53e3e3a0327b14a5183e20112/images/leet-demo.png
--------------------------------------------------------------------------------
/images/leettopic-logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wjbmattingly/LeetTopic/fd0d71713c8570c53e3e3a0327b14a5183e20112/images/leettopic-logo.png
--------------------------------------------------------------------------------
/leet_topic/__init__.py:
--------------------------------------------------------------------------------
1 | from leet_topic.leet_topic import *
2 | 
--------------------------------------------------------------------------------
/leet_topic/__pycache__/leet_topic.cpython-38.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wjbmattingly/LeetTopic/fd0d71713c8570c53e3e3a0327b14a5183e20112/leet_topic/__pycache__/leet_topic.cpython-38.pyc
--------------------------------------------------------------------------------
/leet_topic/bokeh_app.py:
--------------------------------------------------------------------------------
1 | from bokeh.layouts import row, column
2 | from bokeh.models import ColumnDataSource, CustomJS, DataTable, TableColumn, MultiChoice, HTMLTemplateFormatter, TextAreaInput, Div, TextInput
3 | from bokeh.plotting import figure, output_file, show
4 | 
5 | from bokeh.palettes import Category10, Cividis256, Turbo256
6 | from bokeh.transform import linear_cmap, factor_cmap  # both color mappers are used in get_color_mapping below
7 | 
8 | import pandas as pd
9 | import numpy as np
10 | 
11 | from typing import Tuple, Optional
12 | import 
bokeh 13 | import bokeh.transform 14 | 15 | 16 | 17 | #From Bulk Library 18 | def get_color_mapping( 19 | df: pd.DataFrame, 20 | topic_field, 21 | ) -> Tuple[Optional[bokeh.transform.transform], pd.DataFrame]: 22 | """Creates a color mapping""" 23 | 24 | color_datatype = str(df[topic_field].dtype) 25 | if color_datatype == "object": 26 | df[topic_field] = df[topic_field].apply( 27 | lambda x: str(x) if not (type(x) == float and np.isnan(x)) else x 28 | ) 29 | all_values = list(df[topic_field].dropna().unique()) 30 | if len(all_values) == 2: 31 | all_values.extend([""]) 32 | elif len(all_values) > len(Category10) + 2: 33 | raise ValueError( 34 | f"Too many classes defined, the limit for visualisation is {len(Category10) + 2}. " 35 | f"Got {len(all_values)}." 36 | ) 37 | mapper = factor_cmap( 38 | field_name=topic_field, 39 | palette=Category10[len(all_values)], 40 | factors=all_values, 41 | nan_color="grey", 42 | ) 43 | elif color_datatype.startswith("float") or color_datatype.startswith("int"): 44 | all_values = df[topic_field].dropna().values 45 | mapper = linear_cmap( 46 | field_name=topic_field, 47 | palette=Turbo256, 48 | low=all_values.min(), 49 | high=all_values.max(), 50 | nan_color="grey", 51 | ) 52 | else: 53 | raise TypeError( 54 | f"We currently only support the following type for 'color' column: 'int*', 'float*', 'object'. " 55 | f"Got {color_datatype}." 56 | ) 57 | return mapper, df 58 | 59 | 60 | def create_html(df, document_field, topic_field, html_filename, topic_data, tf_idf, extra_fields=[], app_name=""): 61 | fields = ["x", "y", document_field, topic_field, "selected"] 62 | fields = fields+extra_fields 63 | output_file(html_filename) 64 | 65 | mapper, df = get_color_mapping(df, topic_field) 66 | df['selected'] = False 67 | categories = df[topic_field].unique() 68 | categories = [str(x) for x in categories] 69 | 70 | 71 | 72 | s1 = ColumnDataSource(df) 73 | 74 | 75 | columns = [ 76 | TableColumn(field=topic_field, title=topic_field, width=10), 77 | TableColumn(field=document_field, title=document_field, width=500), 78 | ] 79 | for field in extra_fields: 80 | columns.append(TableColumn(field=field, title=field, width=100)) 81 | 82 | 83 | p1 = figure(width=500, height=500, tools="pan,tap,wheel_zoom,lasso_select,box_zoom,box_select,reset", active_scroll="wheel_zoom", title="Select Here", x_range=(df.x.min(), df.x.max()), y_range=(df.y.min(), df.y.max())) 84 | circle_kwargs = {"x": "x", "y": "y", 85 | "size": 3, 86 | "source": s1, 87 | "color": mapper 88 | } 89 | scatter = p1.circle(**circle_kwargs) 90 | 91 | s2 = ColumnDataSource(data=dict(x=[], y=[], leet_labels=[])) 92 | p2 = figure(width=500, height=500, tools="pan,tap,lasso_select,wheel_zoom,box_zoom,box_select,reset", active_scroll="wheel_zoom", title="Analyze Selection", x_range=(df.x.min(), df.x.max()), y_range=(df.y.min(), df.y.max())) 93 | 94 | circle_kwargs2 = {"x": "x", "y": "y", 95 | "size": 3, 96 | "source": s2, 97 | "color": mapper 98 | } 99 | scatter2 = p2.circle(**circle_kwargs2) 100 | 101 | multi_choice = MultiChoice(value=[], options=categories, width = 500, title='Selection:') 102 | data_table = DataTable(source=s2, 103 | columns=columns, 104 | width=700, 105 | height=500, 106 | sortable=True, 107 | autosize_mode='none') 108 | selected_texts = TextAreaInput(value = "", title = "Selected texts", width = 700, height=500) 109 | top_search_results = TextAreaInput(value = "", title = "Search Results", width = 250, height=500) 110 | top_search = TextInput(title="Topic Search") 111 | doc_search_results = 
TextAreaInput(value = "", title = "Search Results", width = 250, height=500) 112 | doc_search = TextInput(title="Document Search") 113 | topic_desc = TextAreaInput(value = "", title = "Topic Descriptions", width = 500, height=500) 114 | 115 | def field_string(field): 116 | return """d2['"""+field+"""'] = []\n""" 117 | 118 | def push_string(field): 119 | return """d2['"""+field+"""'].push(d1['"""+field+"""'][inds[i]])\n""" 120 | 121 | def indices_string(field): 122 | return """d2['"""+field+"""'].push(d1['"""+field+"""'][s1.selected.indices[i]])\n""" 123 | 124 | def push_string2(field): 125 | return """d2['"""+field+"""'].push(d1['"""+field+"""'][i])\n""" 126 | 127 | def list_creator(fields, str_type=""): 128 | main_str = "" 129 | for field in fields: 130 | if str_type == "field": 131 | main_str=main_str+field_string(field) 132 | elif str_type == "push": 133 | main_str=main_str+push_string(field) 134 | elif str_type == "indices": 135 | main_str=main_str+indices_string(field) 136 | elif str_type == "push2": 137 | main_str=main_str+push_string2(field) 138 | return main_str 139 | 140 | s1.selected.js_on_change('indices', CustomJS(args=dict(s1=s1, s2=s2, s4=multi_choice), code=""" 141 | const inds = cb_obj.indices; 142 | const d1 = s1.data; 143 | const d2 = s2.data; 144 | const d4 = s4;"""+list_creator(fields=fields, str_type="field")+ 145 | """for (let i = 0; i < inds.length; i++) {"""+ 146 | list_creator(fields=fields, str_type="push")+ 147 | """} 148 | const res = [...new Set(d2['"""+topic_field+"""'])]; 149 | d4.value = res.map(function(e){return e.toString()}); 150 | s1.change.emit(); 151 | s2.change.emit(); 152 | """) 153 | ) 154 | 155 | 156 | multi_choice.js_on_change('value', CustomJS(args=dict(s1=s1, s2=s2, scatter=scatter, topic_desc=topic_desc, topic_data=topic_data, tf_idf=tf_idf), code=""" 157 | let values = cb_obj.value; 158 | let unchange_values = cb_obj.value; 159 | const d1 = s1.data; 160 | const d2 = s2.data; 161 | const plot = scatter; 162 | s2.selected.indices = []; 163 | for (let i = 0; i < s1.selected.indices.length; i++) { 164 | for (let j =0; j < values.length; j++) { 165 | if (d1."""+topic_field+"""[s1.selected.indices[i]] == values[j]) { 166 | values = values.filter(item => item !== values[j]); 167 | } 168 | } 169 | } 170 | """+list_creator(fields=fields, str_type="field")+ 171 | """ 172 | for (let i = 0; i < s1.selected.indices.length; i++) { 173 | if (unchange_values.includes(String(d1."""+topic_field+"""[s1.selected.indices[i]]))) { 174 | """+ 175 | list_creator(fields=fields, str_type="indices")+ 176 | """ 177 | } 178 | } 179 | for (let i = 0; i < d1."""+topic_field+""".length; i++) { 180 | if (values.includes(String(d1."""+topic_field+"""[i]))) { 181 | """+ 182 | list_creator(fields=fields, str_type="push2")+ 183 | """ 184 | } 185 | } 186 | if (tf_idf) { 187 | let data = []; 188 | for (const key of Object.keys(topic_data)) { 189 | for (let i=0; i < unchange_values.length; i++) { 190 | if (key == unchange_values[i]) { 191 | let keywords = topic_data[key]["key_words"]; 192 | data.push("Topic " + key + ": "); 193 | for (let i=0; i < keywords.length; i++) { 194 | data.push(keywords[i][0] + " " + keywords[i][1]); 195 | } 196 | data.push("\\r\\n"); 197 | } 198 | } 199 | } 200 | topic_desc.value = data.join("\\r\\n"); 201 | s2.change.emit(); 202 | } 203 | """) 204 | ) 205 | 206 | 207 | s2.selected.js_on_change('indices', CustomJS(args=dict(s1=s1, s2=s2, s_texts=selected_texts), code=""" 208 | const inds = cb_obj.indices; 209 | const d1 = s1.data; 210 | const d2 = 
s2.data; 211 | const texts = s_texts.value; 212 | s_texts.value = ""; 213 | const data = []; 214 | for (let i = 0; i < inds.length; i++) { 215 | data.push(" (Topic: " + d2['"""+topic_field+"""'][inds[i]] + ")") 216 | data.push("Document: " + d2['"""+document_field+"""'][inds[i]]) 217 | data.push("\\r\\n") 218 | } 219 | s2.change.emit(); 220 | s_texts.value = data.join("\\r\\n") 221 | s_texts.change.emit(); 222 | """) 223 | ) 224 | 225 | top_search.js_on_change('value', CustomJS(args=dict(topic_data=topic_data, top_search_results=top_search_results, s4=multi_choice, s1=s1), code=""" 226 | s1.selected.indices = [] 227 | const search_term = cb_obj.value; 228 | let hits = []; 229 | let counter = 0; 230 | for (const key of Object.keys(topic_data)) { 231 | const keywords = topic_data[key]["key_words"]; 232 | for (let i=0; i < keywords.length; i++) { 233 | if (keywords[i][0] == search_term) { 234 | hits.push([key, i]); 235 | } 236 | } 237 | } 238 | hits.sort(function(a, b) { 239 | return a[1] - b[1]; 240 | }); 241 | 242 | const data = []; 243 | if (hits.length) { 244 | for (let i = 0; i < hits.length; i++) { 245 | data.push('Topic ' + hits[i][0] + ' has "' + search_term + '" as number ' + hits[i][1] + ' in its keyword list.'); 246 | data.push("\\r\\n"); 247 | } 248 | } else if (search_term != "") { 249 | data.push('No keyword matches with any topic for "' + search_term + '".'); 250 | } 251 | 252 | top_search_results.value = data.join("\\r\\n"); 253 | 254 | let inds = []; 255 | for (let i=0; i < hits.length; i++) { 256 | inds.push(hits[i][0]); 257 | } 258 | 259 | const res = [...new Set(inds)]; 260 | 261 | s4.value = res.map(function(e){return e.toString()}); 262 | 263 | """) 264 | ) 265 | 266 | doc_search.js_on_change('value', CustomJS(args=dict(s1=s1, s2=s2, df=df.to_dict(), doc_search_results=doc_search_results, s4=multi_choice), code=""" 267 | s1.selected.indices = [] 268 | const search_term = cb_obj.value; 269 | let hits = []; 270 | let counter = 0; 271 | let id_count = 0; 272 | for (let i = 0; i < s1.data.top_words.length; i++) { 273 | for (let j = 0; j {app_name}') 333 | layout = column(title, app_row, sizing_mode='scale_width') 334 | else: 335 | layout=app_row 336 | show(layout) -------------------------------------------------------------------------------- /leet_topic/leet_topic.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from sentence_transformers import SentenceTransformer 3 | import umap 4 | import hdbscan 5 | import math 6 | import spacy 7 | from sklearn.feature_extraction.text import TfidfVectorizer 8 | import numpy as np 9 | import string 10 | import logging 11 | import warnings 12 | from annoy import AnnoyIndex 13 | 14 | from .bokeh_app import create_html 15 | 16 | warnings.filterwarnings("ignore") 17 | logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO) 18 | 19 | 20 | def create_labels(df, document_field, encoding_model, 21 | umap_params={"n_neighbors": 50, "min_dist": 0.01, "metric": 'correlation'}, 22 | hdbscan_params={"min_samples": 10, "min_cluster_size": 50}, 23 | doc_embeddings=None): 24 | # print(type(doc_embeddings)) 25 | if str(type(doc_embeddings)) == "": 26 | #Load Transformer Model 27 | model = SentenceTransformer(encoding_model) 28 | 29 | #Create Document Embeddings 30 | logging.info("Encoding Documents") 31 | doc_embeddings = model.encode(df[document_field]) 32 | logging.info("Saving Embeddings") 33 | np.save("embeddings", doc_embeddings) 34 | 35 | #Create UMAP Projection 36 
| logging.info("Creating UMAP Projections") 37 | umap_proj = umap.UMAP(**umap_params).fit_transform(doc_embeddings) 38 | 39 | #Create HDBScan Label 40 | logging.info("Finding Clusters with HDBScan") 41 | hdbscan_labels = hdbscan.HDBSCAN(**hdbscan_params).fit_predict(umap_proj) 42 | df["x"] = umap_proj[:,0] 43 | df["y"] = umap_proj[:,1] 44 | df["hdbscan_labels"] = hdbscan_labels 45 | 46 | return df, doc_embeddings 47 | 48 | def find_centers(df): 49 | #Get coordinates for each document in each topic 50 | topic_data = {} 51 | for i, topic in enumerate(df.hdbscan_labels.tolist()): 52 | if topic != -1: 53 | if topic not in topic_data: 54 | topic_data[topic] = {"center": [], "coords": []} 55 | topic_data[topic]["coords"].append((df["x"][i], df["y"][i])) 56 | 57 | #Calculate the center of the topic 58 | for topic, data in topic_data.items(): 59 | x = [coord[0] for coord in data["coords"]] 60 | y = [coord[1] for coord in data["coords"]] 61 | c = (x, y) 62 | topic_data[topic]["center"] = (sum(c[0])/len(c[0]),sum(c[1])/len(c[1])) 63 | return topic_data 64 | 65 | def get_leet_labels(df, topic_data, max_distance): 66 | # Get New Topic Numbers 67 | leet_labels = [] 68 | for i, topic in enumerate(df.hdbscan_labels.tolist()): 69 | if topic == -1: 70 | closest = -1 71 | distance = max_distance 72 | for topic_num, coords in topic_data.items(): 73 | center = coords["center"] 74 | current_distance = math.dist(center, (df["x"][i], df["y"][i])) 75 | if current_distance < max_distance and current_distance < distance: 76 | closest = topic_num 77 | distance = current_distance 78 | leet_labels.append(closest) 79 | else: 80 | leet_labels.append(topic) 81 | df["leet_labels"] = leet_labels 82 | logging.info(f"{df.hdbscan_labels.tolist().count(-1)} Outliers reduced to {leet_labels.count(-1)}") 83 | return df 84 | 85 | 86 | def create_tfidf(df, topic_data, document_field, spacy_model): 87 | nlp = spacy.load(spacy_model, disable=["ner", "attribute_ruler", "tagger", "parser"]) 88 | lemma_docs = [" ".join([token.lemma_.lower() for token in nlp(text) if token.text not in string.punctuation]) for text in df[document_field].tolist()] 89 | 90 | vectorizer = TfidfVectorizer(stop_words="english") 91 | vectors = vectorizer.fit_transform(lemma_docs) 92 | feature_names = vectorizer.get_feature_names_out() 93 | dense = vectors.todense() 94 | denselist = dense.tolist() 95 | tfidf_df = pd.DataFrame(denselist, columns=feature_names) 96 | 97 | top_n = 10 98 | tfidf_words = [] 99 | for vector in vectors: 100 | top_words = (sorted(list(zip(vectorizer.get_feature_names_out(), vector.sum(0).getA1())), key=lambda x: x[1], reverse=True)[:top_n]) 101 | tfidf_words.append(top_words) 102 | df["top_words"] = tfidf_words 103 | 104 | if df.leet_labels.tolist().count(-1) > 0: 105 | topic_data[-1] = {} 106 | for leet_label, lemmas in zip(df.leet_labels.tolist(), lemma_docs): 107 | if "doc_lemmas" not in topic_data[leet_label]: 108 | topic_data[leet_label]["doc_lemmas"] = [] 109 | topic_data[leet_label]["doc_lemmas"].append(lemmas) 110 | 111 | for leet_label, data in topic_data.items(): 112 | # Apply the transformation using the already fitted vectorizer 113 | X = vectorizer.transform(data["doc_lemmas"]) 114 | words = (sorted(list(zip(vectorizer.get_feature_names_out(), X.sum(0).getA1())), key=lambda x: x[1], reverse=True)[:top_n]) 115 | topic_data[leet_label]["key_words"] = words 116 | 117 | return df, topic_data 118 | 119 | 120 | def calculate_topic_relevance(df, topic_data): 121 | rel2topic = [] 122 | for idx, row in df.iterrows(): 123 | topic_num 
= row.leet_labels 124 | if topic_num != -1: 125 | if "relevance_docs" not in topic_data[topic_num]: 126 | topic_data[topic_num]["relevance_docs"] = [] 127 | score = math.dist(topic_data[topic_num]["center"], (row["x"], row["y"])) 128 | rel2topic.append(score) 129 | topic_data[topic_num]["relevance_docs"].append((idx, score)) 130 | else: 131 | rel2topic.append((idx, 0)) 132 | for topic_num, data in topic_data.items(): 133 | if topic_num != -1: 134 | data["relevance_docs"].sort(key = lambda x: x[1]) 135 | data["relevance_docs"].reverse() 136 | return df, topic_data 137 | 138 | 139 | def download_spacy_model(spacy_model): 140 | try: 141 | nlp = spacy.load(spacy_model) 142 | except OSError: 143 | print(f'Downloading language model ({spacy_model}) for the spaCy POS tagger\n' 144 | "(don't worry, this will only happen once)") 145 | from spacy.cli import download 146 | download(spacy_model) 147 | 148 | def create_annoy(doc_embeddings, 149 | annoy_filename="annoy_index.ann", 150 | annoy_branches=10, 151 | annoy_metric="angular" 152 | ): 153 | 154 | t = AnnoyIndex(doc_embeddings.shape[1], annoy_metric) 155 | for idx, embedding in enumerate(doc_embeddings): 156 | t.add_item(idx, embedding) 157 | 158 | t.build(annoy_branches) 159 | if ".ann" not in annoy_filename: 160 | annoy_filename = annoy_filename+".ann" 161 | t.save(annoy_filename) 162 | 163 | return t 164 | 165 | 166 | def LeetTopic(df: pd.DataFrame, 167 | document_field: str, 168 | html_filename: str, 169 | extra_fields=[], 170 | max_distance=.5, 171 | tf_idf = False, 172 | spacy_model="en_core_web_sm", 173 | encoding_model='all-MiniLM-L6-v2', 174 | save_embeddings=True, 175 | doc_embeddings = None, 176 | umap_params={"n_neighbors": 50, "min_dist": 0.01, "metric": 'correlation'}, 177 | hdbscan_params={"min_samples": 10, "min_cluster_size": 50}, 178 | app_name="", 179 | build_annoy=False, 180 | annoy_filename="annoy_index.ann", 181 | annoy_branches=10, 182 | annoy_metric="angular" 183 | ): 184 | """ 185 | Parameters 186 | ---------- 187 | df: pd.DataFrame 188 | DataFrame that contains at least one field that are the documents you wish to model 189 | document_field: str 190 | a string that is the name of the column in which the documents in the DataFrame sit 191 | html_filename: str 192 | the name of the html file that will be created by the LeetTopic pipeline 193 | extra_fields: list of str (Optional) 194 | These are the names of the columns you wish to include in the Bokeh application. 
195 | max_distance: float (Optional default .5) 196 | The maximum distance an outlier document can be to the nearest topic vector to be assigned 197 | spacy_model: str (Optional default en_core_web_sm) 198 | the spaCy language model you will use for lemmatization 199 | encoding_model: str (Optional default all-MiniLM-L6-v2) 200 | the sentence transformers model that you wish to use to encode your documents 201 | umap_params: dict (Optional default {"n_neighbors": 50, "min_dist": 0.01, "metric": 'correlation'}) 202 | dictionary of keys to UMAP params and values for those params 203 | hdbscan_params: dict (Optional default {"min_samples": 10, "min_cluster_size": 50}) 204 | dictionary of keys to HBDscan params and values for those params 205 | app_name: str (Optional) 206 | title of your Bokeh application 207 | Returns 208 | ---------- 209 | df: pd.DataFrame 210 | This is the new dataframe that contains the metadata generated from the LeetTopic pipeline 211 | topic_data: dict 212 | This is topic-centric data generated by the LeetTopic pipeline 213 | """ 214 | 215 | download_spacy_model(spacy_model) 216 | 217 | df, doc_embeddings = create_labels(df, document_field, 218 | encoding_model, doc_embeddings=doc_embeddings, 219 | umap_params=umap_params, hdbscan_params=hdbscan_params) 220 | logging.info("Calculating the Center of the Topic Clusters") 221 | topic_data = find_centers(df) 222 | logging.info(f"Recalculating clusters based on a max distance of {max_distance} from any topic vector") 223 | df = get_leet_labels(df, topic_data, max_distance) 224 | 225 | 226 | if tf_idf==True: 227 | logging.info("Creating TF-IDF representation for documents") 228 | df, topic_data = create_tfidf(df, topic_data, document_field, spacy_model) 229 | 230 | logging.info("Creating Topic Relevance") 231 | df, topic_data = calculate_topic_relevance(df, topic_data) 232 | 233 | logging.info("Generating custom Bokeh application") 234 | create_html(df, 235 | document_field=document_field, 236 | topic_field="leet_labels", 237 | html_filename=html_filename, 238 | topic_data=topic_data, 239 | tf_idf=tf_idf, 240 | extra_fields=extra_fields, 241 | app_name=app_name, 242 | ) 243 | df = df.drop("selected", axis=1) 244 | 245 | if build_annoy == True: 246 | logging.info(f"Building an Annoy Index and saving it to {annoy_filename}") 247 | annoy_index = create_annoy(doc_embeddings, 248 | annoy_filename=annoy_filename, 249 | annoy_branches=annoy_branches, 250 | annoy_metric=annoy_metric) 251 | return df, topic_data, annoy_index 252 | 253 | return df, topic_data -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | with open("README.md", "r", encoding="utf-8") as f: 4 | LONG_DESCRIPTION = f.read() 5 | 6 | VERSION = '0.0.10' 7 | DESCRIPTION = 'A new transformer-based topic modeling library.' 
8 | 9 | setup( 10 | name="leet_topic", 11 | author="WJB Mattingly, Joel Lee", 12 | version=VERSION, 13 | description=DESCRIPTION, 14 | long_description=LONG_DESCRIPTION, 15 | long_description_content_type='text/markdown', 16 | packages=find_packages(), 17 | install_requires=["pandas>=1.0.0,<2.0.0", 18 | "bokeh>=2.4.0, <2.4.3", 19 | "sentence_transformers>=2.0.0", 20 | "umap-learn>=0.5.0", 21 | "hdbscan>=0.8.0", 22 | "protobuf>=4.24.2", 23 | "wrapt==1.14.0", 24 | "tensorflow>=2.8.0", 25 | "spacy>=3.3.0", 26 | "gensim>=4.3.0", 27 | "annoy>=1.17.0" 28 | ], 29 | ) 30 | --------------------------------------------------------------------------------