├── .gitattributes
├── .gitignore
├── README.md
├── data
│   ├── demo.csv
│   └── vol7.json
├── images
│   ├── LeeTopic.png
│   ├── demo-new.JPG
│   ├── demo-search.jpg
│   ├── demo.png
│   ├── leet-demo.png
│   └── leettopic-logo.png
├── leet_topic
│   ├── __init__.py
│   ├── __pycache__
│   │   └── leet_topic.cpython-38.pyc
│   ├── bokeh_app.py
│   └── leet_topic.py
└── setup.py

/.gitattributes:
--------------------------------------------------------------------------------
1 | # Auto detect text files and perform LF normalization
2 | * text=auto
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | 
2 | annoy_index.ann
3 | demo.html
4 | embeddings.npy
5 | leet_topic.egg-info/dependency_links.txt
6 | leet_topic.egg-info/PKG-INFO
7 | leet_topic.egg-info/requires.txt
8 | leet_topic.egg-info/SOURCES.txt
9 | leet_topic.egg-info/top_level.txt
10 | leet_topic/__pycache__/__init__.cpython-310.pyc
11 | leet_topic/__pycache__/bokeh_app.cpython-310.pyc
12 | testing.ipynb
13 | leet_topic/__pycache__/leet_topic.cpython-310.pyc
14 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | [![PyPI - PyPi](https://img.shields.io/pypi/v/leet-topic)](https://pypi.org/project/leet-topic/)
2 | 
3 | ![Leet Topic Logo](https://github.com/wjbmattingly/LeetTopic/raw/main/images/LeeTopic.png)
4 | 
5 | LeetTopic builds upon [Top2Vec](https://github.com/ddangelov/Top2Vec), [BERTopic](https://github.com/MaartenGr/BERTopic), and other transformer-based topic modeling Python libraries. Unlike BERTopic and Top2Vec, LeetTopic lets users control the degree to which outliers are resolved into neighboring topics.
6 | 
7 | It also lets you turn any DataFrame into a [Bokeh](https://bokeh.org/) application for exploring your documents and topics. As of 0.0.10, LeetTopic can also generate an [Annoy](https://github.com/spotify/annoy) index as part of the LeetTopic pipeline, which allows users to query their data.
8 | 
9 | # Installation
10 | 
11 | ```bash
12 | pip install leet-topic
13 | ```
14 | 
15 | # Parameters
16 | - df => a Pandas DataFrame that contains the documents that you want to model
17 | - document_field => the DataFrame column name where your documents sit
18 | - html_filename => the filename used to generate the Bokeh application
19 | - extra_fields => a list of extra columns to include in the Bokeh application
20 | - max_distance => the maximum distance an outlier document can be from the nearest topic centroid and still be assigned to that topic
21 | 
22 | # Usage
23 | 
24 | ```python
25 | import pandas as pd
26 | from leet_topic import leet_topic
27 | 
28 | df = pd.read_json("data/vol7.json")
29 | leet_df, topic_data = leet_topic.LeetTopic(df,
30 |                                            document_field="descriptions",
31 |                                            html_filename="demo.html",
32 |                                            extra_fields=["names", "hdbscan_labels"],
33 |                                            max_distance=.5)
34 | ```
35 | 
36 | ## Multilingual Support
37 | With LeetTopic, you can work with texts in any language supported by spaCy for lemmatization and embed them with any HuggingFace model available through Sentence Transformers.
38 | 
39 | Here is an example working with Croatian:
40 | 
41 | ```python
42 | import pandas as pd
43 | from leet_topic import leet_topic
44 | 
45 | df = pd.DataFrame(["Bok. Kako ste?", "Drago mi je"]*20, columns=["text"])
46 | leet_df, topic_data = leet_topic.LeetTopic(df,
47 |                                            document_field="text",
48 |                                            html_filename="demo.html",
49 |                                            extra_fields=["hdbscan_labels"],
50 |                                            spacy_model="hr_core_news_sm",
51 |                                            max_distance=.5)
52 | ```
53 | 
54 | ## Custom UMAP and HDBSCAN Parameters
55 | It is often necessary to control how your embeddings are flattened with UMAP and clustered with HDBSCAN. As of 0.0.9, you can control these parameters with dictionaries.
56 | 
57 | ```python
58 | import pandas as pd
59 | from leet_topic import leet_topic
60 | 
61 | df = pd.read_json("data/vol7.json")
62 | leet_df, topic_data = leet_topic.LeetTopic(df,
63 |                                            document_field="descriptions",
64 |                                            html_filename="demo.html",
65 |                                            extra_fields=["names", "hdbscan_labels"],
66 |                                            umap_params={"n_neighbors": 15, "min_dist": 0.01, "metric": 'correlation'},
67 |                                            hdbscan_params={"min_samples": 10, "min_cluster_size": 5},
68 |                                            max_distance=.5)
69 | ```
70 | 
71 | ## Create an Annoy Index
72 | As of 0.0.10, users can also have the pipeline return an Annoy index.
73 | 
74 | ```python
75 | import pandas as pd
76 | from leet_topic import leet_topic
77 | 
78 | df = pd.read_json("data/vol7.json")
79 | leet_df, topic_data, annoy_index = leet_topic.LeetTopic(df, "descriptions",
80 |                                                         "demo.html",
81 |                                                         build_annoy=True)
82 | ```
83 | 
84 | With the Annoy index, one can easily build a semantic search engine. Query the index, for example, by encoding a new text with the same model used to embed the documents.
85 | 
86 | ```python
87 | import pandas as pd
88 | from leet_topic import leet_topic
89 | from sentence_transformers import SentenceTransformer
90 | 
91 | 
92 | model = SentenceTransformer('all-MiniLM-L6-v2')  # same encoding model used by the pipeline above
93 | 
94 | emb = model.encode("An individual who was arrested.")
95 | 
96 | res = annoy_index.get_nns_by_vector(emb, 10)  # annoy_index and df come from the earlier LeetTopic call
97 | 
98 | print(df.iloc[res].descriptions.tolist())
99 | 
100 | ```
101 | 
102 | 
103 | # Outputs
104 | The code above generates a new DataFrame with the UMAP projection (x, y), hdbscan_labels, leet_labels, and the top-n words for each document. It also outputs data about each topic, including the centroid of each topic, the documents assigned to it, and the top-n words associated with it.
105 | 
106 | Finally, the pipeline creates an HTML file that is a self-contained Bokeh application, like the image below.
107 | 
108 | ![demo](https://github.com/wjbmattingly/LeetTopic/raw/main/images/demo-new.JPG)
109 | 
110 | # Steps
111 | 
112 | LeetTopic takes an input DataFrame and converts the document field (the texts to model) into embeddings via a transformer model. Next, UMAP reduces the embeddings to 2 dimensions, and HDBSCAN assigns documents to topics. Like BERTopic and Top2Vec, at this stage there are many outliers (documents assigned to topic -1).
113 | 
114 | LeetTopic, like Top2Vec, then calculates the centroid of each topic from the HDBSCAN labels, ignoring topic -1 (the outliers). Next, each outlier document is assigned to the nearest topic centroid. Unlike Top2Vec, LeetTopic lets the user set a maximum distance, so outliers that are too far from every topic centroid are left unassigned. At the same time, the output DataFrame retains the original HDBSCAN labels, so users know whether a document was originally an outlier.
115 | 
116 | 
117 | 
118 | # Future Roadmap
119 | ## 0.0.9
120 | - Control UMAP parameters
121 | - Control HDBSCAN parameters
122 | - Multilingual support for lemmatization
123 | - Multilingual support for embedding
124 | - Add support for custom app titles
125 | 
126 | ## 0.0.10
127 | - Output an Annoy index so that the data can be queried
128 | 
129 | ## 0.0.11
130 | - Support for embedding text, images, or both via CLIP and displaying the results in the same Bokeh application
131 | 
--------------------------------------------------------------------------------
/images/LeeTopic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wjbmattingly/LeetTopic/fd0d71713c8570c53e3e3a0327b14a5183e20112/images/LeeTopic.png
--------------------------------------------------------------------------------
/images/demo-new.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wjbmattingly/LeetTopic/fd0d71713c8570c53e3e3a0327b14a5183e20112/images/demo-new.JPG
--------------------------------------------------------------------------------
/images/demo-search.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wjbmattingly/LeetTopic/fd0d71713c8570c53e3e3a0327b14a5183e20112/images/demo-search.jpg
--------------------------------------------------------------------------------
/images/demo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wjbmattingly/LeetTopic/fd0d71713c8570c53e3e3a0327b14a5183e20112/images/demo.png
--------------------------------------------------------------------------------
/images/leet-demo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wjbmattingly/LeetTopic/fd0d71713c8570c53e3e3a0327b14a5183e20112/images/leet-demo.png
--------------------------------------------------------------------------------
/images/leettopic-logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wjbmattingly/LeetTopic/fd0d71713c8570c53e3e3a0327b14a5183e20112/images/leettopic-logo.png
--------------------------------------------------------------------------------
/leet_topic/__init__.py:
--------------------------------------------------------------------------------
1 | from leet_topic.leet_topic import *
2 | 
--------------------------------------------------------------------------------
/leet_topic/__pycache__/leet_topic.cpython-38.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wjbmattingly/LeetTopic/fd0d71713c8570c53e3e3a0327b14a5183e20112/leet_topic/__pycache__/leet_topic.cpython-38.pyc
--------------------------------------------------------------------------------
/leet_topic/bokeh_app.py:
--------------------------------------------------------------------------------
1 | from bokeh.layouts import row, column
2 | from bokeh.models import ColumnDataSource, CustomJS, DataTable, TableColumn, MultiChoice, HTMLTemplateFormatter, TextAreaInput, Div, TextInput
3 | from bokeh.plotting import figure, output_file, show
4 | 
5 | from bokeh.palettes import Category10, Cividis256, Turbo256
6 | from bokeh.transform import linear_cmap, factor_cmap  # both color mappers are used in get_color_mapping below
7 | 
8 | import pandas as pd
9 | import numpy as np
10 | 
11 | from typing import Tuple, Optional
12 | import 
bokeh 13 | import bokeh.transform 14 | 15 | 16 | 17 | #From Bulk Library 18 | def get_color_mapping( 19 | df: pd.DataFrame, 20 | topic_field, 21 | ) -> Tuple[Optional[bokeh.transform.transform], pd.DataFrame]: 22 | """Creates a color mapping""" 23 | 24 | color_datatype = str(df[topic_field].dtype) 25 | if color_datatype == "object": 26 | df[topic_field] = df[topic_field].apply( 27 | lambda x: str(x) if not (type(x) == float and np.isnan(x)) else x 28 | ) 29 | all_values = list(df[topic_field].dropna().unique()) 30 | if len(all_values) == 2: 31 | all_values.extend([""]) 32 | elif len(all_values) > len(Category10) + 2: 33 | raise ValueError( 34 | f"Too many classes defined, the limit for visualisation is {len(Category10) + 2}. " 35 | f"Got {len(all_values)}." 36 | ) 37 | mapper = factor_cmap( 38 | field_name=topic_field, 39 | palette=Category10[len(all_values)], 40 | factors=all_values, 41 | nan_color="grey", 42 | ) 43 | elif color_datatype.startswith("float") or color_datatype.startswith("int"): 44 | all_values = df[topic_field].dropna().values 45 | mapper = linear_cmap( 46 | field_name=topic_field, 47 | palette=Turbo256, 48 | low=all_values.min(), 49 | high=all_values.max(), 50 | nan_color="grey", 51 | ) 52 | else: 53 | raise TypeError( 54 | f"We currently only support the following type for 'color' column: 'int*', 'float*', 'object'. " 55 | f"Got {color_datatype}." 56 | ) 57 | return mapper, df 58 | 59 | 60 | def create_html(df, document_field, topic_field, html_filename, topic_data, tf_idf, extra_fields=[], app_name=""): 61 | fields = ["x", "y", document_field, topic_field, "selected"] 62 | fields = fields+extra_fields 63 | output_file(html_filename) 64 | 65 | mapper, df = get_color_mapping(df, topic_field) 66 | df['selected'] = False 67 | categories = df[topic_field].unique() 68 | categories = [str(x) for x in categories] 69 | 70 | 71 | 72 | s1 = ColumnDataSource(df) 73 | 74 | 75 | columns = [ 76 | TableColumn(field=topic_field, title=topic_field, width=10), 77 | TableColumn(field=document_field, title=document_field, width=500), 78 | ] 79 | for field in extra_fields: 80 | columns.append(TableColumn(field=field, title=field, width=100)) 81 | 82 | 83 | p1 = figure(width=500, height=500, tools="pan,tap,wheel_zoom,lasso_select,box_zoom,box_select,reset", active_scroll="wheel_zoom", title="Select Here", x_range=(df.x.min(), df.x.max()), y_range=(df.y.min(), df.y.max())) 84 | circle_kwargs = {"x": "x", "y": "y", 85 | "size": 3, 86 | "source": s1, 87 | "color": mapper 88 | } 89 | scatter = p1.circle(**circle_kwargs) 90 | 91 | s2 = ColumnDataSource(data=dict(x=[], y=[], leet_labels=[])) 92 | p2 = figure(width=500, height=500, tools="pan,tap,lasso_select,wheel_zoom,box_zoom,box_select,reset", active_scroll="wheel_zoom", title="Analyze Selection", x_range=(df.x.min(), df.x.max()), y_range=(df.y.min(), df.y.max())) 93 | 94 | circle_kwargs2 = {"x": "x", "y": "y", 95 | "size": 3, 96 | "source": s2, 97 | "color": mapper 98 | } 99 | scatter2 = p2.circle(**circle_kwargs2) 100 | 101 | multi_choice = MultiChoice(value=[], options=categories, width = 500, title='Selection:') 102 | data_table = DataTable(source=s2, 103 | columns=columns, 104 | width=700, 105 | height=500, 106 | sortable=True, 107 | autosize_mode='none') 108 | selected_texts = TextAreaInput(value = "", title = "Selected texts", width = 700, height=500) 109 | top_search_results = TextAreaInput(value = "", title = "Search Results", width = 250, height=500) 110 | top_search = TextInput(title="Topic Search") 111 | doc_search_results = 
TextAreaInput(value = "", title = "Search Results", width = 250, height=500) 112 | doc_search = TextInput(title="Document Search") 113 | topic_desc = TextAreaInput(value = "", title = "Topic Descriptions", width = 500, height=500) 114 | 115 | def field_string(field): 116 | return """d2['"""+field+"""'] = []\n""" 117 | 118 | def push_string(field): 119 | return """d2['"""+field+"""'].push(d1['"""+field+"""'][inds[i]])\n""" 120 | 121 | def indices_string(field): 122 | return """d2['"""+field+"""'].push(d1['"""+field+"""'][s1.selected.indices[i]])\n""" 123 | 124 | def push_string2(field): 125 | return """d2['"""+field+"""'].push(d1['"""+field+"""'][i])\n""" 126 | 127 | def list_creator(fields, str_type=""): 128 | main_str = "" 129 | for field in fields: 130 | if str_type == "field": 131 | main_str=main_str+field_string(field) 132 | elif str_type == "push": 133 | main_str=main_str+push_string(field) 134 | elif str_type == "indices": 135 | main_str=main_str+indices_string(field) 136 | elif str_type == "push2": 137 | main_str=main_str+push_string2(field) 138 | return main_str 139 | 140 | s1.selected.js_on_change('indices', CustomJS(args=dict(s1=s1, s2=s2, s4=multi_choice), code=""" 141 | const inds = cb_obj.indices; 142 | const d1 = s1.data; 143 | const d2 = s2.data; 144 | const d4 = s4;"""+list_creator(fields=fields, str_type="field")+ 145 | """for (let i = 0; i < inds.length; i++) {"""+ 146 | list_creator(fields=fields, str_type="push")+ 147 | """} 148 | const res = [...new Set(d2['"""+topic_field+"""'])]; 149 | d4.value = res.map(function(e){return e.toString()}); 150 | s1.change.emit(); 151 | s2.change.emit(); 152 | """) 153 | ) 154 | 155 | 156 | multi_choice.js_on_change('value', CustomJS(args=dict(s1=s1, s2=s2, scatter=scatter, topic_desc=topic_desc, topic_data=topic_data, tf_idf=tf_idf), code=""" 157 | let values = cb_obj.value; 158 | let unchange_values = cb_obj.value; 159 | const d1 = s1.data; 160 | const d2 = s2.data; 161 | const plot = scatter; 162 | s2.selected.indices = []; 163 | for (let i = 0; i < s1.selected.indices.length; i++) { 164 | for (let j =0; j < values.length; j++) { 165 | if (d1."""+topic_field+"""[s1.selected.indices[i]] == values[j]) { 166 | values = values.filter(item => item !== values[j]); 167 | } 168 | } 169 | } 170 | """+list_creator(fields=fields, str_type="field")+ 171 | """ 172 | for (let i = 0; i < s1.selected.indices.length; i++) { 173 | if (unchange_values.includes(String(d1."""+topic_field+"""[s1.selected.indices[i]]))) { 174 | """+ 175 | list_creator(fields=fields, str_type="indices")+ 176 | """ 177 | } 178 | } 179 | for (let i = 0; i < d1."""+topic_field+""".length; i++) { 180 | if (values.includes(String(d1."""+topic_field+"""[i]))) { 181 | """+ 182 | list_creator(fields=fields, str_type="push2")+ 183 | """ 184 | } 185 | } 186 | if (tf_idf) { 187 | let data = []; 188 | for (const key of Object.keys(topic_data)) { 189 | for (let i=0; i < unchange_values.length; i++) { 190 | if (key == unchange_values[i]) { 191 | let keywords = topic_data[key]["key_words"]; 192 | data.push("Topic " + key + ": "); 193 | for (let i=0; i < keywords.length; i++) { 194 | data.push(keywords[i][0] + " " + keywords[i][1]); 195 | } 196 | data.push("\\r\\n"); 197 | } 198 | } 199 | } 200 | topic_desc.value = data.join("\\r\\n"); 201 | s2.change.emit(); 202 | } 203 | """) 204 | ) 205 | 206 | 207 | s2.selected.js_on_change('indices', CustomJS(args=dict(s1=s1, s2=s2, s_texts=selected_texts), code=""" 208 | const inds = cb_obj.indices; 209 | const d1 = s1.data; 210 | const d2 = 
s2.data; 211 | const texts = s_texts.value; 212 | s_texts.value = ""; 213 | const data = []; 214 | for (let i = 0; i < inds.length; i++) { 215 | data.push(" (Topic: " + d2['"""+topic_field+"""'][inds[i]] + ")") 216 | data.push("Document: " + d2['"""+document_field+"""'][inds[i]]) 217 | data.push("\\r\\n") 218 | } 219 | s2.change.emit(); 220 | s_texts.value = data.join("\\r\\n") 221 | s_texts.change.emit(); 222 | """) 223 | ) 224 | 225 | top_search.js_on_change('value', CustomJS(args=dict(topic_data=topic_data, top_search_results=top_search_results, s4=multi_choice, s1=s1), code=""" 226 | s1.selected.indices = [] 227 | const search_term = cb_obj.value; 228 | let hits = []; 229 | let counter = 0; 230 | for (const key of Object.keys(topic_data)) { 231 | const keywords = topic_data[key]["key_words"]; 232 | for (let i=0; i < keywords.length; i++) { 233 | if (keywords[i][0] == search_term) { 234 | hits.push([key, i]); 235 | } 236 | } 237 | } 238 | hits.sort(function(a, b) { 239 | return a[1] - b[1]; 240 | }); 241 | 242 | const data = []; 243 | if (hits.length) { 244 | for (let i = 0; i < hits.length; i++) { 245 | data.push('Topic ' + hits[i][0] + ' has "' + search_term + '" as number ' + hits[i][1] + ' in its keyword list.'); 246 | data.push("\\r\\n"); 247 | } 248 | } else if (search_term != "") { 249 | data.push('No keyword matches with any topic for "' + search_term + '".'); 250 | } 251 | 252 | top_search_results.value = data.join("\\r\\n"); 253 | 254 | let inds = []; 255 | for (let i=0; i < hits.length; i++) { 256 | inds.push(hits[i][0]); 257 | } 258 | 259 | const res = [...new Set(inds)]; 260 | 261 | s4.value = res.map(function(e){return e.toString()}); 262 | 263 | """) 264 | ) 265 | 266 | doc_search.js_on_change('value', CustomJS(args=dict(s1=s1, s2=s2, df=df.to_dict(), doc_search_results=doc_search_results, s4=multi_choice), code=""" 267 | s1.selected.indices = [] 268 | const search_term = cb_obj.value; 269 | let hits = []; 270 | let counter = 0; 271 | let id_count = 0; 272 | for (let i = 0; i < s1.data.top_words.length; i++) { 273 | for (let j = 0; j {app_name}') 333 | layout = column(title, app_row, sizing_mode='scale_width') 334 | else: 335 | layout=app_row 336 | show(layout) -------------------------------------------------------------------------------- /leet_topic/leet_topic.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from sentence_transformers import SentenceTransformer 3 | import umap 4 | import hdbscan 5 | import math 6 | import spacy 7 | from sklearn.feature_extraction.text import TfidfVectorizer 8 | import numpy as np 9 | import string 10 | import logging 11 | import warnings 12 | from annoy import AnnoyIndex 13 | 14 | from .bokeh_app import create_html 15 | 16 | warnings.filterwarnings("ignore") 17 | logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO) 18 | 19 | 20 | def create_labels(df, document_field, encoding_model, 21 | umap_params={"n_neighbors": 50, "min_dist": 0.01, "metric": 'correlation'}, 22 | hdbscan_params={"min_samples": 10, "min_cluster_size": 50}, 23 | doc_embeddings=None): 24 | # print(type(doc_embeddings)) 25 | if str(type(doc_embeddings)) == "": 26 | #Load Transformer Model 27 | model = SentenceTransformer(encoding_model) 28 | 29 | #Create Document Embeddings 30 | logging.info("Encoding Documents") 31 | doc_embeddings = model.encode(df[document_field]) 32 | logging.info("Saving Embeddings") 33 | np.save("embeddings", doc_embeddings) 34 | 35 | #Create UMAP Projection 36 
| logging.info("Creating UMAP Projections") 37 | umap_proj = umap.UMAP(**umap_params).fit_transform(doc_embeddings) 38 | 39 | #Create HDBScan Label 40 | logging.info("Finding Clusters with HDBScan") 41 | hdbscan_labels = hdbscan.HDBSCAN(**hdbscan_params).fit_predict(umap_proj) 42 | df["x"] = umap_proj[:,0] 43 | df["y"] = umap_proj[:,1] 44 | df["hdbscan_labels"] = hdbscan_labels 45 | 46 | return df, doc_embeddings 47 | 48 | def find_centers(df): 49 | #Get coordinates for each document in each topic 50 | topic_data = {} 51 | for i, topic in enumerate(df.hdbscan_labels.tolist()): 52 | if topic != -1: 53 | if topic not in topic_data: 54 | topic_data[topic] = {"center": [], "coords": []} 55 | topic_data[topic]["coords"].append((df["x"][i], df["y"][i])) 56 | 57 | #Calculate the center of the topic 58 | for topic, data in topic_data.items(): 59 | x = [coord[0] for coord in data["coords"]] 60 | y = [coord[1] for coord in data["coords"]] 61 | c = (x, y) 62 | topic_data[topic]["center"] = (sum(c[0])/len(c[0]),sum(c[1])/len(c[1])) 63 | return topic_data 64 | 65 | def get_leet_labels(df, topic_data, max_distance): 66 | # Get New Topic Numbers 67 | leet_labels = [] 68 | for i, topic in enumerate(df.hdbscan_labels.tolist()): 69 | if topic == -1: 70 | closest = -1 71 | distance = max_distance 72 | for topic_num, coords in topic_data.items(): 73 | center = coords["center"] 74 | current_distance = math.dist(center, (df["x"][i], df["y"][i])) 75 | if current_distance < max_distance and current_distance < distance: 76 | closest = topic_num 77 | distance = current_distance 78 | leet_labels.append(closest) 79 | else: 80 | leet_labels.append(topic) 81 | df["leet_labels"] = leet_labels 82 | logging.info(f"{df.hdbscan_labels.tolist().count(-1)} Outliers reduced to {leet_labels.count(-1)}") 83 | return df 84 | 85 | 86 | def create_tfidf(df, topic_data, document_field, spacy_model): 87 | nlp = spacy.load(spacy_model, disable=["ner", "attribute_ruler", "tagger", "parser"]) 88 | lemma_docs = [" ".join([token.lemma_.lower() for token in nlp(text) if token.text not in string.punctuation]) for text in df[document_field].tolist()] 89 | 90 | vectorizer = TfidfVectorizer(stop_words="english") 91 | vectors = vectorizer.fit_transform(lemma_docs) 92 | feature_names = vectorizer.get_feature_names_out() 93 | dense = vectors.todense() 94 | denselist = dense.tolist() 95 | tfidf_df = pd.DataFrame(denselist, columns=feature_names) 96 | 97 | top_n = 10 98 | tfidf_words = [] 99 | for vector in vectors: 100 | top_words = (sorted(list(zip(vectorizer.get_feature_names_out(), vector.sum(0).getA1())), key=lambda x: x[1], reverse=True)[:top_n]) 101 | tfidf_words.append(top_words) 102 | df["top_words"] = tfidf_words 103 | 104 | if df.leet_labels.tolist().count(-1) > 0: 105 | topic_data[-1] = {} 106 | for leet_label, lemmas in zip(df.leet_labels.tolist(), lemma_docs): 107 | if "doc_lemmas" not in topic_data[leet_label]: 108 | topic_data[leet_label]["doc_lemmas"] = [] 109 | topic_data[leet_label]["doc_lemmas"].append(lemmas) 110 | 111 | for leet_label, data in topic_data.items(): 112 | # Apply the transformation using the already fitted vectorizer 113 | X = vectorizer.transform(data["doc_lemmas"]) 114 | words = (sorted(list(zip(vectorizer.get_feature_names_out(), X.sum(0).getA1())), key=lambda x: x[1], reverse=True)[:top_n]) 115 | topic_data[leet_label]["key_words"] = words 116 | 117 | return df, topic_data 118 | 119 | 120 | def calculate_topic_relevance(df, topic_data): 121 | rel2topic = [] 122 | for idx, row in df.iterrows(): 123 | topic_num 
= row.leet_labels 124 | if topic_num != -1: 125 | if "relevance_docs" not in topic_data[topic_num]: 126 | topic_data[topic_num]["relevance_docs"] = [] 127 | score = math.dist(topic_data[topic_num]["center"], (row["x"], row["y"])) 128 | rel2topic.append(score) 129 | topic_data[topic_num]["relevance_docs"].append((idx, score)) 130 | else: 131 | rel2topic.append((idx, 0)) 132 | for topic_num, data in topic_data.items(): 133 | if topic_num != -1: 134 | data["relevance_docs"].sort(key = lambda x: x[1]) 135 | data["relevance_docs"].reverse() 136 | return df, topic_data 137 | 138 | 139 | def download_spacy_model(spacy_model): 140 | try: 141 | nlp = spacy.load(spacy_model) 142 | except OSError: 143 | print(f'Downloading language model ({spacy_model}) for the spaCy POS tagger\n' 144 | "(don't worry, this will only happen once)") 145 | from spacy.cli import download 146 | download(spacy_model) 147 | 148 | def create_annoy(doc_embeddings, 149 | annoy_filename="annoy_index.ann", 150 | annoy_branches=10, 151 | annoy_metric="angular" 152 | ): 153 | 154 | t = AnnoyIndex(doc_embeddings.shape[1], annoy_metric) 155 | for idx, embedding in enumerate(doc_embeddings): 156 | t.add_item(idx, embedding) 157 | 158 | t.build(annoy_branches) 159 | if ".ann" not in annoy_filename: 160 | annoy_filename = annoy_filename+".ann" 161 | t.save(annoy_filename) 162 | 163 | return t 164 | 165 | 166 | def LeetTopic(df: pd.DataFrame, 167 | document_field: str, 168 | html_filename: str, 169 | extra_fields=[], 170 | max_distance=.5, 171 | tf_idf = False, 172 | spacy_model="en_core_web_sm", 173 | encoding_model='all-MiniLM-L6-v2', 174 | save_embeddings=True, 175 | doc_embeddings = None, 176 | umap_params={"n_neighbors": 50, "min_dist": 0.01, "metric": 'correlation'}, 177 | hdbscan_params={"min_samples": 10, "min_cluster_size": 50}, 178 | app_name="", 179 | build_annoy=False, 180 | annoy_filename="annoy_index.ann", 181 | annoy_branches=10, 182 | annoy_metric="angular" 183 | ): 184 | """ 185 | Parameters 186 | ---------- 187 | df: pd.DataFrame 188 | DataFrame that contains at least one field that are the documents you wish to model 189 | document_field: str 190 | a string that is the name of the column in which the documents in the DataFrame sit 191 | html_filename: str 192 | the name of the html file that will be created by the LeetTopic pipeline 193 | extra_fields: list of str (Optional) 194 | These are the names of the columns you wish to include in the Bokeh application. 
195 | max_distance: float (Optional default .5) 196 | The maximum distance an outlier document can be to the nearest topic vector to be assigned 197 | spacy_model: str (Optional default en_core_web_sm) 198 | the spaCy language model you will use for lemmatization 199 | encoding_model: str (Optional default all-MiniLM-L6-v2) 200 | the sentence transformers model that you wish to use to encode your documents 201 | umap_params: dict (Optional default {"n_neighbors": 50, "min_dist": 0.01, "metric": 'correlation'}) 202 | dictionary of keys to UMAP params and values for those params 203 | hdbscan_params: dict (Optional default {"min_samples": 10, "min_cluster_size": 50}) 204 | dictionary of keys to HBDscan params and values for those params 205 | app_name: str (Optional) 206 | title of your Bokeh application 207 | Returns 208 | ---------- 209 | df: pd.DataFrame 210 | This is the new dataframe that contains the metadata generated from the LeetTopic pipeline 211 | topic_data: dict 212 | This is topic-centric data generated by the LeetTopic pipeline 213 | """ 214 | 215 | download_spacy_model(spacy_model) 216 | 217 | df, doc_embeddings = create_labels(df, document_field, 218 | encoding_model, doc_embeddings=doc_embeddings, 219 | umap_params=umap_params, hdbscan_params=hdbscan_params) 220 | logging.info("Calculating the Center of the Topic Clusters") 221 | topic_data = find_centers(df) 222 | logging.info(f"Recalculating clusters based on a max distance of {max_distance} from any topic vector") 223 | df = get_leet_labels(df, topic_data, max_distance) 224 | 225 | 226 | if tf_idf==True: 227 | logging.info("Creating TF-IDF representation for documents") 228 | df, topic_data = create_tfidf(df, topic_data, document_field, spacy_model) 229 | 230 | logging.info("Creating Topic Relevance") 231 | df, topic_data = calculate_topic_relevance(df, topic_data) 232 | 233 | logging.info("Generating custom Bokeh application") 234 | create_html(df, 235 | document_field=document_field, 236 | topic_field="leet_labels", 237 | html_filename=html_filename, 238 | topic_data=topic_data, 239 | tf_idf=tf_idf, 240 | extra_fields=extra_fields, 241 | app_name=app_name, 242 | ) 243 | df = df.drop("selected", axis=1) 244 | 245 | if build_annoy == True: 246 | logging.info(f"Building an Annoy Index and saving it to {annoy_filename}") 247 | annoy_index = create_annoy(doc_embeddings, 248 | annoy_filename=annoy_filename, 249 | annoy_branches=annoy_branches, 250 | annoy_metric=annoy_metric) 251 | return df, topic_data, annoy_index 252 | 253 | return df, topic_data -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | with open("README.md", "r", encoding="utf-8") as f: 4 | LONG_DESCRIPTION = f.read() 5 | 6 | VERSION = '0.0.10' 7 | DESCRIPTION = 'A new transformer-based topic modeling library.' 
8 | 9 | setup( 10 | name="leet_topic", 11 | author="WJB Mattingly, Joel Lee", 12 | version=VERSION, 13 | description=DESCRIPTION, 14 | long_description=LONG_DESCRIPTION, 15 | long_description_content_type='text/markdown', 16 | packages=find_packages(), 17 | install_requires=["pandas>=1.0.0,<2.0.0", 18 | "bokeh>=2.4.0, <2.4.3", 19 | "sentence_transformers>=2.0.0", 20 | "umap-learn>=0.5.0", 21 | "hdbscan>=0.8.0", 22 | "protobuf>=4.24.2", 23 | "wrapt==1.14.0", 24 | "tensorflow>=2.8.0", 25 | "spacy>=3.3.0", 26 | "gensim>=4.3.0", 27 | "annoy>=1.17.0" 28 | ], 29 | ) 30 | --------------------------------------------------------------------------------