├── .gitignore
├── LICENSE
├── Presentation.pdf
├── Presentation.pptx
├── README.md
├── ami.png
├── clusterUsersUniversalSentenceEncoder.py
├── demo.ipynb
├── ed.png
├── methodology_diagram.png
├── src
    ├── AR_STOPWORDS.pkl
    ├── __pycache__
    │   ├── clustering.cpython-36.pyc
    │   ├── encoder.cpython-36.pyc
    │   ├── preprocessing.cpython-36.pyc
    │   ├── projection.cpython-36.pyc
    │   └── top_terms.cpython-36.pyc
    ├── clustering.py
    ├── encoder.py
    ├── mutual_information.py
    ├── preprocessing.py
    ├── projection.py
    ├── top_terms.py
    └── turkish_normalizer.py
├── trials
    ├── 0.0_30.png
    ├── 0.1_60.png
    └── hm.png
└── wc.png


/.gitignore:
--------------------------------------------------------------------------------
1 | .idea/
2 | .ipynb_checkpoints/
3 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2025 Ammar Rashed
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/Presentation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/Presentation.pdf


--------------------------------------------------------------------------------
/Presentation.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/Presentation.pptx


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Embeddings-Based Unsupervised Stance Detection
  2 | 
  3 | This repository contains the implementation of an unsupervised method for target-specific stance detection using embeddings-based clustering, as presented in our ICWSM 2021 paper.
  4 | 
  5 | ## Publications
  6 | 
  7 | - **Paper (ICWSM'21)**: [Embeddings-Based Clustering for Target Specific Stances: The Case of a Polarized Turkey](https://ojs.aaai.org/index.php/ICWSM/article/view/18082)
  8 | - **Paper Presentation**: [PaperTalk ICWSM'21](https://papertalk.org/papertalks/31537)
  9 | - **Thesis (MSc August 2020)**: [Embeddings-Based Clustering For Target Specific Stances](https://tez.yok.gov.tr/UlusalTezMerkezi/TezGoster?key=fl0Kw4p1rmMDotyKRdYv1AZv-bsnninllPXAXfoe9S1sXEDBPXspE5WeUtqcCjlk)
 10 | 
 11 | ## Overview
 12 | 
 13 | We propose an unsupervised method for stance detection that can capture fine-grained divergences across various topics in polarized communities. Our approach overcomes the limitations of previous methods by:
 14 | 
 15 | - Not requiring platform-specific features (like retweets)
 16 | - Working effectively with limited data
 17 | - Supporting hierarchical clustering without specifying the number of clusters
 18 | - Using pre-trained language models to handle morphologically rich languages
 19 | 
 20 | ## Methodology
 21 | 
 22 | The method consists of five main steps:
 23 | 
 24 | 1. **Data Collection**: Collect tweets related to specific topics or targets
 25 | 2. **Feature Extraction**: Encode tweets using pre-trained universal sentence encoders
 26 | 3. **User Representation**: Average tweet vectors per user to create user embeddings
 27 | 4. **Projection**: Project user vectors to a lower-dimensional space using UMAP
 28 | 5. **Clustering**: Cluster the projected vectors using HDBSCAN
 29 | 
 30 | ![Methodology Diagram](methodology_diagram.png)
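
A minimal sketch of steps 3 to 5, loosely following `cluster_users` in `clusterUsersUniversalSentenceEncoder.py`. The helper names and `toy_encoder` are hypothetical stand-ins introduced only for illustration; a real run would pass a sentence encoder loaded with `tensorflow_hub` instead.

```python
import numpy as np


def average_user_vectors(user_ids, tweets, encoder, min_tweets=3):
    """Step 3: average the tweet vectors of each user into one user vector."""
    by_user = {}
    for user, tweet in zip(user_ids, tweets):
        by_user.setdefault(user, []).append(tweet)

    users, vectors = [], []
    for user, user_tweets in by_user.items():
        if len(user_tweets) < min_tweets:  # skip users with too little text
            continue
        tweet_vectors = np.asarray(encoder(user_tweets))  # (n_tweets, dim)
        users.append(user)
        vectors.append(tweet_vectors.mean(axis=0))
    return users, np.array(vectors)


def project_and_cluster(vectors, n_neighbors=90, min_dist=0.0):
    """Steps 4-5: UMAP projection followed by HDBSCAN clustering."""
    # Imported lazily so the averaging step can be tried without these packages.
    from hdbscan import HDBSCAN
    from umap import UMAP

    projected = UMAP(n_components=2, n_neighbors=n_neighbors, min_dist=min_dist,
                     metric="cosine", random_state=42).fit_transform(vectors)
    clusterer = HDBSCAN(min_cluster_size=max(10, len(projected) // 100),
                        min_samples=max(10, len(projected) // 1000))
    labels = clusterer.fit(projected).labels_
    return projected, labels  # label -1 marks users left unclustered


def toy_encoder(texts):
    """Hypothetical encoder: maps each text to (length, vowel count)."""
    return [[len(t), sum(ch in "aeiou" for ch in t)] for t in texts]


users, vectors = average_user_vectors(
    ["u1", "u1", "u1", "u2"], ["aa", "bbbb", "ab", "zzz"], toy_encoder)
print(users, vectors)
```

Only `u1` survives the `min_tweets=3` filter here. `project_and_cluster` is not called in the sketch because UMAP with `n_neighbors=90` needs far more users than a toy example provides.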
 31 | 
 32 | ## Key Features
 33 | 
 34 | ### Fine-grained Stance Detection
 35 | 
 36 | Our method can automatically detect stances down to the party-affiliation level in a completely unsupervised manner, outperforming previous approaches.
 37 | 
 38 | ![Fine-grained Stance Detection](ed.png)
 39 | 
 40 | ### Cross-Topic Mutual Information
 41 | 
 42 | Using our clustering method, we can analyze the correlations between user stances across different topics, allowing for deeper insight into the structure of polarization.
 43 | 
 44 | ![Mutual Information Heatmap](ami.png)
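
To illustrate the idea, the sketch below computes plain (unadjusted) mutual information between the cluster labels of the same users on two topics. It is an assumption-laden illustration, not the code in `src/mutual_information.py`; the label lists are made up, and a chance-corrected variant such as `sklearn.metrics.adjusted_mutual_info_score` may be closer to what the heatmap reports.

```python
from collections import Counter
from math import log


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same users."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a, b) * log(p(a, b) / (p(a) * p(b))), written in terms of counts
        mi += (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical cluster labels of the same eight users on three topics.
topic_1 = [0, 0, 0, 0, 1, 1, 1, 1]
topic_2 = [1, 1, 1, 1, 0, 0, 0, 0]  # same split as topic_1, labels swapped
topic_3 = [0, 0, 1, 1, 0, 0, 1, 1]  # split unrelated to topic_1

mi_aligned = mutual_information(topic_1, topic_2)
mi_unrelated = mutual_information(topic_1, topic_3)
print(mi_aligned, mi_unrelated)
```

A high value means that knowing a user's cluster on one topic tells you their cluster on the other. In practice the comparison would be restricted to users who were clustered (label `>= 0`) on both topics.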
 45 | 
 46 | ### Semantic Analysis Between Clusters
 47 | 
 48 | We identify the most prominent terms in each cluster to show how different groups talk about the same issues in different contexts, revealing semantic divergences between polarized groups.
 49 | 
 50 | ![Word Clouds of Prominent Terms](wc.png)
 51 | 
 52 | ## Performance
 53 | 
 54 | Our method achieves:
 55 | - 90% precision in identifying user stances
 56 | - Over 80% recall
 57 | - Competitive performance with supervised methods, while being completely unsupervised
 58 | - Ability to detect fine-grained sub-groups that previous methods couldn't identify
 59 | 
 60 | ## Installation
 61 | 
 62 | ```bash
 63 | # Clone this repository
 64 | git clone https://github.com/AmmarRashed/UnsupervisedStanceDetection.git
 65 | cd UnsupervisedStanceDetection
 66 | 
 67 | # Create and activate a virtual environment (recommended)
 68 | python -m venv venv
 69 | source venv/bin/activate  # On Windows: venv\Scripts\activate
 70 | 
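 71 | # Install dependencies (no requirements.txt is shipped; versions follow the Requirements list below)
 72 | pip install umap-learn==0.3.10 hdbscan==0.8.26 tensorflow-hub==0.8.0 tensorflow-text==2.2.1 matplotlib numpy pandas tqdm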
 73 | ```
 74 | 
 75 | ### Requirements
 76 | 
 77 | > **Note**: This work was tested using specific versions of packages. Newer versions might not work as expected.
 78 | 
 79 | - [umap-learn 0.3.x](https://pypi.org/project/umap-learn/0.3.10/)
 80 | - [hdbscan 0.8.x](https://pypi.org/project/hdbscan/0.8.26/)
 81 | - [tensorflow-hub 0.8.x](https://pypi.org/project/tensorflow-hub/0.8.0/)
 82 | - [tensorflow-text 2.2.x](https://pypi.org/project/tensorflow-text/2.2.1/)
 83 | - matplotlib
 84 | - numpy
 85 | - pandas
 86 | - tqdm
 87 | 
 88 | ## Usage
 89 | 
 90 | ```bash
 91 | # Basic usage example
 92 | python clusterUsersUniversalSentenceEncoder.py your_data.tsv
 93 | ```
 94 | 
 95 | The input file should be a tab-separated file with:
 96 | - First column: UserIDs
 97 | - Second column: Tweets
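
As a small illustration of this layout, the sketch below parses a made-up TSV snippet and checks which users have enough tweets to be clustered. The sample rows are hypothetical; the threshold of 3 is the default `min_tweets` of `cluster_users` in the script.

```python
import csv
import io
from collections import Counter

# Hypothetical file contents: one "UserID<TAB>Tweet" pair per line, no header.
sample_tsv = (
    "1001\tfirst tweet from user 1001\n"
    "1001\tsecond tweet from user 1001\n"
    "1001\tthird tweet from user 1001\n"
    "1002\tonly tweet from user 1002\n"
)

rows = list(csv.reader(io.StringIO(sample_tsv), delimiter="\t",
                       quoting=csv.QUOTE_NONE))
tweets_per_user = Counter(user for user, _tweet in rows)

MIN_TWEETS = 3  # default of cluster_users(min_tweets=...)
kept_users = sorted(u for u, n in tweets_per_user.items() if n >= MIN_TWEETS)
print(kept_users)
```

Users below the threshold are silently skipped by `cluster_users`, so a file with many one-tweet users yields far fewer points than it has distinct IDs.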
 98 | 
 99 | ### Code Sample
100 | 
101 | ```python
102 | from clusterUsersUniversalSentenceEncoder import cluster_users, plot_clusters_no_labels
103 | import tensorflow_hub as hub
104 | import pandas as pd
105 | 
106 | # Load the universal sentence encoder
107 | embed = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')
108 | 
109 | # Load and prepare your data
110 | df_text = pd.read_csv('your_data.tsv', header=None, usecols=[0, 1], sep='\t', dtype=str)  # dtype=str keeps numeric IDs usable with .str.strip()
111 | df_text.columns = ['User', 'Text']
112 | df_text = df_text.apply(lambda s: s.str.strip())
113 | 
114 | # Cluster users based on their tweets
115 | cluster_users(df_text, embed, user_col='User', tweet_col='Text', save_at='results.npz')
116 | 
117 | # Visualize the clusters
118 | plot_clusters_no_labels('results.npz.cluster')
119 | ```
120 | 
121 | ## Customization Options
122 | 
123 | The method can be customized with different parameters:
124 | 
125 | - **Sentence Encoder**: Different pre-trained models can be used (multilingual, transformer-based, etc.)
126 | - **UMAP Parameters**: Adjust `min_dist` and `n_neighbors` to control projection characteristics
127 | - **HDBSCAN Parameters**: Modify `min_cluster_size` and `min_samples` to control clustering sensitivity
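
When `min_cluster_size` and `min_samples` are not given, `cluster_embeddings` in the script derives them from the number of users. The sketch below mirrors that fallback rule so its effect can be inspected before a run; `default_hdbscan_params` is a hypothetical helper name, not part of the repository.

```python
def default_hdbscan_params(n_users, min_cluster_size_div=100, min_samples_div=1000):
    """Mirror the fallback sizes used when no explicit values are passed."""
    min_cluster_size = max(10, n_users // min_cluster_size_div)
    min_samples = max(10, n_users // min_samples_div)
    return min_cluster_size, min_samples


for n_users in (500, 5_000, 50_000):
    print(n_users, default_hdbscan_params(n_users))
```

Both values have a floor of 10, so on small datasets they stay fixed and only grow once there are more than 1,000 (cluster size) or 10,000 (samples) users. The UMAP defaults in `cluster_users` are `min_dist=0.0` and `n_neighbors=90`.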
128 | 
129 | ## Applications
130 | 
131 | This method has been successfully applied to:
132 | - Political polarization analysis
133 | - Election stance detection
134 | - Sports fan sentiment analysis
135 | - Cross-cultural stance detection
136 | 
137 | ## Citation
138 | 
139 | If you use this code in your research, please cite our paper:
140 | 
141 | ```
142 | Rashed, A., Kutlu, M., Darwish, K., Elsayed, T., & Bayrak, C. (2021). Embeddings-Based Clustering for Target Specific Stances: The Case of a Polarized Turkey. Proceedings of the International AAAI Conference on Web and Social Media, 15(1), 537-548. https://doi.org/10.1609/icwsm.v15i1.18082
143 | ```
144 | 
145 | BibTeX format:
146 | 
147 | ```bibtex
148 | @article{rashed2021embeddings,
149 |   title={Embeddings-Based Clustering for Target Specific Stances: The Case of a Polarized Turkey},
150 |   author={Rashed, Ammar and Kutlu, Mucahid and Darwish, Kareem and Elsayed, Tamer and Bayrak, Cansın},
151 |   journal={Proceedings of the International AAAI Conference on Web and Social Media},
152 |   volume={15},
153 |   number={1},
154 |   pages={537--548},
155 |   year={2021},
156 |   doi={10.1609/icwsm.v15i1.18082}
157 | }
158 | ```
159 | 
160 | ## Contributing
161 | 
162 | Contributions are welcome! Please feel free to submit a Pull Request.
163 | 
164 | ## License
165 | 
166 | This project is licensed under the MIT License - see the LICENSE file for details.
167 | 
168 | ## Contact
169 | 
170 | - Ammar Rashed (ammar.rasid@ozu.edu.tr)
171 | - Kareem Darwish (kdarwish@hbku.edu.qa)
172 | 


--------------------------------------------------------------------------------
/ami.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/ami.png


--------------------------------------------------------------------------------
/clusterUsersUniversalSentenceEncoder.py:
--------------------------------------------------------------------------------
  1 | ###############################################################################
  2 | #   Code written by Ammar Rashid (Özyeğin University)
  3 | #   ammar.rasid@ozu.edu.tr
  4 | #   and modified by Kareem Darwish (Qatar Computing Research Institute)
  5 | #   kdarwish@hbku.edu.qa
  6 | #   The code is provided for research purposes ONLY
  7 | ###############################################################################
  8 | 
  9 | ###############################################################################
 10 | # sys.argv[1] is a tab separated file with first column containing UserIDs
 11 | # and second column containing tweets
 12 | ###############################################################################
 13 | # there are many options for the universal sentence encoder including multilingual
 14 | # models, Transformer model (slow), and CNN model (fast)
 15 | # check out: https://tfhub.dev/google/universal-sentence-encoder/1
 16 | # for options
 17 | ###############################################################################
 18 | 
 19 | import ntpath
 20 | import sys
 21 | from typing import Callable
 22 | 
 23 | import matplotlib.pyplot as plt
 24 | import numpy as np
 25 | import pandas as pd
 26 | import tensorflow_hub as hub
 27 | from hdbscan import HDBSCAN
 28 | from tqdm import tqdm
 29 | from umap import UMAP
 30 | 
 31 | 
 32 | def cluster_users(df, encoder: Callable, min_tweets=3, user_col="username",
 33 |                   tweet_col="norm_tweet", save_at="temp.npz",
 34 |                   min_dist=0.0, n_neighbors=90, **kwargs):
 35 |     gs = df.groupby(user_col)
 36 |     users = list()
 37 |     vectors = list()
 38 |     for user, frame in tqdm(gs):
 39 |         if len(frame) < min_tweets:
 40 |             continue
 41 |         try:
 42 |             tweets = frame[tweet_col]
 43 |             vec = np.mean(np.array(encoder(tweets.tolist())), axis=0)
 44 |             users.append(user)
 45 |             vectors.append(vec)
 46 |         except Exception as e:
 47 |             print(f"ERROR at:{user}")
 48 |             print(e)
 49 |             print()
 50 | 
 51 |     users = np.array(users)
 52 |     vectors = np.array(vectors)
 53 | 
 54 |     standard_embeddings = UMAP(
 55 |         random_state=42,
 56 |         n_components=2,
 57 |         n_neighbors=n_neighbors,
 58 |         min_dist=min_dist,
 59 |         metric='cosine', **kwargs
 60 |     ).fit_transform(vectors)
 61 |     print("Projection complete")
 62 | 
 63 |     # NOTE: **kwargs is forwarded to both UMAP (above) and HDBSCAN (below),
 64 |     # so any extra keyword must be valid for both.
 65 |     clusterer = cluster_embeddings(standard_embeddings, **kwargs)
 66 | 
 67 |     # Save users, raw vectors, 2-D projection and cluster labels together.
 68 |     with open(save_at + '.cluster', 'wb') as npz_file:
 69 |         np.savez(npz_file, users=np.array(users), vectors=np.array(vectors),
 70 |                  umap=np.array(standard_embeddings), clusters=np.array(clusterer.labels_))
 71 |     # Also write a plain-text "user<TAB>cluster" listing.
 72 |     with open(save_at + '.clusters.txt', mode='w') as output_file:
 73 |         for user, label in zip(users, clusterer.labels_):
 74 |             output_file.write(str(user) + '\t' + str(label) + '\n')
 75 | 
 76 | 
 77 | def plot_clusters_no_labels(embeddings_path, clusters_col="clusters", green_label="pro", red_label='anti', align=False,
 78 |                             title=None, include_ratio=True, labeled_only=False):
 79 |     if title is None:
 80 |         title = ntpath.basename(embeddings_path).split('.')[0]
 81 |     f = np.load(embeddings_path)
 82 |     users = f["users"]
 83 |     clusters = f[clusters_col]
 84 |     cluster_ratio = round(sum(clusters >= 0) * 100 / len(clusters), 2)
 85 |     em = f["umap"]
 86 | 
 87 |     ind = clusters >= 0
 88 |     users = users[ind]
 89 |     clusters = clusters[ind]
 90 |     em = em[ind, :]
 91 |     c = ['red', 'blue', 'green', 'black', 'orange', 'teal']
 92 |     if align:
 93 |         d = align_clusters_with_labels(
 94 |             pd.DataFrame({"username": users, "clusters": clusters})
 95 |         )
 96 |         c = ['red', 'blue', 'green', 'black', 'orange', 'teal', 'olive', 'yellow']
 97 |     else:
 98 |         labels_dict = {}
 99 | 
100 |     cmap = list()
101 |     for i in range(len(clusters)):
102 |         cmap.append(c[(clusters[i] - 1) % len(c)])  # wrap around so extra clusters reuse colours
103 | 
104 |     fig = plt.figure()
105 |     ax = fig.add_subplot(111)
106 |     scatter = plt.scatter(em[:, 0], em[:, 1], c=cmap,
107 |                           s=0.5)  # colours are explicit names, so no colormap is needed
108 |     ax.set_title(title, fontsize=22)
109 |     plt.show()
110 |     return scatter
111 | 
112 | 
113 | def align_clusters_with_labels(df, allow_multiple_clusters=True):
114 |     df = df[df.clusters >= 0]
115 |     g = df.groupby(["label", "clusters"]).count().sort_values("username", ascending=False)
116 | 
117 |     d = {}
118 |     while len(g) > 0:
119 |         label, cluster = g.index[0]
120 |         d[cluster] = label
121 |         g = g.reset_index()
122 |         g = g[(g.label != label) & (g.clusters != cluster)] \
123 |             .set_index(["label", "clusters"]) \
124 |             .sort_values("username", ascending=False)
125 |     unlabeled_clusters = set(df.clusters) - set(d.keys())
126 |     if allow_multiple_clusters and len(unlabeled_clusters) > 0:
127 |         g = df.groupby(["label", "clusters"]).count().sort_values("username", ascending=False).reset_index()
128 |         for c in unlabeled_clusters:
129 |             l = g.set_index("clusters").loc[c].label
130 |             if isinstance(l, pd.Series):
131 |                 l = l.iloc[0]
132 |             d[c] = l
133 | 
134 |             g = g[g.clusters != c]
135 | 
136 |     return d
137 | 
138 | 
139 | def cluster_embeddings(standard_embedding,
140 |                        min_cluster_size=None,
141 |                        min_samples=None,
142 |                        plot_tree=False,
143 |                        min_samples_div=1000,
144 |                        min_cluster_size_div=100,
145 |                        **kwargs):
146 |     if min_cluster_size is None:
147 |         min_cluster_size = max(10, len(standard_embedding) // min_cluster_size_div)
148 |     if min_samples is None:
149 |         min_samples = max(10, len(standard_embedding) // min_samples_div)
150 |     clusterer = HDBSCAN(
151 |         min_samples=min_samples,
152 |         min_cluster_size=min_cluster_size, **kwargs
153 |     ).fit(standard_embedding)
154 |     if plot_tree:
155 |         clusterer.condensed_tree_.plot()
156 |     # return clusterer.labels_, clusterer.condensed_tree_
157 |     return clusterer
158 | 
159 | 
160 | if __name__ == "__main__":
161 |     embed = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')  # You can use different encoders here
162 | 
163 |     inputFile = sys.argv[1]  # ex. trump.tsv
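164 |     # dtype=str keeps numeric user IDs usable with .str.strip(); on pandas >= 1.3 use on_bad_lines='skip' instead of error_bad_lines
164 |     df_text = pd.read_csv(inputFile, header=None, usecols=[0, 1], dtype=str, error_bad_lines=False, sep='\t')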
165 |     df_text.columns = ['User', 'Text']
166 |     df_text = df_text.apply(lambda s: s.str.strip())
167 |     cluster_users(df_text, embed, user_col='User', tweet_col='Text', save_at=inputFile + '.npz')
168 |     plot_clusters_no_labels(inputFile + '.npz.cluster')
169 | 


--------------------------------------------------------------------------------
/demo.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# Preprocessing\n",
  8 |     "- Remove URLs and Mentions\n",
  9 |     "- Separate composite camel case words (e.g. BlackLives --> black lives)\n",
 10 |     "- Remove non-alphanumeric characters\n",
 11 |     "- Replace numbers with the token \"_number_\"\n",
 12 |     "- Lowercase everything"
 13 |    ]
 14 |   },
 15 |   {
 16 |    "cell_type": "code",
 17 |    "execution_count": 1,
 18 |    "metadata": {
 19 |     "ExecuteTime": {
 20 |      "end_time": "2020-06-07T23:45:10.994786Z",
 21 |      "start_time": "2020-06-07T23:45:10.990361Z"
 22 |     }
 23 |    },
 24 |    "outputs": [],
 25 |    "source": [
 26 |     "from src.preprocessing import clean"
 27 |    ]
 28 |   },
 29 |   {
 30 |    "cell_type": "code",
 31 |    "execution_count": 3,
 32 |    "metadata": {
 33 |     "ExecuteTime": {
 34 |      "end_time": "2020-06-07T23:46:12.708646Z",
 35 |      "start_time": "2020-06-07T23:46:12.680332Z"
 36 |     }
 37 |    },
 38 |    "outputs": [
 39 |     {
 40 |      "data": {
 41 |       "text/plain": [
 42 |        "'black lives matter'"
 43 |       ]
 44 |      },
 45 |      "execution_count": 3,
 46 |      "metadata": {},
 47 |      "output_type": "execute_result"
 48 |     }
 49 |    ],
 50 |    "source": [
 51 |     "clean(\"#BlackLivesMatter https://www.google.com/ @realDonaldTrump\")"
 52 |    ]
 53 |   },
 54 |   {
 55 |    "cell_type": "markdown",
 56 |    "metadata": {},
 57 |    "source": [
 58 |     "# Encoding\n",
 59 |     "## Universal Sentence Encoder"
 60 |    ]
 61 |   },
 62 |   {
 63 |    "cell_type": "code",
 64 |    "execution_count": 1,
 65 |    "metadata": {
 66 |     "ExecuteTime": {
 67 |      "end_time": "2020-06-07T23:55:34.962132Z",
 68 |      "start_time": "2020-06-07T23:55:33.625304Z"
 69 |     }
 70 |    },
 71 |    "outputs": [],
 72 |    "source": [
 73 |     "from src.encoder import Encoder"
 74 |    ]
 75 |   },
 76 |   {
 77 |    "cell_type": "code",
 78 |    "execution_count": 2,
 79 |    "metadata": {
 80 |     "ExecuteTime": {
 81 |      "end_time": "2020-06-07T23:55:40.034473Z",
 82 |      "start_time": "2020-06-07T23:55:37.473873Z"
 83 |     }
 84 |    },
 85 |    "outputs": [],
 86 |    "source": [
 87 |     "# default encoder is USE for English only.\n",
 88 |     "# But you can use multilingual as well, like ...\n",
 89 |     "encoder = Encoder(model_url=\"https://tfhub.dev/google/universal-sentence-encoder-multilingual/3\")"
 90 |    ]
 91 |   },
 92 |   {
 93 |    "cell_type": "code",
 94 |    "execution_count": 5,
 95 |    "metadata": {
 96 |     "ExecuteTime": {
 97 |      "end_time": "2020-06-07T23:55:57.481777Z",
 98 |      "start_time": "2020-06-07T23:55:57.448134Z"
 99 |     }
100 |    },
101 |    "outputs": [
102 |     {
103 |      "data": {
104 |       "text/plain": [
105 |        "array([[ 9.02929604e-02,  2.53139641e-02, -8.63993599e-04,\n",
106 |        "         3.37017924e-02, -6.26476333e-02, -4.42366041e-02,\n",
107 |        "         2.18325537e-02,  5.37963435e-02, -8.38939548e-02,\n",
108 |        "        -9.51755140e-03, -3.12121455e-02, -5.35302460e-02,\n",
109 |        "        -4.03429270e-02, -6.45988435e-02, -4.22783829e-02,\n",
110 |        "         6.87545631e-03,  2.68412735e-02,  1.69232395e-02,\n",
111 |        "         4.52055521e-02, -7.21441209e-02,  7.80028552e-02,\n",
112 |        "         7.60580525e-02, -4.91863601e-02, -3.33283916e-02,\n",
113 |        "        -6.48475764e-03,  5.31073436e-02,  5.94470128e-02,\n",
114 |        "         4.97598015e-02, -5.83836809e-02,  4.62118129e-04,\n",
115 |        "        -2.54417248e-02, -4.07968946e-02,  2.24085082e-03,\n",
116 |        "        -5.71764819e-02,  3.96157652e-02, -5.56416325e-02,\n",
117 |        "         1.06351763e-01, -2.11038422e-02, -4.97004427e-02,\n",
118 |        "         1.37671484e-02,  2.52124630e-02,  6.93862326e-03,\n",
119 |        "        -8.78239796e-03, -4.25839275e-02, -7.41932988e-02,\n",
120 |        "         3.93395983e-02, -5.14756478e-02, -4.80900072e-02,\n",
121 |        "         2.03796737e-02,  4.60575111e-02, -5.39578963e-03,\n",
122 |        "         5.13799861e-02,  4.98849079e-02, -1.53071098e-02,\n",
123 |        "        -2.55209878e-02, -9.37783793e-02,  6.80431351e-02,\n",
124 |        "         5.42037040e-02,  5.88915544e-03,  3.77027579e-02,\n",
125 |        "         2.97001610e-03, -2.73854788e-02, -2.25164257e-02,\n",
126 |        "         2.94775404e-02, -4.49141003e-02, -1.22707179e-02,\n",
127 |        "         3.51123661e-02, -9.62661114e-03, -4.74585360e-03,\n",
128 |        "        -6.34014159e-02,  2.71070562e-02, -8.06257129e-03,\n",
129 |        "        -6.70747459e-02, -5.07746078e-02, -4.76036221e-02,\n",
130 |        "         7.18707684e-03,  3.65909301e-02, -3.67699936e-02,\n",
131 |        "         3.70868184e-02,  1.12690397e-01, -1.10753492e-01,\n",
132 |        "         1.88780073e-02,  6.59464002e-02,  4.53360453e-02,\n",
133 |        "         5.63650019e-02,  4.43356484e-02,  1.30171627e-02,\n",
134 |        "        -2.71456428e-02, -2.89043244e-02,  1.64611451e-02,\n",
135 |        "        -1.92087479e-02, -5.28771989e-02, -6.49906620e-02,\n",
136 |        "         3.41170616e-02, -1.70326754e-02, -3.64219025e-02,\n",
137 |        "        -6.40340447e-02, -4.20621075e-02,  5.49546070e-02,\n",
138 |        "        -2.71569863e-02,  1.50894774e-02, -8.34458917e-02,\n",
139 |        "         8.04520994e-02, -1.96695887e-02, -1.00700237e-01,\n",
140 |        "        -4.36882209e-03,  3.58170122e-02, -6.73646182e-02,\n",
141 |        "        -3.49581279e-02, -3.14205624e-02, -3.03178281e-02,\n",
142 |        "         4.55205292e-02, -4.74548228e-02, -3.40684615e-02,\n",
143 |        "         1.98236890e-02,  2.36651860e-02, -2.66083088e-02,\n",
144 |        "        -7.57225677e-02, -1.35216266e-02, -1.82724686e-03,\n",
145 |        "         2.93097384e-02, -3.48166339e-02,  2.47215275e-02,\n",
146 |        "         3.22892033e-02,  2.67713480e-02, -5.79769351e-02,\n",
147 |        "        -2.56844629e-02,  8.45318958e-02, -2.28709877e-02,\n",
148 |        "        -9.08638537e-03,  1.39732165e-02,  2.45238561e-02,\n",
149 |        "         3.46098132e-02,  5.28965704e-02, -7.04566389e-02,\n",
150 |        "        -4.86870706e-02,  2.34722588e-02,  2.37552300e-02,\n",
151 |        "        -5.92066869e-02,  8.27051178e-02, -4.00973856e-03,\n",
152 |        "        -3.28391939e-02, -6.41322583e-02,  5.77677647e-03,\n",
153 |        "        -8.03356245e-02,  2.29892787e-02, -9.79695190e-03,\n",
154 |        "        -6.78975321e-03, -8.75867438e-03, -4.90265042e-02,\n",
155 |        "         2.29285266e-02, -4.98827323e-02, -1.09793276e-01,\n",
156 |        "        -3.85776274e-02,  4.46549704e-04,  1.88573524e-02,\n",
157 |        "        -5.65416738e-02,  4.27371226e-02, -7.73091055e-03,\n",
158 |        "         9.66695976e-03, -8.21083859e-02,  6.33048778e-03,\n",
159 |        "        -3.02886646e-02,  1.40163992e-02, -5.77669404e-02,\n",
160 |        "         1.12527721e-01, -1.80120803e-02,  2.36992892e-02,\n",
161 |        "        -1.32897524e-02, -2.57620830e-02,  6.66455459e-03,\n",
162 |        "         4.19999138e-02, -4.12883908e-02, -9.73793678e-03,\n",
163 |        "         5.12045547e-02, -4.39418741e-02, -2.89999340e-02,\n",
164 |        "        -3.28261144e-02,  4.97053796e-03, -3.94377932e-02,\n",
165 |        "         7.66094103e-02, -1.74061339e-02, -4.58508246e-02,\n",
166 |        "         2.20106803e-02,  1.01143029e-02,  3.49179357e-02,\n",
167 |        "        -2.76056118e-02,  4.00061607e-02, -3.07031441e-02,\n",
168 |        "        -9.87935532e-03, -3.51552591e-02,  6.12977035e-02,\n",
169 |        "        -1.34940445e-02,  1.69758976e-03, -9.62444022e-03,\n",
170 |        "        -1.15804393e-02,  3.89520489e-02,  8.71613845e-02,\n",
171 |        "         6.13522753e-02, -3.57693098e-02,  5.38780093e-02,\n",
172 |        "        -9.76623152e-04, -3.23415212e-02,  6.76710904e-02,\n",
173 |        "         9.33619123e-03, -8.11213255e-03,  6.82704076e-02,\n",
174 |        "         5.05042709e-02, -8.47424865e-02, -5.89879490e-02,\n",
175 |        "         7.25789368e-02, -2.18459424e-02,  4.00722250e-02,\n",
176 |        "        -7.63654150e-03,  1.03146099e-02,  5.40494919e-02,\n",
177 |        "         1.61888842e-02,  4.32131365e-02,  5.60503006e-02,\n",
178 |        "        -8.37420970e-02, -1.66589953e-02, -1.09322891e-02,\n",
179 |        "         3.10896002e-02,  2.87623964e-02,  7.79771879e-02,\n",
180 |        "        -3.36286873e-02,  7.37195835e-02, -1.33916633e-02,\n",
181 |        "         4.63935211e-02,  6.50074799e-03,  3.98444422e-02,\n",
182 |        "        -3.78602743e-02,  4.35293913e-02, -2.90157180e-02,\n",
183 |        "         1.25429835e-02,  3.68853062e-02, -1.06367087e-02,\n",
184 |        "        -7.20745325e-02, -2.14768406e-02, -5.44496030e-02,\n",
185 |        "        -6.13776930e-02,  1.05972447e-01,  1.43837687e-02,\n",
186 |        "         1.63943495e-03, -7.05509707e-02, -2.42533088e-02,\n",
187 |        "         3.51534374e-02,  1.08488565e-02, -3.42009105e-02,\n",
188 |        "        -6.08277731e-02,  9.20248553e-02,  2.36441251e-02,\n",
189 |        "         6.30925670e-02,  6.67787269e-02, -6.49841651e-02,\n",
190 |        "         3.18379910e-03,  1.96745917e-02, -1.01103224e-02,\n",
191 |        "         1.94480140e-02, -8.43841955e-02, -8.75772089e-02,\n",
192 |        "        -3.86252701e-02,  1.45352371e-02,  9.57477372e-03,\n",
193 |        "        -4.48818831e-03, -3.39164175e-02, -3.36552039e-02,\n",
194 |        "        -1.33386850e-02, -1.90982111e-02,  4.97365110e-02,\n",
195 |        "         2.88681649e-02,  5.77684073e-03, -7.05776513e-02,\n",
196 |        "         6.44142255e-02,  6.41829073e-02, -5.80542684e-02,\n",
197 |        "        -2.47256700e-02, -8.52649808e-02,  1.60062127e-02,\n",
198 |        "        -1.22919763e-02,  2.45415065e-02,  1.95066840e-03,\n",
199 |        "        -1.10592833e-02,  1.55704357e-02,  1.52007127e-02,\n",
200 |        "        -4.29791175e-02, -3.13911252e-02, -2.85093561e-02,\n",
201 |        "        -3.46784219e-02,  1.07909925e-02, -5.69052845e-02,\n",
202 |        "        -6.56142086e-02, -2.42444035e-03, -2.36847496e-04,\n",
203 |        "        -2.00943090e-02, -1.87727269e-02,  2.44390406e-02,\n",
204 |        "        -3.73762026e-02, -4.07696702e-02,  6.48761019e-02,\n",
205 |        "         3.38231586e-02,  6.30460605e-02,  5.82951354e-03,\n",
206 |        "        -2.62612291e-02,  6.19867910e-03, -2.10380126e-02,\n",
207 |        "         1.71352222e-04,  1.86081007e-02,  4.34052311e-02,\n",
208 |        "        -4.80737984e-02,  6.99277669e-02,  4.66579907e-02,\n",
209 |        "        -1.48551473e-02,  3.29916701e-02, -9.36777145e-03,\n",
210 |        "         7.43718967e-02, -5.12492396e-02, -6.02555387e-02,\n",
211 |        "        -6.58557117e-02, -3.25691588e-02, -9.58766192e-02,\n",
212 |        "         5.89718446e-02, -9.34590474e-02, -2.11967360e-02,\n",
213 |        "        -5.53228594e-02,  7.27902120e-03, -5.82117960e-03,\n",
214 |        "         1.51520390e-02,  3.10048033e-02,  2.35684924e-02,\n",
215 |        "        -1.24157164e-02, -3.03980522e-02, -1.04722142e-01,\n",
216 |        "         1.43642910e-02,  4.62585362e-03, -7.37912394e-03,\n",
217 |        "        -5.35621382e-02,  3.15730758e-02, -8.77389833e-02,\n",
218 |        "         5.22329099e-02, -9.48735885e-03,  4.54171449e-02,\n",
219 |        "        -8.38277936e-02, -3.25404741e-02,  7.16998801e-03,\n",
220 |        "         6.80265725e-02, -2.07673144e-02, -4.05646153e-02,\n",
221 |        "         1.34903835e-02,  3.22747529e-02, -4.12309058e-02,\n",
222 |        "         2.79887812e-03,  7.98721611e-03,  6.17843941e-02,\n",
223 |        "         4.60151024e-03,  9.92045365e-03,  5.00864871e-02,\n",
224 |        "        -5.63305654e-02, -3.88379730e-02,  3.02622397e-03,\n",
225 |        "         2.20519323e-02,  1.54148676e-02,  4.85269316e-02,\n",
226 |        "        -5.63364588e-02, -3.73017862e-02, -3.11127473e-02,\n",
227 |        "         1.61838830e-02, -6.77759647e-02, -9.11579654e-02,\n",
228 |        "         4.67085131e-02, -4.00679782e-02, -3.72959077e-02,\n",
229 |        "        -3.94075289e-02, -1.12072146e-02,  1.26367714e-02,\n",
230 |        "         4.40460369e-02, -7.77020901e-02, -2.46636514e-02,\n",
231 |        "        -1.49408458e-02,  5.86274220e-03, -7.11899400e-02,\n",
232 |        "        -1.15099251e-02,  3.33920382e-02,  5.09477453e-03,\n",
233 |        "         1.51081178e-02,  1.04949502e-02, -5.80682866e-02,\n",
234 |        "        -3.40924747e-02, -3.48201320e-02,  2.49468200e-02,\n",
235 |        "         4.42005768e-02, -2.37165336e-02,  3.79255484e-03,\n",
236 |        "        -8.86938721e-02, -1.56422518e-02,  4.10543345e-02,\n",
237 |        "         4.47053164e-02,  5.43537475e-02,  5.49245300e-03,\n",
238 |        "         7.09640309e-02,  1.93180814e-02, -3.05815432e-02,\n",
239 |        "        -6.89341733e-03, -3.62095423e-02, -3.08503956e-03,\n",
240 |        "         6.30579367e-02,  4.35884781e-02,  1.84933823e-02,\n",
241 |        "        -7.83578958e-03,  2.59191096e-02, -5.52807143e-03,\n",
242 |        "        -4.72284009e-05,  2.06883010e-02, -1.38790896e-02,\n",
243 |        "         5.72590455e-02,  3.44927758e-02,  2.15114728e-02,\n",
244 |        "         2.95498725e-02, -5.41498102e-02, -8.79013725e-03,\n",
245 |        "         7.38454312e-02,  1.96587350e-02,  1.34385750e-02,\n",
246 |        "        -5.90348169e-02, -5.32622188e-02,  3.93599793e-02,\n",
247 |        "        -4.86550853e-02, -3.91548872e-02,  4.74032760e-02,\n",
248 |        "         1.50756622e-02,  6.87927082e-02, -3.02066337e-02,\n",
249 |        "        -2.66485778e-03, -2.01581307e-02,  5.31393997e-02,\n",
250 |        "         1.00522246e-02, -1.83966588e-02, -4.26581167e-02,\n",
251 |        "        -2.71374499e-03,  9.05769784e-03, -4.29850779e-02,\n",
252 |        "        -1.37065900e-02, -6.19315952e-02, -6.49061725e-02,\n",
253 |        "         4.96972874e-02, -9.45900939e-03,  7.37345219e-02,\n",
254 |        "         5.60122356e-02, -2.91699544e-02, -1.58697236e-02,\n",
255 |        "        -6.30429089e-02,  4.82642651e-02,  5.28050645e-04,\n",
256 |        "         3.94601114e-02, -7.31267557e-02,  3.35745700e-02,\n",
257 |        "         2.48057507e-02, -4.80459072e-02, -1.18432520e-02,\n",
258 |        "        -4.43868563e-02,  3.98386568e-02, -5.34126982e-02,\n",
259 |        "         5.74409105e-02,  1.32571915e-02, -2.18527261e-02,\n",
260 |        "         1.10984361e-02, -1.16096223e-02,  6.81838691e-02,\n",
261 |        "         3.42932194e-02, -8.47309604e-02, -4.01029214e-02,\n",
262 |        "        -3.77797745e-02, -6.41229227e-02,  3.81232128e-02,\n",
263 |        "         2.52712779e-02,  1.10559305e-02,  9.84640513e-03,\n",
264 |        "        -1.20055480e-02, -9.66666546e-03, -5.53334281e-02,\n",
265 |        "         2.41286773e-02,  1.00961186e-01, -1.27077922e-02,\n",
266 |        "        -4.23806421e-02, -1.07549950e-02, -3.54763754e-02,\n",
267 |        "         5.58016337e-02, -7.87500739e-02, -3.64025608e-02,\n",
268 |        "        -2.90403571e-02,  5.37508540e-02, -2.00727507e-02,\n",
269 |        "         3.00442167e-02,  3.45369503e-02, -3.68632935e-02,\n",
270 |        "         1.22389954e-03, -6.67770281e-02,  2.16749627e-02,\n",
271 |        "         3.61376889e-02, -4.56607640e-02,  2.02212632e-02,\n",
272 |        "         4.63767387e-02, -4.86524552e-02,  5.23989350e-02,\n",
273 |        "         1.38630597e-02,  5.03290556e-02,  5.27634881e-02,\n",
274 |        "        -3.88095379e-02, -1.34635530e-02, -7.79085681e-02,\n",
275 |        "         1.63281877e-02, -7.12259766e-03]], dtype=float32)"
276 |       ]
277 |      },
278 |      "execution_count": 5,
279 |      "metadata": {},
280 |      "output_type": "execute_result"
281 |     }
282 |    ],
283 |    "source": [
284 |     "encoder.encode(\"hello world\")"
285 |    ]
286 |   },
287 |   {
288 |    "cell_type": "code",
289 |    "execution_count": 10,
290 |    "metadata": {
291 |     "ExecuteTime": {
292 |      "end_time": "2020-06-07T23:58:33.082587Z",
293 |      "start_time": "2020-06-07T23:58:33.053011Z"
294 |     }
295 |    },
296 |    "outputs": [
297 |     {
298 |      "data": {
299 |       "text/html": [
300 |        "<div>\n",
301 |        "<style scoped>\n",
302 |        "    .dataframe tbody tr th:only-of-type {\n",
303 |        "        vertical-align: middle;\n",
304 |        "    }\n",
305 |        "\n",
306 |        "    .dataframe tbody tr th {\n",
307 |        "        vertical-align: top;\n",
308 |        "    }\n",
309 |        "\n",
310 |        "    .dataframe thead th {\n",
311 |        "        text-align: right;\n",
312 |        "    }\n",
313 |        "</style>\n",
314 |        "<table border=\"1\" class=\"dataframe\">\n",
315 |        "  <thead>\n",
316 |        "    <tr style=\"text-align: right;\">\n",
317 |        "      <th></th>\n",
318 |        "      <th>username</th>\n",
319 |        "      <th>text</th>\n",
320 |        "    </tr>\n",
321 |        "  </thead>\n",
322 |        "  <tbody>\n",
323 |        "    <tr>\n",
324 |        "      <th>0</th>\n",
325 |        "      <td>user1</td>\n",
326 |        "      <td>hello world</td>\n",
327 |        "    </tr>\n",
328 |        "    <tr>\n",
329 |        "      <th>1</th>\n",
330 |        "      <td>user1</td>\n",
331 |        "      <td>merhaba dunya</td>\n",
332 |        "    </tr>\n",
333 |        "    <tr>\n",
334 |        "      <th>2</th>\n",
335 |        "      <td>user2</td>\n",
336 |        "      <td>Bonjour le monde</td>\n",
337 |        "    </tr>\n",
338 |        "    <tr>\n",
339 |        "      <th>3</th>\n",
340 |        "      <td>user2</td>\n",
341 |        "      <td>مرحبا بالعالم</td>\n",
342 |        "    </tr>\n",
343 |        "  </tbody>\n",
344 |        "</table>\n",
345 |        "</div>"
346 |       ],
347 |       "text/plain": [
348 |        "  username              text\n",
349 |        "0    user1       hello world\n",
350 |        "1    user1     merhaba dunya\n",
351 |        "2    user2  Bonjour le monde\n",
352 |        "3    user2     مرحبا بالعالم"
353 |       ]
354 |      },
355 |      "execution_count": 10,
356 |      "metadata": {},
357 |      "output_type": "execute_result"
358 |     }
359 |    ],
360 |    "source": [
361 |     "import pandas as pd\n",
362 |     "\n",
363 |     "df = pd.DataFrame({\n",
364 |     "    \"username\": [\"user1\", \"user1\", \"user2\", \"user2\"],\n",
365 |     "    \"text\": [\"hello world\", \"merhaba dunya\", \"Bonjour le monde\", \"مرحبا بالعالم\"]\n",
366 |     "})\n",
367 |     "df"
368 |    ]
369 |   },
370 |   {
371 |    "cell_type": "code",
372 |    "execution_count": 11,
373 |    "metadata": {
374 |     "ExecuteTime": {
375 |      "end_time": "2020-06-07T23:59:26.926654Z",
376 |      "start_time": "2020-06-07T23:59:26.886003Z"
377 |     }
378 |    },
379 |    "outputs": [
380 |     {
381 |      "name": "stderr",
382 |      "output_type": "stream",
383 |      "text": [
384 |       "100%|██████████| 2/2 [00:00<00:00, 95.93it/s]\n"
385 |      ]
386 |     }
387 |    ],
388 |    "source": [
389 |     "encoder.encode_df(df, user_col=\"username\", text_col=\"text\", out_path=\"demo.npz\")"
390 |    ]
391 |   },
392 |   {
393 |    "cell_type": "code",
394 |    "execution_count": 12,
395 |    "metadata": {
396 |     "ExecuteTime": {
397 |      "end_time": "2020-06-07T23:59:38.423912Z",
398 |      "start_time": "2020-06-07T23:59:38.418479Z"
399 |     }
400 |    },
401 |    "outputs": [
402 |     {
403 |      "data": {
404 |       "text/plain": [
405 |        "array(['user1', 'user2'], dtype='<U5')"
406 |       ]
407 |      },
408 |      "execution_count": 12,
409 |      "metadata": {},
410 |      "output_type": "execute_result"
411 |     }
412 |    ],
413 |    "source": [
414 |     "f = np.load(\"demo.npz\")\n",
415 |     "f[\"users\"]"
416 |    ]
417 |   },
418 |   {
419 |    "cell_type": "code",
420 |    "execution_count": 13,
421 |    "metadata": {
422 |     "ExecuteTime": {
423 |      "end_time": "2020-06-07T23:59:42.624482Z",
424 |      "start_time": "2020-06-07T23:59:42.610824Z"
425 |     }
426 |    },
427 |    "outputs": [
428 |     {
429 |      "data": {
430 |       "text/plain": [
431 |        "array([[ 0.1271717 ,  0.00337878, -0.00207015, ..., -0.03947673,\n",
432 |        "         0.02093398, -0.00205181],\n",
433 |        "       [ 0.07556362, -0.00347732, -0.01797179, ..., -0.0588544 ,\n",
434 |        "         0.02632636, -0.00710142]], dtype=float32)"
435 |       ]
436 |      },
437 |      "execution_count": 13,
438 |      "metadata": {},
439 |      "output_type": "execute_result"
440 |     }
441 |    ],
442 |    "source": [
443 |     "f[\"vectors\"]"
444 |    ]
445 |   },
446 |   {
447 |    "cell_type": "code",
448 |    "execution_count": 14,
449 |    "metadata": {
450 |     "ExecuteTime": {
451 |      "end_time": "2020-06-07T23:59:52.972054Z",
452 |      "start_time": "2020-06-07T23:59:52.959046Z"
453 |     }
454 |    },
455 |    "outputs": [
456 |     {
457 |      "data": {
458 |       "text/plain": [
459 |        "(2, 512)"
460 |       ]
461 |      },
462 |      "execution_count": 14,
463 |      "metadata": {},
464 |      "output_type": "execute_result"
465 |     }
466 |    ],
467 |    "source": [
468 |     "f[\"vectors\"].shape"
469 |    ]
470 |   },
471 |   {
472 |    "cell_type": "markdown",
473 |    "metadata": {},
474 |    "source": [
475 |     "## BERT"
476 |    ]
477 |   },
478 |   {
479 |    "cell_type": "code",
480 |    "execution_count": 15,
481 |    "metadata": {
482 |     "ExecuteTime": {
483 |      "end_time": "2020-06-08T00:00:16.428320Z",
484 |      "start_time": "2020-06-08T00:00:16.423172Z"
485 |     }
486 |    },
487 |    "outputs": [],
488 |    "source": [
489 |     "from src.encoder import EncoderBERT\n",
490 |     "\n",
491 |     "encoder = EncoderBERT(\"roberta-base-nli-stsb-mean-tokens\")\n",
492 |     "\n",
493 |     "# EncoderBERT exposes the same API as the encoder above: encode() and encode_df()"
494 |    ]
495 |   },
496 |   {
497 |    "cell_type": "markdown",
498 |    "metadata": {},
499 |    "source": [
500 |     "# Projection"
501 |    ]
502 |   },
503 |   {
504 |    "cell_type": "code",
505 |    "execution_count": 17,
506 |    "metadata": {
507 |     "ExecuteTime": {
508 |      "end_time": "2020-06-08T00:07:17.361732Z",
509 |      "start_time": "2020-06-08T00:07:17.338281Z"
510 |     }
511 |    },
512 |    "outputs": [],
513 |    "source": [
514 |     "from src.projection import Projector, os"
515 |    ]
516 |   },
517 |   {
518 |    "cell_type": "code",
519 |    "execution_count": 18,
520 |    "metadata": {
521 |     "ExecuteTime": {
522 |      "end_time": "2020-06-08T00:14:15.163004Z",
523 |      "start_time": "2020-06-08T00:14:15.159439Z"
524 |     }
525 |    },
526 |    "outputs": [],
527 |    "source": [
528 |     "projector = Projector(\"election_vectors.npz\")"
529 |    ]
530 |   },
531 |   {
532 |    "cell_type": "code",
533 |    "execution_count": 20,
534 |    "metadata": {
535 |     "ExecuteTime": {
536 |      "end_time": "2020-06-08T00:16:09.544929Z",
537 |      "start_time": "2020-06-08T00:16:09.537875Z"
538 |     }
539 |    },
540 |    "outputs": [],
541 |    "source": [
542 |     "projector.project(\"projections.npz\", min_counts=3, min_dist=0.0, n_neighbors=30)"
543 |    ]
544 |   },
545 |   {
546 |    "cell_type": "code",
547 |    "execution_count": 21,
548 |    "metadata": {
549 |     "ExecuteTime": {
550 |      "end_time": "2020-06-08T00:16:24.565870Z",
551 |      "start_time": "2020-06-08T00:16:24.558409Z"
552 |     }
553 |    },
554 |    "outputs": [],
555 |    "source": [
556 |     "os.makedirs(\"trials\", exist_ok=True)  # safe to re-run if the directory already exists\n",
557 |     "\n",
558 |     "projector.grid_search(\n",
559 |     "    trials_dir=\"trials\",\n",
560 |     "    min_counts=3,  # require at least 3 tweets per user\n",
561 |     "    min_dists_range=[0.0, 0.1, 0.25, 0.5, 0.75, 0.8, 0.9, 0.99],\n",
562 |     "    n_neighbors_range=[20, 30, 40, 50, 60, 70, 80, 90, 100]\n",
563 |     ")"
564 |    ]
565 |   },
566 |   {
567 |    "cell_type": "markdown",
568 |    "metadata": {},
569 |    "source": [
570 |     "# Clustering"
571 |    ]
572 |   },
573 |   {
574 |    "cell_type": "code",
575 |    "execution_count": 25,
576 |    "metadata": {
577 |     "ExecuteTime": {
578 |      "end_time": "2020-06-08T00:19:11.426399Z",
579 |      "start_time": "2020-06-08T00:19:11.420912Z"
580 |     }
581 |    },
582 |    "outputs": [],
583 |    "source": [
584 |     "from src.clustering import Clusterer, pickle"
585 |    ]
586 |   },
587 |   {
588 |    "cell_type": "code",
589 |    "execution_count": 22,
590 |    "metadata": {
591 |     "ExecuteTime": {
592 |      "end_time": "2020-06-08T00:17:49.673292Z",
593 |      "start_time": "2020-06-08T00:17:49.667639Z"
594 |     }
595 |    },
596 |    "outputs": [],
597 |    "source": [
598 |     "clusterer = Clusterer(\"projections.npz\")"
599 |    ]
600 |   },
601 |   {
602 |    "cell_type": "code",
603 |    "execution_count": 34,
604 |    "metadata": {
605 |     "ExecuteTime": {
606 |      "end_time": "2020-06-08T00:21:46.942189Z",
607 |      "start_time": "2020-06-08T00:21:46.174795Z"
608 |     }
609 |    },
610 |    "outputs": [
611 |     {
612 |      "data": {
613 |       "text/plain": [
614 |        "<matplotlib.axes._subplots.AxesSubplot at 0x7fa1bdfd9908>"
615 |       ]
616 |      },
617 |      "execution_count": 34,
618 |      "metadata": {},
619 |      "output_type": "execute_result"
620 |     },
621 |     {
622 |      "data": {
624 |       "text/plain": [
625 |        "<Figure size 1080x720 with 2 Axes>"
626 |       ]
627 |      },
628 |      "metadata": {},
629 |      "output_type": "display_data"
630 |     }
631 |    ],
632 |    "source": [
633 |     "clusterer.cluster(tree_path=\"tree.pkl\")\n",
634 |     "\n",
635 |     "clusterer.plot_tree(path=\"tree.pkl\")"
636 |    ]
637 |   },
638 |   {
639 |    "cell_type": "markdown",
640 |    "metadata": {},
641 |    "source": [
642 |     "## Injecting labels"
643 |    ]
644 |   },
645 |   {
646 |    "cell_type": "code",
647 |    "execution_count": 36,
648 |    "metadata": {
649 |     "ExecuteTime": {
650 |      "end_time": "2020-06-08T01:01:34.529613Z",
651 |      "start_time": "2020-06-08T01:01:34.509232Z"
652 |     }
653 |    },
654 |    "outputs": [
655 |     {
656 |      "data": {
657 |       "text/html": [
658 |        "<div>\n",
659 |        "<style scoped>\n",
660 |        "    .dataframe tbody tr th:only-of-type {\n",
661 |        "        vertical-align: middle;\n",
662 |        "    }\n",
663 |        "\n",
664 |        "    .dataframe tbody tr th {\n",
665 |        "        vertical-align: top;\n",
666 |        "    }\n",
667 |        "\n",
668 |        "    .dataframe thead th {\n",
669 |        "        text-align: right;\n",
670 |        "    }\n",
671 |        "</style>\n",
672 |        "<table border=\"1\" class=\"dataframe\">\n",
673 |        "  <thead>\n",
674 |        "    <tr style=\"text-align: right;\">\n",
675 |        "      <th></th>\n",
676 |        "      <th>username</th>\n",
677 |        "      <th>label</th>\n",
678 |        "    </tr>\n",
679 |        "  </thead>\n",
680 |        "  <tbody>\n",
681 |        "    <tr>\n",
682 |        "      <th>0</th>\n",
683 |        "      <td>user1</td>\n",
684 |        "      <td>pro</td>\n",
685 |        "    </tr>\n",
686 |        "    <tr>\n",
687 |        "      <th>1</th>\n",
688 |        "      <td>user2</td>\n",
689 |        "      <td>anti</td>\n",
690 |        "    </tr>\n",
691 |        "  </tbody>\n",
692 |        "</table>\n",
693 |        "</div>"
694 |       ],
695 |       "text/plain": [
696 |        "  username label\n",
697 |        "0    user1   pro\n",
698 |        "1    user2  anti"
699 |       ]
700 |      },
701 |      "execution_count": 36,
702 |      "metadata": {},
703 |      "output_type": "execute_result"
704 |     }
705 |    ],
706 |    "source": [
707 |     "# this is just an example. We are hiding the actual labels of real users here\n",
708 |     "labels = pd.DataFrame({\"username\":[\"user1\", \"user2\"], \"label\":[\"pro\", \"anti\"]})\n",
709 |     "labels"
710 |    ]
711 |   },
712 |   {
713 |    "cell_type": "code",
714 |    "execution_count": 38,
715 |    "metadata": {
716 |     "ExecuteTime": {
717 |      "end_time": "2020-06-08T01:21:09.279291Z",
718 |      "start_time": "2020-06-08T01:21:09.274152Z"
719 |     }
720 |    },
721 |    "outputs": [],
722 |    "source": [
723 |     "clusterer.inject_labels(users=labels.username, labels=labels.label)\n",
724 |     "\n",
725 |     "clusterer.align_clusters_with_labels(\n",
726 |     "    # this means multiple clusters can be assigned the same label\n",
727 |     "    allow_multiple_clusters=True\n",
728 |     ")"
729 |    ]
730 |   },
731 |   {
732 |    "cell_type": "markdown",
733 |    "metadata": {},
734 |    "source": [
735 |     "## Example on Turkish Election dataset"
736 |    ]
737 |   },
738 |   {
739 |    "cell_type": "code",
740 |    "execution_count": 39,
741 |    "metadata": {
742 |     "ExecuteTime": {
743 |      "end_time": "2020-06-08T01:23:28.426960Z",
744 |      "start_time": "2020-06-08T01:23:28.419726Z"
745 |     }
746 |    },
747 |    "outputs": [],
748 |    "source": [
749 |     "clusterer.plot()"
750 |    ]
751 |   },
752 |   {
753 |    "cell_type": "markdown",
754 |    "metadata": {},
755 |    "source": [
756 |     "<img src=\"ed.png\">"
757 |    ]
758 |   },
759 |   {
760 |    "cell_type": "markdown",
761 |    "metadata": {},
762 |    "source": [
763 |     "## Example on Trump dataset"
764 |    ]
765 |   },
766 |   {
767 |    "cell_type": "code",
768 |    "execution_count": 37,
769 |    "metadata": {
770 |     "ExecuteTime": {
771 |      "end_time": "2020-06-08T01:05:41.558112Z",
772 |      "start_time": "2020-06-08T01:05:41.554222Z"
773 |     }
774 |    },
775 |    "outputs": [],
776 |    "source": [
777 |     "# this calculates the micro F1 score for every UMAP configuration in the grid search\n",
778 |     "# and saves a plot of each configuration to the trials directory,\n",
779 |     "# then returns the results dictionary (min_dist -> n_neighbors -> score)\n",
780 |     "results = Clusterer.cluster_projection_grid_search(\n",
781 |     "    \"trials\", users=labels.username, labels=labels.label,\n",
782 |     "    # this means multiple clusters can be assigned the same label\n",
783 |     "    allow_multiple_clusters=True\n",
784 |     ")"
785 |    ]
786 |   },
787 |   {
788 |    "cell_type": "markdown",
789 |    "metadata": {},
790 |    "source": [
791 |     "Example of plotted projections and grid search heatmap\n",
792 |     "<img src=\"trials/hm.png?\">\n",
793 |     "<img src=\"trials/0.0_30.png?\">\n",
794 |     "<img src=\"trials/0.1_60.png?\">"
795 |    ]
796 |   }
797 |  ],
798 |  "metadata": {
799 |   "kernelspec": {
800 |    "display_name": "Python 3",
801 |    "language": "python",
802 |    "name": "python3"
803 |   },
804 |   "language_info": {
805 |    "codemirror_mode": {
806 |     "name": "ipython",
807 |     "version": 3
808 |    },
809 |    "file_extension": ".py",
810 |    "mimetype": "text/x-python",
811 |    "name": "python",
812 |    "nbconvert_exporter": "python",
813 |    "pygments_lexer": "ipython3",
814 |    "version": "3.6.9"
815 |   },
816 |   "toc": {
817 |    "base_numbering": 1,
818 |    "nav_menu": {},
819 |    "number_sections": true,
820 |    "sideBar": true,
821 |    "skip_h1_title": false,
822 |    "title_cell": "Table of Contents",
823 |    "title_sidebar": "Contents",
824 |    "toc_cell": false,
825 |    "toc_position": {},
826 |    "toc_section_display": true,
827 |    "toc_window_display": false
828 |   },
829 |   "varInspector": {
830 |    "cols": {
831 |     "lenName": 16,
832 |     "lenType": 16,
833 |     "lenVar": 40
834 |    },
835 |    "kernels_config": {
836 |     "python": {
837 |      "delete_cmd_postfix": "",
838 |      "delete_cmd_prefix": "del ",
839 |      "library": "var_list.py",
840 |      "varRefreshCmd": "print(var_dic_list())"
841 |     },
842 |     "r": {
843 |      "delete_cmd_postfix": ") ",
844 |      "delete_cmd_prefix": "rm(",
845 |      "library": "var_list.r",
846 |      "varRefreshCmd": "cat(var_dic_list()) "
847 |     }
848 |    },
849 |    "types_to_exclude": [
850 |     "module",
851 |     "function",
852 |     "builtin_function_or_method",
853 |     "instance",
854 |     "_Feature"
855 |    ],
856 |    "window_display": false
857 |   }
858 |  },
859 |  "nbformat": 4,
860 |  "nbformat_minor": 4
861 | }
862 | 


--------------------------------------------------------------------------------
/ed.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/ed.png


--------------------------------------------------------------------------------
/methodology_diagram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/methodology_diagram.png


--------------------------------------------------------------------------------
/src/AR_STOPWORDS.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/src/AR_STOPWORDS.pkl


--------------------------------------------------------------------------------
/src/__pycache__/clustering.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/src/__pycache__/clustering.cpython-36.pyc


--------------------------------------------------------------------------------
/src/__pycache__/encoder.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/src/__pycache__/encoder.cpython-36.pyc


--------------------------------------------------------------------------------
/src/__pycache__/preprocessing.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/src/__pycache__/preprocessing.cpython-36.pyc


--------------------------------------------------------------------------------
/src/__pycache__/projection.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/src/__pycache__/projection.cpython-36.pyc


--------------------------------------------------------------------------------
/src/__pycache__/top_terms.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/src/__pycache__/top_terms.cpython-36.pyc


--------------------------------------------------------------------------------
/src/clustering.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import pickle
  3 | from typing import Optional
  4 | 
  5 | import hdbscan
  6 | import matplotlib.pyplot as plt
  7 | import numpy as np
  8 | import pandas as pd
  9 | import seaborn as sns
 10 | from sklearn.metrics import classification_report, f1_score
 11 | from tqdm import tqdm
 12 | 
 13 | from projection import Projector
 14 | 
 15 | 
 16 | class Clusterer:
 17 | 
 18 |     def __init__(self, projection_path):
 19 |         self.projection_path = projection_path
 20 |         self._params = self._load_standard_embeddings()
 21 |         self.N: int = len(self._params["users"])
 22 | 
 23 |     def _load_standard_embeddings(self):
 24 |         file = np.load(self.projection_path, allow_pickle=True)
 25 |         params = dict()
 26 |         for k in file.keys():
 27 |             params[k] = file[k]
 28 |         return params
 29 | 
 30 |     @staticmethod
 31 |     def _cluster(standard_embeddings, **kwargs):
 32 |         return hdbscan.HDBSCAN(**kwargs).fit(standard_embeddings)
 33 | 
 34 |     def cluster(self, min_samples: Optional[int] = None, min_cluster_size: Optional[int] = None,
 35 |                 min_samples_divisor: int = 1000, min_cluster_size_divisor: int = 100,
 36 |                 tree_path=None,
 37 |                 **kwargs):
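 38 |         # Honor explicitly passed values; otherwise derive defaults from the number of users.
 39 |         kwargs["min_samples"] = min_samples if min_samples is not None else max(10, self.N // min_samples_divisor)
 40 |         kwargs["min_cluster_size"] = (min_cluster_size if min_cluster_size is not None
 41 |                                       else max(10, self.N // min_cluster_size_divisor))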
 42 | 
 43 |         model = self._cluster(standard_embeddings=self._params["umap"],
 44 |                               **kwargs
 45 |                               )
 46 | 
 47 |         self._params["clusters"] = model.labels_
 48 |         np.savez(open(self.projection_path, 'wb'), **self._params)
 49 |         if tree_path is not None:
 50 |             pickle.dump(model.condensed_tree_, open(tree_path, 'wb'), protocol=3)
 51 | 
 52 |     @staticmethod
 53 |     def plot_tree(path):
 54 |         sns.set(context='notebook', style='white', rc={'figure.figsize': (15, 10)})
 55 |         return pickle.load(open(path, 'rb')).plot()
 56 | 
 57 |     def plot(self, labels_col="clusters"):
 58 |         return Projector.plot(embeddings=self._params["umap"], labels=self._params[labels_col])
 59 | 
 60 |     def inject_labels(self, users, labels):
 61 |         labels_dict = dict(zip(users, labels))
 62 |         self._params["labels"] = np.array(
 63 |             [labels_dict[u] if u in labels_dict else 'unk' for u in self._params["users"]]
 64 |         )
 65 | 
 66 |     def align_clusters_with_labels(self, allow_multiple_clusters=True):
 67 |         labels = self._params["labels"]
 68 |         ind = labels != 'unk'
 69 |         users = self._params["users"][ind]
 70 |         labels = labels[ind]
 71 | 
 72 |         df = pd.DataFrame(
 73 |             {"username": users, "label": labels}
 74 |         ).merge(
 75 |             pd.DataFrame({"username": self._params["users"], "clusters": self._params["clusters"]})
 76 |         )
 77 | 
 78 |         g = df.groupby(["label", "clusters"]).count().sort_values("username", ascending=False)
 79 | 
 80 |         d = {}
 81 |         while len(g) > 0:
 82 |             label, cluster = g.index[0]
 83 |             d[cluster] = label
 84 |             g = g.reset_index()
 85 |             g = g[(g.label != label) & (g.clusters != cluster)].set_index(["label", "clusters"]).sort_values("username",
 86 |                                                                                                              ascending=False)
 87 |         unlabeled_clusters = set(df.clusters) - set(d.keys())
 88 |         if allow_multiple_clusters and len(unlabeled_clusters) > 0:
 89 |             g = df.groupby(["label", "clusters"]).count().sort_values("username", ascending=False).reset_index()
 90 |             for c in unlabeled_clusters:
 91 |                 l = g.set_index("clusters").loc[c].label
 92 |                 if isinstance(l, pd.Series):
 93 |                     l = l.iloc[0]
 94 |                 d[c] = l
 95 | 
 96 |                 g = g[g.clusters != c]
 97 | 
 98 |         self._params["predictions"] = np.array([d[x] if x in d else 'unk' for x in self._params['clusters']])
 99 | 
100 |     def evaluate(self, metric=f1_score, report=True):
101 |         if "predictions" not in self._params:
102 |             raise Exception("No labels aligned with clusters")
103 | 
104 |         y = self._params["labels"]
105 |         p = self._params["predictions"]
106 | 
107 |         ind = y != 'unk'
108 |         y = y[ind]
109 |         p = p[ind]
110 | 
111 |         s = sorted(set(y))  # scikit-learn expects an ordered array-like of labels, not a set
112 |         if report:
113 |             return pd.DataFrame(classification_report(y, p, labels=s, output_dict=True))
114 | 
115 |         return metric(y, p, labels=s, average='micro')
116 | 
117 |     @staticmethod
118 |     def cluster_projection_grid_search(trials_dir, users=None, labels=None, allow_multiple_clusters=True):
119 |         results = dict()
120 |         for fn in tqdm(os.listdir(trials_dir)):
121 |             if not fn.endswith("npz"):
122 |                 continue
123 |             min_dist, n_neighbors = fn.replace(".npz", '').split("_")
124 |             projection_path = os.path.join(trials_dir, fn)
125 |             c = Clusterer(projection_path)
126 |             c.cluster()
127 |             # title = f"min_dist:{min_dist}\tn_neighbors:{n_neighbors}".expandtabs()
128 |             plot_path = os.path.join(trials_dir, f"{min_dist}_{n_neighbors}.png")
129 |             c.inject_labels(users=users, labels=labels)
130 |             c.align_clusters_with_labels(allow_multiple_clusters=allow_multiple_clusters)
131 |             fig = c.plot()
132 |             plt.savefig(plot_path, bbox_inches='tight')
133 |             plt.close()
134 | 
135 |             score = c.evaluate(report=False)  # scalar micro-averaged F1 instead of the full report
136 |             results.setdefault(min_dist, dict())
137 |             results[min_dist][n_neighbors] = score
138 |         return results
139 | 
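
The greedy matching performed by `align_clusters_with_labels` can be illustrated on toy data. The sketch below is a simplified, standard-library variant rather than the repository's implementation: the function name `greedy_align`, the tie-breaking rule, and the toy pairs are assumptions made for illustration. It ranks (label, cluster) pairs by how many labeled users they share, assigns each label to at most one cluster in that order, and then optionally gives every leftover cluster its most frequent label.

```python
from collections import Counter


def greedy_align(pairs, allow_multiple_clusters=True):
    """Map cluster ids to labels from (label, cluster) pairs of labeled users."""
    counts = Counter(pairs)
    # Largest co-occurrence first; ties are broken by the pair itself (an assumption).
    ranked = sorted(counts, key=lambda pair: (-counts[pair], pair))

    mapping = {}
    used_labels = set()
    for label, cluster in ranked:
        # Skip pairs whose label or cluster was already consumed by a larger pair.
        if label in used_labels or cluster in mapping:
            continue
        mapping[cluster] = label
        used_labels.add(label)

    if allow_multiple_clusters:
        # Leftover clusters take the label they co-occur with most often.
        for label, cluster in ranked:
            mapping.setdefault(cluster, label)
    return mapping


# Hypothetical labeled users: three "pro" users in cluster 0, two "anti" in 1, two "pro" in 2.
toy_pairs = [("pro", 0)] * 3 + [("anti", 1)] * 2 + [("pro", 2)] * 2
mapping = greedy_align(toy_pairs)
print(mapping)  # {0: 'pro', 1: 'anti', 2: 'pro'}
```

With `allow_multiple_clusters=False`, cluster 2 stays unmapped, which corresponds to the `'unk'` prediction in the class above.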


--------------------------------------------------------------------------------
/src/encoder.py:
--------------------------------------------------------------------------------
 1 | import numpy as np
 2 | import pandas as pd
 3 | import tensorflow_hub as hub
 4 | # noinspection PyUnresolvedReferences
 5 | import tensorflow_text
 6 | from tqdm import tqdm
 7 | 
 8 | 
 9 | class Encoder:
10 |     DEFAULT_MODEL = "https://tfhub.dev/google/universal-sentence-encoder/4"
11 | 
12 |     def __init__(self, model_url: str = None):
13 |         """
14 |         Args:
15 |             model_url: str, URL of the Universal Sentence Encoder model.
16 |             Defaults to the class-level DEFAULT_MODEL (English USE): https://tfhub.dev/google/universal-sentence-encoder/4
17 |             For the multilingual version, use: https://tfhub.dev/google/universal-sentence-encoder-multilingual/3
18 |             More models are available at: https://tfhub.dev/google/collections/universal-sentence-encoder/1
19 |         """
20 |         self.model_url = model_url or self.DEFAULT_MODEL  # resolved per class, so subclasses can override it
21 |         self.encoder = self._load_model()
22 | 
23 |     def _load_model(self):
24 |         return hub.load(self.model_url)
25 | 
26 |     def encode(self, text):
27 |         return np.array(self.encoder(text))
28 | 
29 |     def encode_df(self, df: pd.DataFrame, out_path: str, user_col: str = "username", text_col: str = "text"):
30 |         users = list()
31 |         vectors = list()
32 |         counts = list()
33 | 
34 |         for user, tweets in tqdm(df.groupby(user_col)[text_col]):
35 |             try:
36 |                 vs = np.array(self.encoder(tweets.tolist()))
37 |                 users.append(user)
38 |                 vectors.append(np.mean(vs, axis=0))
39 |                 counts.append(len(tweets))
40 |             except Exception as e:
41 |                 print(user)
42 |                 print(e)
43 | 
44 |         np.savez(out_path, users=np.array(users), vectors=np.array(vectors), counts=np.array(counts),
45 |                  allow_pickle=True)
46 | 
47 | 
48 | class EncoderBERT(Encoder):
49 |     DEFAULT_MODEL = "roberta-base-nli-stsb-mean-tokens"
50 | 
51 |     def _load_model(self):
52 |         from sentence_transformers import SentenceTransformer
53 |         return SentenceTransformer(self.model_url).encode  # bound method, so self.encoder(texts) yields embeddings
54 | 


--------------------------------------------------------------------------------
/src/mutual_information.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | 
 3 | import numpy as np
 4 | import pandas as pd
 5 | import seaborn as sns
 6 | from sklearn.metrics import adjusted_mutual_info_score as ami
 7 | from tqdm import tqdm
 8 | 
 9 | 
10 | def correlate_clustering(df1, df2, metric_func, clusters_col="labels", user_col="users", **kwargs):
11 |     merged = pd.merge(df1[df1[clusters_col] >= 0], df2[df2[clusters_col] >= 0], on=user_col)
12 |     y1, y2 = merged[f"{clusters_col}_x"], merged[f"{clusters_col}_y"]  # suffixes added by the merge
13 |     return metric_func(y1, y2, **kwargs)
14 | 
15 | 
16 | def calculate_alignment_matrix(dfs, metric_func, **kwargs):
17 |     matrix = np.zeros((len(dfs), len(dfs)))
18 |     for i, df1 in tqdm(enumerate(dfs)):
19 |         for j, df2 in enumerate(dfs):
20 |             matrix[i][j] = correlate_clustering(df1, df2, metric_func, **kwargs)
21 |     return matrix
22 | 
23 | 
24 | def plot_heatmap(frames, topics, func=ami):
25 |     # Apply the figure settings before drawing so they affect this heatmap.
26 |     n = min(len(topics) * 2, 18)
27 |     sns.set(context='notebook', style='white', rc={'figure.figsize': (n, n)}, font_scale=3.5)
28 |     hm = calculate_alignment_matrix(frames, func)
29 |     hm = pd.DataFrame(hm, columns=topics, index=topics).loc[list(reversed(topics))]
30 |     fig = sns.heatmap(
31 |         hm.round(2),
32 |         annot=True,
33 |         cmap="Blues",
34 |         annot_kws={"size": 30},
35 |     )
36 |     fig.set_yticklabels(labels=list(reversed(topics)), rotation=45)
37 |     fig.set_xticklabels(labels=topics, rotation=45)
38 |     return fig
39 | 
40 | 
41 | def mutual_information(topics, root="topicals"):
42 |     frames = list()
43 |     for topic in tqdm(topics):
44 |         f = np.load(os.path.join(root, f"{topic}.npz"))  # no leading slash, which would discard `root`
45 |         users = f["users"]
46 |         clusters = f["clusters"]
47 |         frames.append(pd.DataFrame({"users": users, "labels": clusters}))
48 | 
49 |     return plot_heatmap(frames, topics)


--------------------------------------------------------------------------------
/src/preprocessing.py:
--------------------------------------------------------------------------------
 1 | import re
 2 | 
 3 | import preprocessor as p
 4 | 
 5 | p.set_options(p.OPT.URL, p.OPT.MENTION)
 6 | 
 7 | 
 8 | def camel_case_split(identifier):
 9 |     matches = re.finditer('.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
10 |     return [m.group(0) for m in matches]
11 | 
12 | 
13 | def clean(text):
14 |     text = p.clean(text)
15 |     text = re.sub(r'^RT ', '', text)
16 |     text = ' '.join(camel_case_split(text))
17 |     text = re.sub(r'\W+', ' ', text)
18 |     text = re.sub(r"\d+", "number", text)
19 |     if len(text.strip().split()) < 3:
20 |         return None
21 |     return text.lower().strip()
22 | 
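
The cleaning steps in `clean` can be traced on a single tweet. The sketch below repeats the regular-expression steps only and leaves out the `p.clean` call from the `preprocessor` package, so URLs and mentions are not removed here. The helper name `clean_without_preprocessor` and the sample tweet are hypothetical.

```python
import re


def camel_case_split(identifier):
    # Split at lower->Upper boundaries and before the last capital of an acronym.
    matches = re.finditer('.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]


def clean_without_preprocessor(text):
    text = re.sub(r'^RT ', '', text)          # drop the retweet marker
    text = ' '.join(camel_case_split(text))   # "#VoteNow" -> "#Vote Now"
    text = re.sub(r'\W+', ' ', text)          # punctuation and symbols become spaces
    text = re.sub(r'\d+', 'number', text)     # every run of digits becomes "number"
    if len(text.strip().split()) < 3:
        return None                           # too short to carry a stance signal
    return text.lower().strip()


cleaned = clean_without_preprocessor("RT #VoteNow for the 2020 election!")
print(cleaned)  # vote now for the number election
```

Texts with fewer than three remaining tokens return `None`, so callers are expected to drop those rows before encoding.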


--------------------------------------------------------------------------------
/src/projection.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | from itertools import product
  3 | 
  4 | import matplotlib.pyplot as plt
  5 | import numpy as np
  6 | import pandas as pd
  7 | import seaborn as sns
  8 | from tqdm import tqdm
  9 | from umap import UMAP
 10 | 
 11 | 
 12 | class Projector:
 13 |     DEFAULT_UMAP_PARAMS = dict(
 14 |         n_components=2,
 15 |         min_dist=0,
 16 |         n_neighbors=90,
 17 |         metric="cosine",
 18 |         random_state=42
 19 |     )
 20 |     DEFAULT_DIST_RANGE = [0.0, 0.1, 0.25, 0.5, 0.75, 0.8, 0.9, 0.99]
 21 |     DEFAULT_NEIGHBORS_RANGE = [20, 30, 40, 50, 60, 70, 80, 90, 100]
 22 | 
 23 |     def __init__(self, vectors_path):
 24 |         self.vectors_path = vectors_path
 25 |         self.users, self.vectors, self.counts = self._load_vectors(vectors_path)
 26 | 
 27 |     @staticmethod
 28 |     def _load_vectors(vectors_path):
 29 |         file = np.load(vectors_path)
 30 |         users: np.ndarray = file['users']
 31 |         vectors: np.ndarray = file['vectors']
 32 |         counts: np.ndarray = file['counts']
 33 |         return users, vectors, counts
 34 | 
 35 |     @staticmethod
 36 |     def _project(vectors, **kwargs):
 37 |         return UMAP(**kwargs).fit_transform(vectors)
 38 | 
 39 |     def project(self, out_path, min_counts=3, **kwargs):
 40 |         params = self.DEFAULT_UMAP_PARAMS.copy()
 41 |         params.update(kwargs)
 42 | 
 43 |         ind = self.counts >= min_counts
 44 |         users = self.users[ind]
 45 |         vectors = self.vectors[ind]
 46 | 
 47 |         standard_embeddings = self._project(
 48 |             vectors=vectors,
 49 |             **params
 50 |         )
 51 |         np.savez(open(out_path, 'wb'),
 52 |                  umap=standard_embeddings, users=users)
 53 | 
 54 |     @staticmethod
 55 |     def plot_grid_search_heatmap(results, heatmap_destination="temp.png"):
 56 |         hm = pd.DataFrame(results)
 57 |         hm.index = hm.index.astype(int)
 58 |         hm = hm.sort_index(ascending=False)
 59 |         x = sorted(hm.columns)
 60 |         hm.index.name = "n_neighbors"
 61 |         hm.columns.name = "min_dist"
 62 | 
 63 |         sns.set(context='notebook', style='white', rc={'figure.figsize': (len(hm.columns) * 2, len(hm) * 2)},
 64 |                 font_scale=2.5)  # figsize is (width, height): columns set the width, rows the height
 65 | 
 66 |         sns.heatmap(hm[x], annot=True, cmap="Blues", annot_kws={"size": 30}, vmin=0.3, vmax=1, cbar=False)
 67 |         plt.savefig(heatmap_destination, bbox_inches='tight')
 68 | 
 69 |     @staticmethod
 70 |     def plot(embeddings, labels):
 71 |         fig = plt.figure()
 72 |         ax = fig.add_subplot(111)
 73 |         scatter = plt.scatter(embeddings[:, 0], embeddings[:, 1],
 74 |                               c=labels, s=0.1, cmap='Spectral')
 75 |         return scatter
 76 | 
 77 |     def grid_search(self, trials_dir, min_dists_range=DEFAULT_DIST_RANGE, n_neighbors_range=DEFAULT_NEIGHBORS_RANGE,
 78 |                     n_components=2,
 79 |                     metric="cosine", min_counts=3,
 80 |                     skip_existing=True, verbose=False):
 81 |         ind = self.counts >= min_counts
 82 |         users = self.users[ind]
 83 |         vectors = self.vectors[ind]
 84 | 
 85 |         umap_params = list(product(min_dists_range, n_neighbors_range))
 86 |         for min_dist, n in tqdm(umap_params, desc="UMAP"):
 87 |             if verbose:
 88 |                 print(f"{min_dist}_{n}")
 89 |             out_path = os.path.join(trials_dir, f"{min_dist}_{n}.npz")
 90 |             if os.path.isfile(out_path) and skip_existing:
 91 |                 continue
 92 | 
 93 |             standard_embeddings = self._project(
 94 |                 vectors=vectors,
 95 |                 random_state=42,
 96 |                 n_components=n_components,
 97 |                 n_neighbors=n,
 98 |                 min_dist=min_dist,
 99 |                 metric=metric
100 |             )
101 |             np.savez(open(out_path, 'wb'),
102 |                      umap=standard_embeddings, users=users)
103 | 


--------------------------------------------------------------------------------
/src/top_terms.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import re
  3 | from collections import Counter
  4 | 
  5 | import matplotlib.pyplot as plt
  6 | import numpy as np
  7 | import pandas as pd
  8 | from PIL import Image
  9 | from ar_wordcloud import ArabicWordCloud
 10 | from joblib import Parallel, delayed
 11 | from tqdm.notebook import tqdm
 12 | from wordcloud import STOPWORDS, WordCloud
 13 | 
 14 | 
 15 | def get_word_counts(file, text_col=None):
 16 |     res = {}
 17 |     tfg = 0
 18 |     pbar = tqdm(desc=file.split("/")[-1])
 19 |     with open(file) as f:
 20 |         for i, l in enumerate(f, 1):
 21 |             l = l.replace('.', '').strip().lower()
 22 |             if text_col is not None:
 23 |                 l = l.split('\t')[text_col]
 24 |             for w in l.split():
 25 |                 if len(w) <= 2:
 26 |                     continue
 27 |                 res.setdefault(w, 0)
 28 |                 res[w] += 1
 29 |                 tfg += 1
 30 |             if i % 10_000 == 0:
 31 |                 pbar.update(10_000)  # update() takes an increment, not the running total
 32 |     return res, tfg
 33 | 
 34 | 
 35 | def count_words_csv(text_series):
 36 |     counter = Counter()
 37 |     text_series.apply(lambda x: counter.update(x.lower().strip().split()))
 38 |     return dict(counter), sum(counter.values())
 39 | 
 40 | 
 41 | def valence_step(tfe1, tfg1, tfe2, tfg2, out, e):
 42 |     a = tfe1 / tfg1
 43 |     b = tfe2 / tfg2
 44 |     v1 = 2 * (a / (a + b)) - 1
 45 |     if v1 >= 0.8:
 46 |         out.write(f"{v1 * np.log(tfe1)}\t{e}\t{v1}\t{tfe1}\n")
 47 | 
 48 | 
 49 | def sort_scores(file):
 50 |     pd.read_csv(
 51 |         file, sep='\t', names=["score", "term", "valence", "frequency"]
 52 |     ).sort_values(
 53 |         "score", ascending=False
 54 |     ).to_csv(
 55 |         file.replace("txt", "tsv"), sep='\t', index=None
 56 |     )
 57 |     os.remove(file)
 58 | 
 59 | 
 60 | def valence(tf1, tfg1, tf2, tfg2, out):
 61 |     with open(out, 'w') as o:
 62 |         Parallel(n_jobs=-1, backend='threading')(
 63 |             delayed(valence_step)(
 64 |                 tfe, tfg1, 0 if e not in tf2 else tf2[e], tfg2, o, e
 65 |             ) for e, tfe in tf1.items() if len(e) > 2
 66 |         )
 67 |     print("Sorting terms")
 68 |     sort_scores(out)
 69 | 
 70 | 
 71 | def pipeline(df1, df2, out1, out2=None, text_col='text'):
 72 |     print("Counting terms...")
 73 |     (tf1, tfg1), (tf2, tfg2) = Parallel(n_jobs=2, backend='threading')(
 74 |         delayed(count_words_csv)(df[text_col]) for df in [df1, df2])
 75 |     del df1, df2
 76 |     print("Calculating valence for group 1 ...")
 77 |     valence(tf1, tfg1, tf2, tfg2, out1)
 78 |     if out2 is not None:
 79 |         print("Calculating valence for group 2 ...")
 80 |         valence(tf2, tfg2, tf1, tfg1, out2)
 81 | 
 82 | 
 83 | def plot_worcloud(file, mask_path=None, arabic=False):
 84 |     params = dict(width=800, height=800,
 85 |                   background_color='white',
 86 |                   min_font_size=10)
 87 |     if mask_path is not None:
 88 |         params["mask"] = np.array(Image.open(mask_path))
 89 | 
 90 |     scores = pd.read_csv(file, sep='\t').dropna()
 91 |     is_en = lambda x: bool(re.search('[a-z]', x.lower()))
 92 |     if arabic:
 93 |         import pickle
 94 |         params['stopwords'] = pickle.load(open('AR_STOPWORDS.pkl', 'rb'))
 95 |         scores = scores[~scores.term.apply(is_en)]
 96 |     else:
 97 |         params['stopwords'] = set(STOPWORDS)
 98 |         scores = scores[scores.term.apply(is_en)]
 99 |     scores = scores[:500].set_index("term").to_dict()["score"]
100 |     if arabic:
101 |         wordcloud = ArabicWordCloud(**params)
102 |         fig = wordcloud.from_dict(scores)
103 |     else:
104 |         wordcloud = WordCloud(**params)
105 |         fig = wordcloud.generate_from_frequencies(scores)
106 |     plt.figure(figsize=(8, 8), facecolor=None)
107 |     plt.imshow(wordcloud)
108 |     plt.axis("off")
109 |     plt.tight_layout(pad=0)
110 |     plt.savefig(f"{file}.png")
111 | 
112 | 
113 | def calculate_top_terms(clusters_path, tweets_path, prefix, user_col, text_col, use_clusters=True, mask_path=None):
114 |     enf = np.load(clusters_path)
115 |     df = pd.read_pickle(tweets_path)
116 |     users, clusters = enf["users"], enf["clusters"]
117 |     if use_clusters:
118 |         labels = dict(zip(users, clusters))
119 |         ind = clusters >= 0
120 |     else:
121 |         y = np.array(
122 |             [1 if re.search("(lfc)|(liverpool)", x.lower()) else 0 if re.search("(cfc)|(chelsea)", x.lower()) else -1
123 |              for x in enf["users"]])
124 |         ind = y >= 0
125 |         labels = dict(zip(users, y))
126 |     df = df[df[user_col].apply(lambda x: x in labels)]
127 |     df = df.assign(label=df[user_col].apply(lambda x: labels[x]))
128 | 
129 |     o1 = os.path.join("terms", f"{prefix}.0.txt")
130 |     o2 = os.path.join("terms", f"{prefix}.1.txt")
131 |     pipeline(df[df.label == 0], df[df.label == 1],
132 |              out1=o1,
133 |              out2=o2,
134 |              text_col=text_col
135 |              )
136 | 
137 |     for o in (o1, o2):
138 |         plot_worcloud(o.replace("txt", "tsv"), mask_path=mask_path)  # sort_scores() rewrote each .txt as .tsv
139 | 
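
The valence score computed in `valence_step` can be checked by hand. The sketch below is a standalone restatement using `math.log` instead of `np.log`; the function name `valence_score` and the example counts are hypothetical. For a term with frequency `tf_in` in one group and `tf_out` in the other, it normalizes each by the group's total token count, maps the ratio to the range [-1, 1], and keeps the term only when the valence reaches the 0.8 threshold used above.

```python
import math


def valence_score(tf_in, total_in, tf_out, total_out, threshold=0.8):
    """Return (valence, score) for a term, or None if the valence is below the threshold."""
    a = tf_in / total_in      # relative frequency inside the group
    b = tf_out / total_out    # relative frequency in the other group
    valence = 2 * (a / (a + b)) - 1
    if valence < threshold:
        return None
    # The log term favors frequent terms among those with similar valence.
    return valence, valence * math.log(tf_in)


exclusive = valence_score(tf_in=90, total_in=1000, tf_out=0, total_out=1000)
shared = valence_score(tf_in=50, total_in=1000, tf_out=50, total_out=1000)
```

A term used only by one group gets valence 1 and a score of `log(90)` here, while a term used equally by both groups gets valence 0 and is filtered out.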


--------------------------------------------------------------------------------
/src/turkish_normalizer.py:
--------------------------------------------------------------------------------
 1 | # get zemberek from https://github.com/ahmetaa/zemberek-nlp
 2 | 
 3 | from os.path import join
 4 | 
 5 | from jpype import JClass, JString, getDefaultJVMPath, startJVM
 6 | 
 7 | ZEMBEREK_PATH: str = join('zemberek', 'bin', 'zemberek-full.jar')
 8 | 
 9 | startJVM(
10 |     getDefaultJVMPath(),
11 |     '-ea',
12 |     f'-Djava.class.path={ZEMBEREK_PATH}',
13 |     convertStrings=False
14 | )
15 | 
16 | TurkishMorphology: JClass = JClass('zemberek.morphology.TurkishMorphology')
17 | TurkishSentenceNormalizer: JClass = JClass(
18 |     'zemberek.normalization.TurkishSentenceNormalizer'
19 | )
20 | Paths: JClass = JClass('java.nio.file.Paths')
21 | 
22 | normalizer = TurkishSentenceNormalizer(
23 |     TurkishMorphology.createWithDefaults(),
24 |     Paths.get(
25 |         join('zemberek', 'data', 'normalization')
26 |     ),
27 |     Paths.get(
28 |         join('zemberek', 'data', 'lm', 'lm.2gram.slm')
29 |     )
30 | )
31 | 
32 | 
33 | def normalize(text):
34 |     return str(normalizer.normalize(JString(text)))
35 | 
36 | 
37 | def normalize_df(df, text_col):
38 |     df[text_col] = df[text_col].apply(normalize)
39 |     return df
40 | 


--------------------------------------------------------------------------------
/trials/0.0_30.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/trials/0.0_30.png


--------------------------------------------------------------------------------
/trials/0.1_60.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/trials/0.1_60.png


--------------------------------------------------------------------------------
/trials/hm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/trials/hm.png


--------------------------------------------------------------------------------
/wc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/wc.png


--------------------------------------------------------------------------------