├── .gitignore
├── LICENSE
├── Presentation.pdf
├── Presentation.pptx
├── README.md
├── ami.png
├── clusterUsersUniversalSentenceEncoder.py
├── demo.ipynb
├── ed.png
├── methodology_diagram.png
├── src
    ├── AR_STOPWORDS.pkl
    ├── __pycache__
    │   ├── clustering.cpython-36.pyc
    │   ├── encoder.cpython-36.pyc
    │   ├── preprocessing.cpython-36.pyc
    │   ├── projection.cpython-36.pyc
    │   └── top_terms.cpython-36.pyc
    ├── clustering.py
    ├── encoder.py
    ├── mutual_information.py
    ├── preprocessing.py
    ├── projection.py
    ├── top_terms.py
    └── turkish_normalizer.py
├── trials
    ├── 0.0_30.png
    ├── 0.1_60.png
    └── hm.png
└── wc.png

/.gitignore:
--------------------------------------------------------------------------------
1 | .idea/
2 | .ipynb_checkpoints/
3 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2025 Ammar Rashed
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Presentation.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/Presentation.pdf -------------------------------------------------------------------------------- /Presentation.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/Presentation.pptx -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Embeddings-Based Unsupervised Stance Detection 2 | 3 | This repository contains the implementation of an unsupervised method for target-specific stance detection using embeddings-based clustering, as presented in our ICWSM 2021 paper. 
4 | 5 | ## Publications 6 | 7 | - **Paper (ICWSM'21)**: [Embeddings-Based Clustering for Target Specific Stances: The Case of a Polarized Turkey](https://ojs.aaai.org/index.php/ICWSM/article/view/18082) 8 | - **Paper Presentation**: [PaperTalk ICWSM'21](https://papertalk.org/papertalks/31537) 9 | - **Thesis (MSc August 2020)**: [Embeddings-Based Clustering For Target Specific Stances](https://tez.yok.gov.tr/UlusalTezMerkezi/TezGoster?key=fl0Kw4p1rmMDotyKRdYv1AZv-bsnninllPXAXfoe9S1sXEDBPXspE5WeUtqcCjlk) 10 | 11 | ## Overview 12 | 13 | We propose an unsupervised method for stance detection that can capture fine-grained divergences across various topics in polarized communities. Our approach overcomes the limitations of previous methods by: 14 | 15 | - Not requiring platform-specific features (like retweets) 16 | - Working effectively with limited data 17 | - Supporting hierarchical clustering without specifying the number of clusters 18 | - Using pre-trained language models to handle morphologically rich languages 19 | 20 | ## Methodology 21 | 22 | The method consists of five main steps: 23 | 24 | 1. **Data Collection**: Collect tweets related to specific topics or targets 25 | 2. **Feature Extraction**: Encode tweets using pre-trained universal sentence encoders 26 | 3. **User Representation**: Average tweet vectors per user to create user embeddings 27 | 4. **Projection**: Project user vectors to lower dimensional space using UMAP 28 | 5. **Clustering**: Cluster the projected vectors using HDBSCAN 29 | 30 | ![Methodology Diagram](methodology_diagram.png) 31 | 32 | ## Key Features 33 | 34 | ### Fine-grained Stance Detection 35 | 36 | Our method can automatically detect stances down to the party-affiliation level in a completely unsupervised manner, outperforming previous approaches. 
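As a rough illustration of step 3 in the Methodology list above (User Representation), the sketch below averages tweet vectors per user and skips users with too few tweets, mirroring the `min_tweets` filter in `cluster_users`. It is a minimal pure-Python sketch, not the repository's implementation: `average_user_vectors` and `toy_encode` are hypothetical names, and `toy_encode` is a stand-in for a real sentence encoder, so the vectors it produces carry no semantic meaning.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]


def average_user_vectors(
    rows: Sequence[Tuple[str, str]],
    encode: Callable[[List[str]], List[Vector]],
    min_tweets: int = 3,
) -> Dict[str, Vector]:
    """Group (user, tweet) rows by user and average each user's tweet vectors."""
    tweets_by_user: Dict[str, List[str]] = defaultdict(list)
    for user, tweet in rows:
        tweets_by_user[user].append(tweet)

    user_vectors: Dict[str, Vector] = {}
    for user, tweets in tweets_by_user.items():
        if len(tweets) < min_tweets:
            continue  # too few tweets to represent this user
        vectors = encode(tweets)
        dim = len(vectors[0])
        # Component-wise mean over all of the user's tweet vectors.
        user_vectors[user] = [
            sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)
        ]
    return user_vectors


def toy_encode(tweets: List[str]) -> List[Vector]:
    """Hypothetical stand-in encoder: [character count, word count] per tweet."""
    return [[float(len(t)), float(len(t.split()))] for t in tweets]
```

In the actual pipeline the encoder would be a pre-trained universal sentence encoder, and the resulting user vectors would then be projected with UMAP and clustered with HDBSCAN (steps 4 and 5).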
37 |
38 | ![Fine-grained Stance Detection](ed.png)
39 |
40 | ### Cross-Topic Mutual Information
41 |
42 | Using our clustering method, we can analyze the correlations between user stances across different topics, allowing for deeper insight into the structure of polarization.
43 |
44 | ![Mutual Information Heatmap](ami.png)
45 |
46 | ### Semantic Analysis Between Clusters
47 |
48 | We identify the most prominent terms in each cluster to show how different groups talk about the same issues in different contexts, revealing semantic divergences between polarized groups.
49 |
50 | ![Word Clouds of Prominent Terms](wc.png)
51 |
52 | ## Performance
53 |
54 | Our method achieves:
55 | - 90% precision in identifying user stances
56 | - Over 80% recall
57 | - Competitive performance with supervised methods, while being completely unsupervised
58 | - Ability to detect fine-grained sub-groups that previous methods couldn't identify
59 |
60 | ## Installation
61 |
62 | ```bash
63 | # Clone this repository
64 | git clone https://github.com/AmmarRashed/UnsupervisedStanceDetection.git
65 | cd UnsupervisedStanceDetection
66 |
67 | # Create and activate a virtual environment (recommended)
68 | python -m venv venv
69 | source venv/bin/activate # On Windows: venv\Scripts\activate
70 |
71 | # Install dependencies (the repository tree contains no requirements.txt; versions are those linked under Requirements)
72 | pip install umap-learn==0.3.10 hdbscan==0.8.26 tensorflow-hub==0.8.0 tensorflow-text==2.2.1 matplotlib numpy pandas tqdm
73 | ```
74 |
75 | ### Requirements
76 |
77 | > **Note**: This work was tested using specific versions of packages. Newer versions might not work as expected.
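One way to check a local environment against the tested versions is sketched here. The version prefixes are copied from the Requirements list in this section; the helper names `installed_version` and `check_environment` are hypothetical, and the sketch assumes Python 3.8 or newer, where `importlib.metadata` is part of the standard library.

```python
from importlib import metadata
from typing import Dict, Optional

# Version prefixes taken from the Requirements list in this README.
TESTED_PREFIXES: Dict[str, str] = {
    "umap-learn": "0.3.",
    "hdbscan": "0.8.",
    "tensorflow-hub": "0.8.",
    "tensorflow-text": "2.2.",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(prefixes: Dict[str, str] = TESTED_PREFIXES) -> Dict[str, str]:
    """Report, per package, whether it is missing, matches the tested prefix, or differs."""
    report: Dict[str, str] = {}
    for package, prefix in prefixes.items():
        version = installed_version(package)
        if version is None:
            report[package] = "missing"
        elif version.startswith(prefix):
            report[package] = "ok"
        else:
            report[package] = f"untested version {version}"
    return report
```

Calling `check_environment()` and printing the returned dictionary gives a quick per-package status before running the clustering script.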
78 |
79 | - [umap-learn 0.3.x](https://pypi.org/project/umap-learn/0.3.10/)
80 | - [hdbscan 0.8.x](https://pypi.org/project/hdbscan/0.8.26/)
81 | - [tensorflow-hub 0.8.x](https://pypi.org/project/tensorflow-hub/0.8.0/)
82 | - [tensorflow-text 2.2.x](https://pypi.org/project/tensorflow-text/2.2.1/)
83 | - matplotlib
84 | - numpy
85 | - pandas
86 | - tqdm
87 |
88 | ## Usage
89 |
90 | ```bash
91 | # Basic usage example
92 | python clusterUsersUniversalSentenceEncoder.py your_data.tsv
93 | ```
94 |
95 | The input file should be a tab-separated file with:
96 | - First column: UserIDs
97 | - Second column: Tweets
98 |
99 | ### Code Sample
100 |
101 | ```python
102 | from clusterUsersUniversalSentenceEncoder import cluster_users, plot_clusters_no_labels
103 | import tensorflow_hub as hub
104 | import pandas as pd
105 |
106 | # Load the universal sentence encoder
107 | embed = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')
108 |
109 | # Load and prepare your data
110 | df_text = pd.read_csv('your_data.tsv', header=None, usecols=[0, 1], sep='\t')
111 | df_text.columns = ['User', 'Text']
112 | df_text = df_text.apply(lambda s: s.str.strip())
113 |
114 | # Cluster users based on their tweets
115 | cluster_users(df_text, embed, user_col='User', tweet_col='Text', save_at='results.npz')
116 |
117 | # Visualize the clusters
118 | plot_clusters_no_labels('results.npz.cluster')
119 | ```
120 |
121 | ## Customization Options
122 |
123 | The method can be customized with different parameters:
124 |
125 | - **Sentence Encoder**: Different pre-trained models can be used (multilingual, transformer-based, etc.)
126 | - **UMAP Parameters**: Adjust `min_dist` and `n_neighbors` to control projection characteristics 127 | - **HDBSCAN Parameters**: Modify `min_cluster_size` and `min_samples` to control clustering sensitivity 128 | 129 | ## Applications 130 | 131 | This method has been successfully applied to: 132 | - Political polarization analysis 133 | - Election stance detection 134 | - Sports fan sentiment analysis 135 | - Cross-cultural stance detection 136 | 137 | ## Citation 138 | 139 | If you use this code in your research, please cite our paper: 140 | 141 | ``` 142 | Rashed, A., Kutlu, M., Darwish, K., Elsayed, T., & Bayrak, C. (2021). Embeddings-Based Clustering for Target Specific Stances: The Case of a Polarized Turkey. Proceedings of the International AAAI Conference on Web and Social Media, 15(1), 537-548. https://doi.org/10.1609/icwsm.v15i1.18082 143 | ``` 144 | 145 | BibTeX format: 146 | 147 | ```bibtex 148 | @article{rashed2021embeddings, 149 | title={Embeddings-Based Clustering for Target Specific Stances: The Case of a Polarized Turkey}, 150 | author={Rashed, Ammar and Kutlu, Mucahid and Darwish, Kareem and Elsayed, Tamer and Bayrak, Cansın}, 151 | journal={Proceedings of the International AAAI Conference on Web and Social Media}, 152 | volume={15}, 153 | number={1}, 154 | pages={537--548}, 155 | year={2021}, 156 | doi={10.1609/icwsm.v15i1.18082} 157 | } 158 | ``` 159 | 160 | ## Contributing 161 | 162 | Contributions are welcome! Please feel free to submit a Pull Request. 163 | 164 | ## License 165 | 166 | This project is licensed under the MIT License - see the LICENSE file for details. 
167 | 168 | ## Contact 169 | 170 | - Ammar Rashed (ammar.rasid@ozu.edu.tr) 171 | - Kareem Darwish (kdarwish@hbku.edu.qa) 172 | -------------------------------------------------------------------------------- /ami.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/ami.png -------------------------------------------------------------------------------- /clusterUsersUniversalSentenceEncoder.py: -------------------------------------------------------------------------------- 1 | ############################################################################### 2 | # Code written by Ammar Rashid (Özyeğin University) 3 | # ammar.rasid@ozu.edu.tr 4 | # and modified by Kareem Darwish (Qatar Computing Research Institute) 5 | # kdarwish@hbku.edu.qa 6 | # The code is provided for research purposes ONLY 7 | ############################################################################### 8 | 9 | ############################################################################### 10 | # sys.argv[1] is a tab separated file with first column containing UserIDs 11 | # and second column containing tweets 12 | ############################################################################### 13 | # there are many options for the universal sentence encoder including multilingual 14 | # models, Transformer model (slow), and CNN model (fast) 15 | # check out: https://tfhub.dev/google/universal-sentence-encoder/1 16 | # for options 17 | ############################################################################### 18 | 19 | import ntpath 20 | import sys 21 | from typing import Callable 22 | 23 | import matplotlib.pyplot as plt 24 | import numpy as np 25 | import pandas as pd 26 | import tensorflow_hub as hub 27 | from hdbscan import HDBSCAN 28 | from tqdm import tqdm 29 | from umap import UMAP 30 | 31 | 32 | def cluster_users(df, encoder: Callable, 
min_tweets=3, user_col="username", 33 | tweet_col="norm_tweet", save_at="temp.npz", 34 | min_dist=0.0, n_neighbors=90, **kwargs): 35 | gs = df.groupby(user_col) 36 | users = list() 37 | vectors = list() 38 | for user, frame in tqdm(gs): 39 | if len(frame) < min_tweets: 40 | continue 41 | try: 42 | tweets = frame[tweet_col] 43 | vec = np.mean(np.array(encoder(tweets.tolist())), axis=0) 44 | users.append(user) 45 | vectors.append(vec) 46 | except Exception as e: 47 | print(f"ERROR at:{user}") 48 | print(e) 49 | print() 50 | 51 | users: np.ndarray = users 52 | vectors: np.ndarray = vectors 53 | 54 | standard_embeddings = UMAP( 55 | random_state=42, 56 | n_components=2, 57 | n_neighbors=n_neighbors, 58 | min_dist=min_dist, 59 | metric='cosine', **kwargs 60 | ).fit_transform(vectors) 61 | print("Projection complete") 62 | 63 | params = dict() 64 | 65 | clusterer = cluster_embeddings(standard_embeddings, **kwargs) 66 | params['clusters'] = clusterer.labels_ 67 | params["allow_pickle"] = True 68 | np.savez(open(save_at + '.cluster', 'wb'), users=np.array(users), vectors=np.array(vectors), 69 | umap=np.array(standard_embeddings), clusters=np.array(clusterer.labels_)) 70 | 71 | output_file = open(save_at + '.clusters.txt', mode='w') 72 | for i in range(len(clusterer.labels_)): 73 | output_file.write(str(users[i]) + '\t' + str(clusterer.labels_[i]) + '\n') 74 | output_file.close() 75 | 76 | 77 | def plot_clusters_no_labels(embeddings_path, clusters_col="clusters", green_label="pro", red_label='anti', align=False, 78 | title=None, include_ratio=True, labeled_only=False): 79 | if title is None: 80 | title = ntpath.basename(embeddings_path).split('.')[0] 81 | f = np.load(embeddings_path) 82 | users = f["users"] 83 | clusters = f[clusters_col] 84 | cluster_ratio = round(sum(clusters >= 0) * 100 / len(clusters), 2) 85 | em = f["umap"] 86 | 87 | ind = clusters >= 0 88 | users = users[ind] 89 | clusters = clusters[ind] 90 | em = em[ind, :] 91 | c = ['red', 'blue', 'green', 'black', 
'orange', 'teal']
92 |     if align:
93 |         d = align_clusters_with_labels(
94 |             pd.DataFrame({"username": users, "clusters": clusters})
95 |         )
96 |         c = ['red', 'blue', 'green', 'black', 'orange', 'teal', 'olive', 'yellow']
97 |     else:
98 |         labels_dict = {}
99 |
100 |     cmap = list()
101 |     for i in range(len(clusters)):
102 |         cmap.append(c[clusters[i] % len(c)])  # wrap around so every cluster id maps to a colour
103 |
104 |     fig = plt.figure()
105 |     ax = fig.add_subplot(111)
106 |     scatter = plt.scatter(em[:, 0], em[:, 1], c=cmap,
107 |                           s=0.5, cmap='Spectral')
108 |     ax.set_title(title, fontsize=22)
109 |     plt.show()
110 |     return scatter
111 |
112 |
113 | def align_clusters_with_labels(df, allow_multiple_clusters=True):
114 |     df = df[df.clusters >= 0]
115 |     g = df.groupby(["label", "clusters"]).count().sort_values("username", ascending=False)
116 |
117 |     d = {}
118 |     while len(g) > 0:
119 |         label, cluster = g.index[0]
120 |         d[cluster] = label
121 |         g = g.reset_index()
122 |         g = g[(g.label != label) & (g.clusters != cluster)] \
123 |             .set_index(["label", "clusters"]) \
124 |             .sort_values("username", ascending=False)
125 |     unlabeled_clusters = set(df.clusters) - set(d.keys())
126 |     if allow_multiple_clusters and len(unlabeled_clusters) > 0:
127 |         g = df.groupby(["label", "clusters"]).count().sort_values("username", ascending=False).reset_index()
128 |         for c in unlabeled_clusters:
129 |             l = g.set_index("clusters").loc[c].label
130 |             if isinstance(l, pd.Series):
131 |                 l = l.iloc[0]
132 |             d[c] = l
133 |
134 |             g = g[g.clusters != c]
135 |
136 |     return d
137 |
138 |
139 | def cluster_embeddings(standard_embedding,
140 |                        min_cluster_size=None,
141 |                        min_samples=None,
142 |                        plot_tree=False,
143 |                        min_samples_div=1000,
144 |                        min_cluster_size_div=100,
145 |                        **kwargs):
146 |     if min_cluster_size is None:
147 |         min_cluster_size = max(10, len(standard_embedding) // min_cluster_size_div)
148 |     if min_samples is None:
149 |         min_samples = max(10, len(standard_embedding) // min_samples_div)
150 |     clusterer = HDBSCAN(
151 |         min_samples=min_samples,
152
| min_cluster_size=min_cluster_size, **kwargs 153 | ).fit(standard_embedding) 154 | if plot_tree: 155 | clusterer.condensed_tree_.plot() 156 | # return clusterer.labels_, clusterer.condensed_tree_ 157 | return clusterer 158 | 159 | 160 | if __name__ == "__main__": 161 | embed = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4') # You can use different encoders here 162 | 163 | inputFile = sys.argv[1] # ex. trump.tsv 164 | df_text = pd.read_csv(inputFile, header=None, usecols=[0, 1], error_bad_lines=False, sep='\t') 165 | df_text.columns = ['User', 'Text'] 166 | df_text = df_text.apply(lambda s: s.str.strip()) 167 | cluster_users(df_text, embed, user_col='User', tweet_col='Text', save_at=inputFile + '.npz') 168 | plot_clusters_no_labels(inputFile + '.npz.cluster') 169 | -------------------------------------------------------------------------------- /demo.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Preprocessing\n", 8 | "- Remove URLs and Mentions\n", 9 | "- Separate composite camel case words (e.g. 
BlackLives --> black lives)\n", 10 | "- Remove non-alphanumeric characters\n", 11 | "- Replace numbers with the token \"_number_\"\n", 12 | "- Lowercase everything" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 1, 18 | "metadata": { 19 | "ExecuteTime": { 20 | "end_time": "2020-06-07T23:45:10.994786Z", 21 | "start_time": "2020-06-07T23:45:10.990361Z" 22 | } 23 | }, 24 | "outputs": [], 25 | "source": [ 26 | "from src.preprocessing import clean" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 3, 32 | "metadata": { 33 | "ExecuteTime": { 34 | "end_time": "2020-06-07T23:46:12.708646Z", 35 | "start_time": "2020-06-07T23:46:12.680332Z" 36 | } 37 | }, 38 | "outputs": [ 39 | { 40 | "data": { 41 | "text/plain": [ 42 | "'black lives matter'" 43 | ] 44 | }, 45 | "execution_count": 3, 46 | "metadata": {}, 47 | "output_type": "execute_result" 48 | } 49 | ], 50 | "source": [ 51 | "clean(\"#BlackLivesMatter https://www.google.com/ @realDonaldTrump\")" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "# Encoding\n", 59 | "## Universal Sentence Encoder" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 1, 65 | "metadata": { 66 | "ExecuteTime": { 67 | "end_time": "2020-06-07T23:55:34.962132Z", 68 | "start_time": "2020-06-07T23:55:33.625304Z" 69 | } 70 | }, 71 | "outputs": [], 72 | "source": [ 73 | "from src.encoder import Encoder" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 2, 79 | "metadata": { 80 | "ExecuteTime": { 81 | "end_time": "2020-06-07T23:55:40.034473Z", 82 | "start_time": "2020-06-07T23:55:37.473873Z" 83 | } 84 | }, 85 | "outputs": [], 86 | "source": [ 87 | "# default encoder is USE for English only.\n", 88 | "# But you can use multilingual as well, like ...\n", 89 | "encoder = Encoder(model_url=\"https://tfhub.dev/google/universal-sentence-encoder-multilingual/3\")" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | 
"execution_count": 5, 95 | "metadata": { 96 | "ExecuteTime": { 97 | "end_time": "2020-06-07T23:55:57.481777Z", 98 | "start_time": "2020-06-07T23:55:57.448134Z" 99 | } 100 | }, 101 | "outputs": [ 102 | { 103 | "data": { 104 | "text/plain": [ 105 | "array([[ 9.02929604e-02, 2.53139641e-02, -8.63993599e-04,\n", 106 | " 3.37017924e-02, -6.26476333e-02, -4.42366041e-02,\n", 107 | " 2.18325537e-02, 5.37963435e-02, -8.38939548e-02,\n", 108 | " -9.51755140e-03, -3.12121455e-02, -5.35302460e-02,\n", 109 | " -4.03429270e-02, -6.45988435e-02, -4.22783829e-02,\n", 110 | " 6.87545631e-03, 2.68412735e-02, 1.69232395e-02,\n", 111 | " 4.52055521e-02, -7.21441209e-02, 7.80028552e-02,\n", 112 | " 7.60580525e-02, -4.91863601e-02, -3.33283916e-02,\n", 113 | " -6.48475764e-03, 5.31073436e-02, 5.94470128e-02,\n", 114 | " 4.97598015e-02, -5.83836809e-02, 4.62118129e-04,\n", 115 | " -2.54417248e-02, -4.07968946e-02, 2.24085082e-03,\n", 116 | " -5.71764819e-02, 3.96157652e-02, -5.56416325e-02,\n", 117 | " 1.06351763e-01, -2.11038422e-02, -4.97004427e-02,\n", 118 | " 1.37671484e-02, 2.52124630e-02, 6.93862326e-03,\n", 119 | " -8.78239796e-03, -4.25839275e-02, -7.41932988e-02,\n", 120 | " 3.93395983e-02, -5.14756478e-02, -4.80900072e-02,\n", 121 | " 2.03796737e-02, 4.60575111e-02, -5.39578963e-03,\n", 122 | " 5.13799861e-02, 4.98849079e-02, -1.53071098e-02,\n", 123 | " -2.55209878e-02, -9.37783793e-02, 6.80431351e-02,\n", 124 | " 5.42037040e-02, 5.88915544e-03, 3.77027579e-02,\n", 125 | " 2.97001610e-03, -2.73854788e-02, -2.25164257e-02,\n", 126 | " 2.94775404e-02, -4.49141003e-02, -1.22707179e-02,\n", 127 | " 3.51123661e-02, -9.62661114e-03, -4.74585360e-03,\n", 128 | " -6.34014159e-02, 2.71070562e-02, -8.06257129e-03,\n", 129 | " -6.70747459e-02, -5.07746078e-02, -4.76036221e-02,\n", 130 | " 7.18707684e-03, 3.65909301e-02, -3.67699936e-02,\n", 131 | " 3.70868184e-02, 1.12690397e-01, -1.10753492e-01,\n", 132 | " 1.88780073e-02, 6.59464002e-02, 4.53360453e-02,\n", 133 | " 5.63650019e-02, 
4.43356484e-02, 1.30171627e-02,\n", 134 | " -2.71456428e-02, -2.89043244e-02, 1.64611451e-02,\n", 135 | " -1.92087479e-02, -5.28771989e-02, -6.49906620e-02,\n", 136 | " 3.41170616e-02, -1.70326754e-02, -3.64219025e-02,\n", 137 | " -6.40340447e-02, -4.20621075e-02, 5.49546070e-02,\n", 138 | " -2.71569863e-02, 1.50894774e-02, -8.34458917e-02,\n", 139 | " 8.04520994e-02, -1.96695887e-02, -1.00700237e-01,\n", 140 | " -4.36882209e-03, 3.58170122e-02, -6.73646182e-02,\n", 141 | " -3.49581279e-02, -3.14205624e-02, -3.03178281e-02,\n", 142 | " 4.55205292e-02, -4.74548228e-02, -3.40684615e-02,\n", 143 | " 1.98236890e-02, 2.36651860e-02, -2.66083088e-02,\n", 144 | " -7.57225677e-02, -1.35216266e-02, -1.82724686e-03,\n", 145 | " 2.93097384e-02, -3.48166339e-02, 2.47215275e-02,\n", 146 | " 3.22892033e-02, 2.67713480e-02, -5.79769351e-02,\n", 147 | " -2.56844629e-02, 8.45318958e-02, -2.28709877e-02,\n", 148 | " -9.08638537e-03, 1.39732165e-02, 2.45238561e-02,\n", 149 | " 3.46098132e-02, 5.28965704e-02, -7.04566389e-02,\n", 150 | " -4.86870706e-02, 2.34722588e-02, 2.37552300e-02,\n", 151 | " -5.92066869e-02, 8.27051178e-02, -4.00973856e-03,\n", 152 | " -3.28391939e-02, -6.41322583e-02, 5.77677647e-03,\n", 153 | " -8.03356245e-02, 2.29892787e-02, -9.79695190e-03,\n", 154 | " -6.78975321e-03, -8.75867438e-03, -4.90265042e-02,\n", 155 | " 2.29285266e-02, -4.98827323e-02, -1.09793276e-01,\n", 156 | " -3.85776274e-02, 4.46549704e-04, 1.88573524e-02,\n", 157 | " -5.65416738e-02, 4.27371226e-02, -7.73091055e-03,\n", 158 | " 9.66695976e-03, -8.21083859e-02, 6.33048778e-03,\n", 159 | " -3.02886646e-02, 1.40163992e-02, -5.77669404e-02,\n", 160 | " 1.12527721e-01, -1.80120803e-02, 2.36992892e-02,\n", 161 | " -1.32897524e-02, -2.57620830e-02, 6.66455459e-03,\n", 162 | " 4.19999138e-02, -4.12883908e-02, -9.73793678e-03,\n", 163 | " 5.12045547e-02, -4.39418741e-02, -2.89999340e-02,\n", 164 | " -3.28261144e-02, 4.97053796e-03, -3.94377932e-02,\n", 165 | " 7.66094103e-02, -1.74061339e-02, 
-4.58508246e-02,\n", 166 | " 2.20106803e-02, 1.01143029e-02, 3.49179357e-02,\n", 167 | " -2.76056118e-02, 4.00061607e-02, -3.07031441e-02,\n", 168 | " -9.87935532e-03, -3.51552591e-02, 6.12977035e-02,\n", 169 | " -1.34940445e-02, 1.69758976e-03, -9.62444022e-03,\n", 170 | " -1.15804393e-02, 3.89520489e-02, 8.71613845e-02,\n", 171 | " 6.13522753e-02, -3.57693098e-02, 5.38780093e-02,\n", 172 | " -9.76623152e-04, -3.23415212e-02, 6.76710904e-02,\n", 173 | " 9.33619123e-03, -8.11213255e-03, 6.82704076e-02,\n", 174 | " 5.05042709e-02, -8.47424865e-02, -5.89879490e-02,\n", 175 | " 7.25789368e-02, -2.18459424e-02, 4.00722250e-02,\n", 176 | " -7.63654150e-03, 1.03146099e-02, 5.40494919e-02,\n", 177 | " 1.61888842e-02, 4.32131365e-02, 5.60503006e-02,\n", 178 | " -8.37420970e-02, -1.66589953e-02, -1.09322891e-02,\n", 179 | " 3.10896002e-02, 2.87623964e-02, 7.79771879e-02,\n", 180 | " -3.36286873e-02, 7.37195835e-02, -1.33916633e-02,\n", 181 | " 4.63935211e-02, 6.50074799e-03, 3.98444422e-02,\n", 182 | " -3.78602743e-02, 4.35293913e-02, -2.90157180e-02,\n", 183 | " 1.25429835e-02, 3.68853062e-02, -1.06367087e-02,\n", 184 | " -7.20745325e-02, -2.14768406e-02, -5.44496030e-02,\n", 185 | " -6.13776930e-02, 1.05972447e-01, 1.43837687e-02,\n", 186 | " 1.63943495e-03, -7.05509707e-02, -2.42533088e-02,\n", 187 | " 3.51534374e-02, 1.08488565e-02, -3.42009105e-02,\n", 188 | " -6.08277731e-02, 9.20248553e-02, 2.36441251e-02,\n", 189 | " 6.30925670e-02, 6.67787269e-02, -6.49841651e-02,\n", 190 | " 3.18379910e-03, 1.96745917e-02, -1.01103224e-02,\n", 191 | " 1.94480140e-02, -8.43841955e-02, -8.75772089e-02,\n", 192 | " -3.86252701e-02, 1.45352371e-02, 9.57477372e-03,\n", 193 | " -4.48818831e-03, -3.39164175e-02, -3.36552039e-02,\n", 194 | " -1.33386850e-02, -1.90982111e-02, 4.97365110e-02,\n", 195 | " 2.88681649e-02, 5.77684073e-03, -7.05776513e-02,\n", 196 | " 6.44142255e-02, 6.41829073e-02, -5.80542684e-02,\n", 197 | " -2.47256700e-02, -8.52649808e-02, 1.60062127e-02,\n", 198 | " 
-1.22919763e-02, 2.45415065e-02, 1.95066840e-03,\n", 199 | " -1.10592833e-02, 1.55704357e-02, 1.52007127e-02,\n", 200 | " -4.29791175e-02, -3.13911252e-02, -2.85093561e-02,\n", 201 | " -3.46784219e-02, 1.07909925e-02, -5.69052845e-02,\n", 202 | " -6.56142086e-02, -2.42444035e-03, -2.36847496e-04,\n", 203 | " -2.00943090e-02, -1.87727269e-02, 2.44390406e-02,\n", 204 | " -3.73762026e-02, -4.07696702e-02, 6.48761019e-02,\n", 205 | " 3.38231586e-02, 6.30460605e-02, 5.82951354e-03,\n", 206 | " -2.62612291e-02, 6.19867910e-03, -2.10380126e-02,\n", 207 | " 1.71352222e-04, 1.86081007e-02, 4.34052311e-02,\n", 208 | " -4.80737984e-02, 6.99277669e-02, 4.66579907e-02,\n", 209 | " -1.48551473e-02, 3.29916701e-02, -9.36777145e-03,\n", 210 | " 7.43718967e-02, -5.12492396e-02, -6.02555387e-02,\n", 211 | " -6.58557117e-02, -3.25691588e-02, -9.58766192e-02,\n", 212 | " 5.89718446e-02, -9.34590474e-02, -2.11967360e-02,\n", 213 | " -5.53228594e-02, 7.27902120e-03, -5.82117960e-03,\n", 214 | " 1.51520390e-02, 3.10048033e-02, 2.35684924e-02,\n", 215 | " -1.24157164e-02, -3.03980522e-02, -1.04722142e-01,\n", 216 | " 1.43642910e-02, 4.62585362e-03, -7.37912394e-03,\n", 217 | " -5.35621382e-02, 3.15730758e-02, -8.77389833e-02,\n", 218 | " 5.22329099e-02, -9.48735885e-03, 4.54171449e-02,\n", 219 | " -8.38277936e-02, -3.25404741e-02, 7.16998801e-03,\n", 220 | " 6.80265725e-02, -2.07673144e-02, -4.05646153e-02,\n", 221 | " 1.34903835e-02, 3.22747529e-02, -4.12309058e-02,\n", 222 | " 2.79887812e-03, 7.98721611e-03, 6.17843941e-02,\n", 223 | " 4.60151024e-03, 9.92045365e-03, 5.00864871e-02,\n", 224 | " -5.63305654e-02, -3.88379730e-02, 3.02622397e-03,\n", 225 | " 2.20519323e-02, 1.54148676e-02, 4.85269316e-02,\n", 226 | " -5.63364588e-02, -3.73017862e-02, -3.11127473e-02,\n", 227 | " 1.61838830e-02, -6.77759647e-02, -9.11579654e-02,\n", 228 | " 4.67085131e-02, -4.00679782e-02, -3.72959077e-02,\n", 229 | " -3.94075289e-02, -1.12072146e-02, 1.26367714e-02,\n", 230 | " 4.40460369e-02, 
-7.77020901e-02, -2.46636514e-02,\n", 231 | " -1.49408458e-02, 5.86274220e-03, -7.11899400e-02,\n", 232 | " -1.15099251e-02, 3.33920382e-02, 5.09477453e-03,\n", 233 | " 1.51081178e-02, 1.04949502e-02, -5.80682866e-02,\n", 234 | " -3.40924747e-02, -3.48201320e-02, 2.49468200e-02,\n", 235 | " 4.42005768e-02, -2.37165336e-02, 3.79255484e-03,\n", 236 | " -8.86938721e-02, -1.56422518e-02, 4.10543345e-02,\n", 237 | " 4.47053164e-02, 5.43537475e-02, 5.49245300e-03,\n", 238 | " 7.09640309e-02, 1.93180814e-02, -3.05815432e-02,\n", 239 | " -6.89341733e-03, -3.62095423e-02, -3.08503956e-03,\n", 240 | " 6.30579367e-02, 4.35884781e-02, 1.84933823e-02,\n", 241 | " -7.83578958e-03, 2.59191096e-02, -5.52807143e-03,\n", 242 | " -4.72284009e-05, 2.06883010e-02, -1.38790896e-02,\n", 243 | " 5.72590455e-02, 3.44927758e-02, 2.15114728e-02,\n", 244 | " 2.95498725e-02, -5.41498102e-02, -8.79013725e-03,\n", 245 | " 7.38454312e-02, 1.96587350e-02, 1.34385750e-02,\n", 246 | " -5.90348169e-02, -5.32622188e-02, 3.93599793e-02,\n", 247 | " -4.86550853e-02, -3.91548872e-02, 4.74032760e-02,\n", 248 | " 1.50756622e-02, 6.87927082e-02, -3.02066337e-02,\n", 249 | " -2.66485778e-03, -2.01581307e-02, 5.31393997e-02,\n", 250 | " 1.00522246e-02, -1.83966588e-02, -4.26581167e-02,\n", 251 | " -2.71374499e-03, 9.05769784e-03, -4.29850779e-02,\n", 252 | " -1.37065900e-02, -6.19315952e-02, -6.49061725e-02,\n", 253 | " 4.96972874e-02, -9.45900939e-03, 7.37345219e-02,\n", 254 | " 5.60122356e-02, -2.91699544e-02, -1.58697236e-02,\n", 255 | " -6.30429089e-02, 4.82642651e-02, 5.28050645e-04,\n", 256 | " 3.94601114e-02, -7.31267557e-02, 3.35745700e-02,\n", 257 | " 2.48057507e-02, -4.80459072e-02, -1.18432520e-02,\n", 258 | " -4.43868563e-02, 3.98386568e-02, -5.34126982e-02,\n", 259 | " 5.74409105e-02, 1.32571915e-02, -2.18527261e-02,\n", 260 | " 1.10984361e-02, -1.16096223e-02, 6.81838691e-02,\n", 261 | " 3.42932194e-02, -8.47309604e-02, -4.01029214e-02,\n", 262 | " -3.77797745e-02, -6.41229227e-02, 
3.81232128e-02,\n", 263 | " 2.52712779e-02, 1.10559305e-02, 9.84640513e-03,\n", 264 | " -1.20055480e-02, -9.66666546e-03, -5.53334281e-02,\n", 265 | " 2.41286773e-02, 1.00961186e-01, -1.27077922e-02,\n", 266 | " -4.23806421e-02, -1.07549950e-02, -3.54763754e-02,\n", 267 | " 5.58016337e-02, -7.87500739e-02, -3.64025608e-02,\n", 268 | " -2.90403571e-02, 5.37508540e-02, -2.00727507e-02,\n", 269 | " 3.00442167e-02, 3.45369503e-02, -3.68632935e-02,\n", 270 | " 1.22389954e-03, -6.67770281e-02, 2.16749627e-02,\n", 271 | " 3.61376889e-02, -4.56607640e-02, 2.02212632e-02,\n", 272 | " 4.63767387e-02, -4.86524552e-02, 5.23989350e-02,\n", 273 | " 1.38630597e-02, 5.03290556e-02, 5.27634881e-02,\n", 274 | " -3.88095379e-02, -1.34635530e-02, -7.79085681e-02,\n", 275 | " 1.63281877e-02, -7.12259766e-03]], dtype=float32)" 276 | ] 277 | }, 278 | "execution_count": 5, 279 | "metadata": {}, 280 | "output_type": "execute_result" 281 | } 282 | ], 283 | "source": [ 284 | "encoder.encode(\"hello world\")" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": 10, 290 | "metadata": { 291 | "ExecuteTime": { 292 | "end_time": "2020-06-07T23:58:33.082587Z", 293 | "start_time": "2020-06-07T23:58:33.053011Z" 294 | } 295 | }, 296 | "outputs": [ 297 | { 298 | "data": { 299 | "text/html": [ 300 | "
\n", 301 | "\n", 314 | "\n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | "
usernametext
0user1hello world
1user1merhaba dunya
2user2Bonjour le monde
3user2مرحبا بالعالم
\n", 345 | "
" 346 | ], 347 | "text/plain": [ 348 | " username text\n", 349 | "0 user1 hello world\n", 350 | "1 user1 merhaba dunya\n", 351 | "2 user2 Bonjour le monde\n", 352 | "3 user2 مرحبا بالعالم" 353 | ] 354 | }, 355 | "execution_count": 10, 356 | "metadata": {}, 357 | "output_type": "execute_result" 358 | } 359 | ], 360 | "source": [ 361 | "import pandas as pd\n", 362 | "\n", 363 | "df = pd.DataFrame({\n", 364 | " \"username\":[\"user1\", \"user1\", \"user2\", \"user2\"],\n", 365 | " \"text\": [\"hello world\", \"merhaba dunya\", \"Bonjour le monde\", \"مرحبا بالعالم\"]\n", 366 | "})\n", 367 | "df" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": 11, 373 | "metadata": { 374 | "ExecuteTime": { 375 | "end_time": "2020-06-07T23:59:26.926654Z", 376 | "start_time": "2020-06-07T23:59:26.886003Z" 377 | } 378 | }, 379 | "outputs": [ 380 | { 381 | "name": "stderr", 382 | "output_type": "stream", 383 | "text": [ 384 | "100%|██████████| 2/2 [00:00<00:00, 95.93it/s]\n" 385 | ] 386 | } 387 | ], 388 | "source": [ 389 | "encoder.encode_df(df, user_col=\"username\", text_col=\"text\", out_path=\"demo.npz\")" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 12, 395 | "metadata": { 396 | "ExecuteTime": { 397 | "end_time": "2020-06-07T23:59:38.423912Z", 398 | "start_time": "2020-06-07T23:59:38.418479Z" 399 | } 400 | }, 401 | "outputs": [ 402 | { 403 | "data": { 404 | "text/plain": [ 405 | "array(['user1', 'user2'], dtype='" 615 | ] 616 | }, 617 | "execution_count": 34, 618 | "metadata": {}, 619 | "output_type": "execute_result" 620 | }, 621 | { 622 | "data": { 623 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAA1EAAAI3CAYAAAB6X9FZAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nO3df5RddX0v/PeZaIafYW5CwwwJygMqpEkVYhB9kLtYQQxYFHQh4SbiLYq09SZyVUCqNEFQIYFSxULRh0eKNMIDUkGCkCjW2tu6rNRydaCC8kuB/GgIgYaEIHP280dkLiEB5szMPufMPq+Xa7sye89893dYWTCfeX+/n2+tKIoiAAAADElXqycAAAAwliiiAAAAGqCIAgAAaIAiCgAAoAGKKAAAgAYoogAAABrQNkXUgw8+mLlz52bOnDmZO3duHnrooVZPCQAAYDttU0QtXrw48+bNy4oVKzJv3rwsWrSo1VMCAADYTq0dDtt9/PHHM2fOnPz4xz/OuHHjMjAwkEMPPTQrV67MxIkTX/Zrn3vuuaxevTq9vb151ate1aQZAwBAe9qwYUM2btzYlHfttttu6enpacq72klbVB2rVq3KXnvtlXHjxiVJxo0bl8mTJ2fVqlWvWEStXr06Rx55ZO64445MnTq1GdOFMaG++g2tnsI2unrva/UUAKDyNmzYkHe+4y158j9rTXnfHnvskZUrV3ZcIdUWRRQAADByGzduzJP/WcvffrlI7++V+67V/5F8YOGT2bhxoyKqFfr6+rJmzZoMDAwMLudbu3Zt+vr6Wj01GLN+WzzX6ilso7vVEwCADtL7e8kUP0qXpi0aS0yaNCnTpk3L8uXLkyTLly/PtGnTXnEpHwAAsL16k/7XqdoiiUqSc889N2effXYuv/zyTJgwIUuWLGn1lAAAALbTNkXU/vvvnxtuuKHV0wAAgDFvoKhnoOQe3FvHH1fuS9pUWyznAwAAGCvaJokCAABGRz1F6TuWOndHlCQKAACgIZIoAAComCJF6il3U1SR5hzo244kUQAAAA2QRAEAQMUMpMhAUW4SNVDq6O1NEgUAANAASRQAAFRMvQl7onTnAwAAYEgUUQAAAA2wnA8AACqmniIDlvOVRhIFAADQAEkUAABUjMYS5ZJEAQAANEASBQAAFTNQNOGw3XKHb2uSKAAAgAZIogAAoGKKlL9nqYODKEkUAABAIyRRAABQMQNNOCdqoNTR25skCgAAoAGSKAAAqJiBovzuebrzAQAAMCSSKAAAqBjd+coliQIAAGiAJAoAACpmILXSu+cNpFbyG9qXJAoAAKABiigAAIAGWM4HAAAVUy+2XmW/o1NJogAAABogiQIAgIqpN6GxRL2DG0sooqCi6h19egMAQHkUUQAAUDFanJfLnigAAIAGSKIAAKBi6kWtCd35JFEAAAAMgSQKAAAqRne+ckmiAAAAGiCJAgCAitnana/cpKjspKudSaIAAAAaIIkCAICKaU53vnLHb2eKKKioeuqtngIAQCUpogAAoGKa052vc9kTBQAA0ABFFFTUrn0Pt3oKAACVZDkfAABUzEDRlYGSGz+UPX47k0QBAAA0QBLVYrt3/142Pruu1dOAl9Td3Z0tW7aMwkjlHvgHjZo+fXr6+/tbPQ2AUtTTVXrjh05uLKGIarG3v+aUbT6+7ZdLWzQTgOa6+YGDWvr+4/a7q6XvB2DsUkQBAEDFaHFeLnuiAAAAGiCJAgCAihkoahkoyt2PPFAUSTqzRZ8kCgAAoAGSKABaouzfkAJ0snpqqZfcGXfrnihJFAAAAK9AEgUAABVTT1cGSk+iinRqjz5JFAAAQAMkUa1WdOY6UgAAyjOQrvK783XofqhEEQVj2q7/157Z9NDjrZ4GtER3d3e2bNkyghE0tqA9TZ8+Pf39/a2eBvAyFFEwhr35ylNKf8cPj7yo9HcAdJojj/jCSz674wefbuJMqKqt3fnK3blT79D
...[remaining base64 PNG data omitted; the image is the condensed tree drawn by clusterer.plot_tree]\n", 624 | "text/plain": [ 625 | "
" 626 | ] 627 | }, 628 | "metadata": {}, 629 | "output_type": "display_data" 630 | } 631 | ], 632 | "source": [ 633 | "clusterer.cluster(tree_path=\"tree.pkl\")\n", 634 | "\n", 635 | "clusterer.plot_tree(path=\"tree.pkl\")" 636 | ] 637 | }, 638 | { 639 | "cell_type": "markdown", 640 | "metadata": {}, 641 | "source": [ 642 | "## Injecting labels" 643 | ] 644 | }, 645 | { 646 | "cell_type": "code", 647 | "execution_count": 36, 648 | "metadata": { 649 | "ExecuteTime": { 650 | "end_time": "2020-06-08T01:01:34.529613Z", 651 | "start_time": "2020-06-08T01:01:34.509232Z" 652 | } 653 | }, 654 | "outputs": [ 655 | { 656 | "data": { 657 | "text/html": [ 658 | "
\n", 659 | "\n", 672 | "\n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | "
   username  label
0  user1     pro
1  user2     anti
\n", 693 | "
" 694 | ], 695 | "text/plain": [ 696 | " username label\n", 697 | "0 user1 pro\n", 698 | "1 user2 anti" 699 | ] 700 | }, 701 | "execution_count": 36, 702 | "metadata": {}, 703 | "output_type": "execute_result" 704 | } 705 | ], 706 | "source": [ 707 | "# this is just an example. We are hiding the actual labels of real users here\n", 708 | "labels = pd.DataFrame({\"username\":[\"user1\", \"user2\"], \"label\":[\"pro\", \"anti\"]})\n", 709 | "labels" 710 | ] 711 | }, 712 | { 713 | "cell_type": "code", 714 | "execution_count": 38, 715 | "metadata": { 716 | "ExecuteTime": { 717 | "end_time": "2020-06-08T01:21:09.279291Z", 718 | "start_time": "2020-06-08T01:21:09.274152Z" 719 | } 720 | }, 721 | "outputs": [], 722 | "source": [ 723 | "clusterer.inject_labels(users=labels.username, labels=labels.label)\n", 724 | "\n", 725 | "clusterer.align_clusters_with_labels(\n", 726 | " # this means multiple clusters can be assigned the same label\n", 727 | " allow_multiple_clusters=True\n", 728 | ")" 729 | ] 730 | }, 731 | { 732 | "cell_type": "markdown", 733 | "metadata": {}, 734 | "source": [ 735 | "## Example on Turkish Election dataset" 736 | ] 737 | }, 738 | { 739 | "cell_type": "code", 740 | "execution_count": 39, 741 | "metadata": { 742 | "ExecuteTime": { 743 | "end_time": "2020-06-08T01:23:28.426960Z", 744 | "start_time": "2020-06-08T01:23:28.419726Z" 745 | } 746 | }, 747 | "outputs": [], 748 | "source": [ 749 | "clusterer.plot()" 750 | ] 751 | }, 752 | { 753 | "cell_type": "markdown", 754 | "metadata": {}, 755 | "source": [ 756 | "" 757 | ] 758 | }, 759 | { 760 | "cell_type": "markdown", 761 | "metadata": {}, 762 | "source": [ 763 | "## Example on Trump dataset" 764 | ] 765 | }, 766 | { 767 | "cell_type": "code", 768 | "execution_count": 37, 769 | "metadata": { 770 | "ExecuteTime": { 771 | "end_time": "2020-06-08T01:05:41.558112Z", 772 | "start_time": "2020-06-08T01:05:41.554222Z" 773 | } 774 | }, 775 | "outputs": [], 776 | "source": [ 777 | "# this calculates the micro f1 
score for all umap configurations in the grid search\n", 778 | "# and plots the result of each configuration\n", 779 | "# then returns the results matrix and a heatmap plot of it\n", 780 | "results, hm = cluster_projection_grid_search(\n", 781 | " \"trials\", users=labels.username, labels=labels.label,\n", 782 | " # this means multiple clusters can be assigned the same label\n", 783 | " allow_multiple_clusters=True\n", 784 | ")" 785 | ] 786 | }, 787 | { 788 | "cell_type": "markdown", 789 | "metadata": {}, 790 | "source": [ 791 | "Example of plotted projections and grid search heatmap\n", 792 | "\n", 793 | "\n", 794 | "" 795 | ] 796 | } 797 | ], 798 | "metadata": { 799 | "kernelspec": { 800 | "display_name": "Python 3", 801 | "language": "python", 802 | "name": "python3" 803 | }, 804 | "language_info": { 805 | "codemirror_mode": { 806 | "name": "ipython", 807 | "version": 3 808 | }, 809 | "file_extension": ".py", 810 | "mimetype": "text/x-python", 811 | "name": "python", 812 | "nbconvert_exporter": "python", 813 | "pygments_lexer": "ipython3", 814 | "version": "3.6.9" 815 | }, 816 | "toc": { 817 | "base_numbering": 1, 818 | "nav_menu": {}, 819 | "number_sections": true, 820 | "sideBar": true, 821 | "skip_h1_title": false, 822 | "title_cell": "Table of Contents", 823 | "title_sidebar": "Contents", 824 | "toc_cell": false, 825 | "toc_position": {}, 826 | "toc_section_display": true, 827 | "toc_window_display": false 828 | }, 829 | "varInspector": { 830 | "cols": { 831 | "lenName": 16, 832 | "lenType": 16, 833 | "lenVar": 40 834 | }, 835 | "kernels_config": { 836 | "python": { 837 | "delete_cmd_postfix": "", 838 | "delete_cmd_prefix": "del ", 839 | "library": "var_list.py", 840 | "varRefreshCmd": "print(var_dic_list())" 841 | }, 842 | "r": { 843 | "delete_cmd_postfix": ") ", 844 | "delete_cmd_prefix": "rm(", 845 | "library": "var_list.r", 846 | "varRefreshCmd": "cat(var_dic_list()) " 847 | } 848 | }, 849 | "types_to_exclude": [ 850 | "module", 851 | "function", 852 | 
"builtin_function_or_method", 853 | "instance", 854 | "_Feature" 855 | ], 856 | "window_display": false 857 | } 858 | }, 859 | "nbformat": 4, 860 | "nbformat_minor": 4 861 | } 862 | -------------------------------------------------------------------------------- /ed.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/ed.png -------------------------------------------------------------------------------- /methodology_diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/methodology_diagram.png -------------------------------------------------------------------------------- /src/AR_STOPWORDS.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/src/AR_STOPWORDS.pkl -------------------------------------------------------------------------------- /src/__pycache__/clustering.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/src/__pycache__/clustering.cpython-36.pyc -------------------------------------------------------------------------------- /src/__pycache__/encoder.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/src/__pycache__/encoder.cpython-36.pyc -------------------------------------------------------------------------------- /src/__pycache__/preprocessing.cpython-36.pyc: 
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/src/__pycache__/preprocessing.cpython-36.pyc
--------------------------------------------------------------------------------
/src/__pycache__/projection.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/src/__pycache__/projection.cpython-36.pyc
--------------------------------------------------------------------------------
/src/__pycache__/top_terms.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/src/__pycache__/top_terms.cpython-36.pyc
--------------------------------------------------------------------------------
/src/clustering.py:
--------------------------------------------------------------------------------
import os
import pickle
from typing import Optional

import hdbscan
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import classification_report, f1_score
from tqdm import tqdm

from projection import Projector


class Clusterer:

    def __init__(self, projection_path):
        self.projection_path = projection_path
        self._params = self._load_standard_embeddings()
        self.N: int = len(self._params["users"])

    def _load_standard_embeddings(self):
        # Read every array stored in the projection file into a plain dict.
        file = np.load(self.projection_path, allow_pickle=True)
        params = dict()
        for k in file.keys():
            params[k] = file[k]
        return params

    @staticmethod
    def _cluster(standard_embeddings, **kwargs):
        return hdbscan.HDBSCAN(**kwargs).fit(standard_embeddings)

    def cluster(self, min_samples: Optional[int] = None, min_cluster_size: Optional[int] = None,
                min_samples_divisor: int = 1000, min_cluster_size_divisor: int = 100,
                tree_path=None,
                **kwargs):
        # Fall back to size-dependent defaults only when no explicit value is given;
        # explicit values are forwarded to HDBSCAN instead of being dropped.
        if min_samples is None:
            min_samples = max(10, self.N // min_samples_divisor)
        if min_cluster_size is None:
            min_cluster_size = max(10, self.N // min_cluster_size_divisor)

        model = self._cluster(standard_embeddings=self._params["umap"],
                              min_samples=min_samples,
                              min_cluster_size=min_cluster_size,
                              **kwargs
                              )

        self._params["clusters"] = model.labels_
        # Overwrite the projection file in place, closing the handle afterwards.
        with open(self.projection_path, 'wb') as f:
            np.savez(f, **self._params)
        if tree_path is not None:
            with open(tree_path, 'wb') as f:
                pickle.dump(model.condensed_tree_, f, protocol=3)

    @staticmethod
    def plot_tree(path):
        sns.set(context='notebook', style='white', rc={'figure.figsize': (15, 10)})
        with open(path, 'rb') as f:
            return pickle.load(f).plot()

    def plot(self, labels_col="clusters"):
        return Projector.plot(embeddings=self._params["umap"], labels=self._params[labels_col])

    def inject_labels(self, users, labels):
        # Users without a known label are marked 'unk'.
        labels_dict = dict(zip(users, labels))
        self._params["labels"] = np.array(
            [labels_dict[u] if u in labels_dict else 'unk' for u in self._params["users"]]
        )

    def align_clusters_with_labels(self, allow_multiple_clusters=True):
        labels = self._params["labels"]
        ind = labels != 'unk'
        users = self._params["users"][ind]
        labels = labels[ind]

        # The column is named "label" so that it matches the groupby keys used below.
        df = pd.DataFrame(
            {"username": users, "label": labels}
        ).merge(
            pd.DataFrame({"username": self._params["users"], "clusters": self._params["clusters"]})
        )

        # Number of labeled users per (label, cluster) pair, largest first.
        g = df.groupby(["label", "clusters"]).count().sort_values("username", ascending=False)

        # Greedy one-to-one matching: take the largest pair, then drop its label and its cluster.
        d = {}
        while len(g) > 0:
            label, cluster = g.index[0]
            d[cluster] = label
            g = g.reset_index()
            g = g[(g.label != label) & (g.clusters != cluster)].set_index(["label", "clusters"]).sort_values(
                "username", ascending=False)
        unlabeled_clusters = set(df.clusters) - set(d.keys())
        if allow_multiple_clusters and len(unlabeled_clusters) > 0:
            # Give each remaining cluster the label most common among its labeled users.
            g = df.groupby(["label", "clusters"]).count().sort_values("username", ascending=False).reset_index()
            for c in unlabeled_clusters:
                cluster_label = g.set_index("clusters").loc[c].label
                if isinstance(cluster_label, pd.Series):
                    cluster_label = cluster_label.iloc[0]
                d[c] = cluster_label

                g = g[g.clusters != c]

        self._params["predictions"] = np.array([d[x] if x in d else 'unk' for x in self._params['clusters']])

    def evaluate(self, metric=f1_score, report=True):
        if "predictions" not in self._params:
            raise Exception("No labels aligned with clusters")

        y = self._params["labels"]
        p = self._params["predictions"]

        # Score only the users that have a known label.
        ind = y != 'unk'
        y = y[ind]
        p = p[ind]

        # scikit-learn documents `labels` as array-like, so pass a sorted list rather than a set.
        s = sorted(set(y))
        if report:
            return pd.DataFrame(classification_report(y, p, labels=s, output_dict=True))

        return metric(y, p, labels=s, average='micro')

    @staticmethod
    def cluster_projection_grid_search(trials_dir, users=None, labels=None, allow_multiple_clusters=True):
        results = dict()
        for fn in tqdm(os.listdir(trials_dir)):
            if not fn.endswith("npz"):
                continue
            # Projection files are named "<min_dist>_<n_neighbors>.npz".
            min_dist, n_neighbors = fn.replace(".npz", '').split("_")
            projection_path = os.path.join(trials_dir, fn)
            c = Clusterer(projection_path)
            c.cluster()
            # title = f"min_dist:{min_dist}\tn_neighbors:{n_neighbors}".expandtabs()
            plot_path = os.path.join(trials_dir, f"{min_dist}_{n_neighbors}.png")
            c.inject_labels(users=users, labels=labels)
            c.align_clusters_with_labels(allow_multiple_clusters=allow_multiple_clusters)
            fig = c.plot()
            plt.savefig(plot_path, bbox_inches='tight')
            plt.close()

            # report=False returns the micro-averaged F1 score instead of the full report.
            score = c.evaluate(report=False)
            results.setdefault(min_dist, dict())
            results[min_dist][n_neighbors] = score
        return results
--------------------------------------------------------------------------------
/src/encoder.py:
--------------------------------------------------------------------------------
import numpy as np
import pandas as pd
import tensorflow_hub as hub
# noinspection PyUnresolvedReferences
import tensorflow_text
from tqdm import tqdm


class Encoder:
    DEFAULT_MODEL = "https://tfhub.dev/google/universal-sentence-encoder/4"

    def __init__(self, model_url: str = DEFAULT_MODEL):
        """
        Args:
            model_url: str, url to the Universal Sentence Encoder model
            Default is English USE >> https://tfhub.dev/google/universal-sentence-encoder/4
            for the multilingual version, use: https://tfhub.dev/google/universal-sentence-encoder-multilingual/3
            more models are available at: https://tfhub.dev/google/collections/universal-sentence-encoder/1
        """
        self.model_url = model_url
        self.encoder = self._load_model()

    def _load_model(self):
        return hub.load(self.model_url)

    def encode(self, text):
        return np.array(self.encoder(text))

    def encode_df(self, df: pd.DataFrame, out_path: str, user_col: str = "username", text_col: str = "text"):
        users = list()
        vectors = list()
        counts = list()

        for user, tweets in tqdm(df.groupby(user_col)[text_col]):
            try:
                vs = np.array(self.encoder(tweets.tolist()))
                users.append(user)
                vectors.append(np.mean(vs, axis=0))
                counts.append(len(tweets))
            except Exception as e:
                print(user)
                print(e)

        np.savez(out_path, users=np.array(users), vectors=np.array(vectors), counts=np.array(counts),
                 allow_pickle=True)


class EncoderBERT(Encoder):
    DEFAULT_MODEL = "roberta-base-nli-stsb-mean-tokens"

    def _load_model(self):
        from sentence_transformers import SentenceTransformer
        return \
            SentenceTransformer(self.model_url)
--------------------------------------------------------------------------------
/src/mutual_information.py:
--------------------------------------------------------------------------------
import os

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import adjusted_mutual_info_score as ami
from tqdm import tqdm


def correlate_clustering(df1, df2, metric_func, clusters_col="labels", user_col="users", **kwargs):
    # Defaults match the frames built in mutual_information(): columns "users" and "labels".
    # Negative cluster ids (HDBSCAN noise points) are excluded before comparing.
    merged = pd.merge(df1[df1[clusters_col] >= 0], df2[df2[clusters_col] >= 0], on=user_col)
    y1, y2 = merged[f"{clusters_col}_x"], merged[f"{clusters_col}_y"]
    return metric_func(y1, y2, **kwargs)


def calculate_alignment_matrix(dfs, metric_func, **kwargs):
    # Pairwise agreement between every two clusterings, computed on their shared users.
    matrix = np.zeros((len(dfs), len(dfs)))
    for i, df1 in tqdm(enumerate(dfs)):
        for j, df2 in enumerate(dfs):
            matrix[i][j] = correlate_clustering(df1, df2, metric_func, **kwargs)
    return matrix


def plot_heatmap(frames, topics, func=ami):
    hm = calculate_alignment_matrix(frames, func)
    # Materialise the reversed order once; a bare reversed() iterator is not a valid indexer.
    reversed_topics = list(reversed(topics))
    hm = pd.DataFrame(hm, columns=topics, index=topics).loc[reversed_topics]
    fig = sns.heatmap(
        hm.round(2),
        annot=True,
        cmap="Blues",
        annot_kws={"size": 30},
        # yticklabels=[i.title() for i in topics]
    )
    fig.set_yticklabels(labels=reversed_topics, rotation=45)
    fig.set_xticklabels(labels=topics, rotation=45)
    n = min(len(topics) * 2, 18)
    sns.set(context='notebook', style='white', rc={'figure.figsize': (n, n)}, font_scale=3.5)
    return fig


def mutual_information(topics, root="topicals"):
    frames = list()
    for topic in tqdm(topics):
        # No leading slash here: os.path.join discards `root` when a later component is absolute.
        f = np.load(os.path.join(root, f"{topic}.npz"))
        users = f["users"]
        clusters = f["clusters"]
        frames.append(pd.DataFrame({"users": users, "labels": clusters}))

    fig = plot_heatmap(frames, topics)
    return fig
--------------------------------------------------------------------------------
/src/preprocessing.py:
--------------------------------------------------------------------------------
import re

import preprocessor as p

p.set_options(p.OPT.URL, p.OPT.MENTION)


def camel_case_split(identifier):
    matches = re.finditer('.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]


def clean(text):
    text = p.clean(text)
    text = re.sub(r'^RT ', '', text)
    text = ' '.join(camel_case_split(text))
    text = re.sub(r'\W+', ' ', text)
    text = re.sub(r"\d+", "number", text)
    if len(text.strip().split()) < 3:
        return None
    return text.lower().strip()
--------------------------------------------------------------------------------
/src/projection.py:
--------------------------------------------------------------------------------
import os
from itertools import product

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm
from umap import UMAP


class Projector:
    DEFAULT_UMAP_PARAMS = dict(
        n_components=2,
        min_dist=0,
        n_neighbors=90,
        metric="cosine",
        random_state=42
    )
    DEFAULT_DIST_RANGE = [0.0, 0.1, 0.25, 0.5, 0.75, 0.8, 0.9, 0.99]
    DEFAULT_NEIGHBORS_RANGE = [20, 30, 40, 50, 60, 70, 80, 90, 100]

    def __init__(self, vectors_path):
        self.vectors_path = vectors_path
        self.users, self.vectors, self.counts = self._load_vectors(vectors_path)

    @staticmethod
    def _load_vectors(vectors_path):
        file = np.load(vectors_path)
        users: np.ndarray = file['users']
        vectors: np.ndarray = file['vectors']
        counts: np.ndarray = file['counts']
        return users, vectors, counts

    @staticmethod
    def _project(vectors, **kwargs):
        return \
            UMAP(**kwargs).fit_transform(vectors)

    def project(self, out_path, min_counts=3, **kwargs):
        params = self.DEFAULT_UMAP_PARAMS.copy()
        params.update(kwargs)

        ind = self.counts >= min_counts
        users = self.users[ind]
        vectors = self.vectors[ind]

        standard_embeddings = self._project(
            vectors=vectors,
            **params
        )
        with open(out_path, 'wb') as f:
            np.savez(f, umap=standard_embeddings, users=users)

    @staticmethod
    def plot_grid_search_heatmap(results, heatmap_destination="temp.png"):
        hm = pd.DataFrame(results)
        hm.index = hm.index.astype(int)
        hm = hm.sort_index(ascending=False)
        x = sorted(hm.columns)
        hm.index.name = "n_neighbors"
        hm.columns.name = "min_dist"

        sns.set(context='notebook', style='white', rc={'figure.figsize': (len(hm) * 2, len(hm.columns) * 2)},
                font_scale=2.5)

        sns.heatmap(hm[x], annot=True, cmap="Blues", annot_kws={"size": 30}, vmin=0.3, vmax=1, cbar=False)
        plt.savefig(heatmap_destination, bbox_inches='tight')

    @staticmethod
    def plot(embeddings, labels):
        fig = plt.figure()
        ax = fig.add_subplot(111)
        scatter = plt.scatter(embeddings[:, 0], embeddings[:, 1],
                              c=labels, s=0.1, cmap='Spectral')
        return scatter

    def grid_search(self, trials_dir, min_dists_range=DEFAULT_DIST_RANGE, n_neighbors_range=DEFAULT_NEIGHBORS_RANGE,
                    n_components=2,
                    metric="cosine", min_counts=3,
                    skip_existing=True, verbose=False):
        ind = self.counts >= min_counts
        users = self.users[ind]
        vectors = self.vectors[ind]

        umap_params = list(product(min_dists_range, n_neighbors_range))
        for min_dist, n in tqdm(umap_params, desc="UMAP"):
            if verbose:
                print(f"{min_dist}_{n}")
            out_path = os.path.join(trials_dir, f"{min_dist}_{n}.npz")
            if os.path.isfile(out_path) and skip_existing:
                continue

            standard_embeddings = self._project(
                vectors=vectors,
                random_state=42,
                n_components=n_components,
                n_neighbors=n,
                min_dist=min_dist,
                metric=metric
            )
            with open(out_path, 'wb') as f:
                np.savez(f, umap=standard_embeddings, users=users)
--------------------------------------------------------------------------------
/src/top_terms.py:
--------------------------------------------------------------------------------
import os
import re
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image
from ar_wordcloud import ArabicWordCloud
from joblib import Parallel, delayed
from tqdm.notebook import tqdm
from wordcloud import STOPWORDS, WordCloud


def get_word_counts(file, text_col=None):
    res = {}
    tfg = 0
    pbar = tqdm(desc=file.split("/")[-1])
    with open(file) as f:
        for i, l in enumerate(f, 1):
            l = l.replace('.', '').strip().lower()
            if text_col is not None:
                l = l.split('\t')[text_col]
            for w in l.split():
                if len(w) <= 2:
                    continue
                res.setdefault(w, 0)
                res[w] += 1
                tfg += 1
            if i % 10_000 == 0:
                # tqdm.update takes an increment, not a running total.
                pbar.update(10_000)
    return res, tfg


def count_words_csv(text_series):
    counter = Counter()
    text_series.apply(lambda x: counter.update(x.lower().strip().split()))
    return dict(counter), sum(counter.values())


def valence_step(tfe1, tfg1, tfe2, tfg2, out, e):
    # a and b are the relative frequencies of term e in group 1 and group 2.
    a = tfe1 / tfg1
    b = tfe2 / tfg2
    # Valence lies in [-1, 1]; 1 means the term occurs only in group 1.
    v1 = 2 * (a / (a + b)) - 1
    if v1 >= 0.8:
        out.write(f"{v1 * np.log(tfe1)}\t{e}\t{v1}\t{tfe1}\n")


def sort_scores(file):
    # Writes the sorted scores to a ".tsv" sibling and removes the original ".txt" file.
    pd.read_csv(
        file, sep='\t', names=["score", "term", "valence", "frequency"]
    ).sort_values(
        "score", ascending=False
    ).to_csv(
        file.replace("txt", "tsv"), sep='\t', index=None
    )
    os.remove(file)


def valence(tf1, tfg1, tf2, tfg2, out):
    with open(out, 'w') as o:
        Parallel(n_jobs=-1, backend='threading')(
            delayed(valence_step)(
                tfe, tfg1, 0 if e not in tf2 else tf2[e], tfg2, o, e
            ) for e, tfe in tf1.items() if len(e) > 2
        )
    print("Sorting terms")
    sort_scores(out)


def pipeline(df1, df2, out1, out2=None, text_col='text'):
    print("Counting terms...")
    (tf1, tfg1), (tf2, tfg2) = Parallel(n_jobs=2, backend='threading')(
        delayed(count_words_csv)(df[text_col]) for df in [df1, df2])
    del df1, df2
    print("Calculating valence for group 1 ...")
    valence(tf1, tfg1, tf2, tfg2, out1)
    if out2 is not None:
        print("Calculating valence for group 2 ...")
        valence(tf2, tfg2, tf1, tfg1, out2)


def plot_worcloud(file, mask_path=None, arabic=False):
    params = dict(width=800, height=800,
                  background_color='white',
                  min_font_size=10)
    if mask_path is not None:
        params["mask"] = np.array(Image.open(mask_path))

    scores = pd.read_csv(file, sep='\t').dropna()
    is_en = lambda x: bool(re.search('[a-z]', x.lower()))
    if arabic:
        import pickle
        params['stopwords'] = pickle.load(open('AR_STOPWORDS.pkl', 'rb'))
        scores = scores[~scores.term.apply(is_en)]
    else:
        params['stopwords'] = set(STOPWORDS)
        scores = scores[scores.term.apply(is_en)]
    scores = scores[:500].set_index("term").to_dict()["score"]
    if arabic:
        wordcloud = ArabicWordCloud(**params)
        fig = wordcloud.from_dict(scores)
    else:
        wordcloud = WordCloud(**params)
        fig = wordcloud.generate_from_frequencies(scores)
    plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.savefig(f"{file}.png")


def calculate_top_terms(clusters_path, tweets_path, prefix, user_col, text_col, use_clusters=True, mask_path=None):
    enf = np.load(clusters_path)
    df = pd.read_pickle(tweets_path)
    users, clusters = \
        enf["users"], enf["clusters"]
    if use_clusters:
        labels = dict(zip(users, clusters))
        ind = clusters >= 0
    else:
        y = np.array(
            [1 if re.search("(lfc)|(liverpool)", x.lower()) else 0 if re.search("(cfc)|(chelsea)", x.lower()) else -1
             for x in enf["users"]])
        ind = y >= 0
        labels = dict(zip(users, y))
    df = df[df[user_col].apply(lambda x: x in labels)]
    df = df.assign(label=df[user_col].apply(lambda x: labels[x]))

    o1 = os.path.join("terms", f"{prefix}.0.txt")
    o2 = os.path.join("terms", f"{prefix}.1.txt")
    pipeline(df[df.label == 0], df[df.label == 1],
             out1=o1,
             out2=o2,
             text_col=text_col
             )

    # Iterate over both output files (enumerate(o1, o2) raises a TypeError).
    # sort_scores() has already replaced each ".txt" file with a sorted ".tsv" one.
    for o in (o1, o2):
        plot_worcloud(o.replace("txt", "tsv"), mask_path=mask_path)
--------------------------------------------------------------------------------
/src/turkish_normalizer.py:
--------------------------------------------------------------------------------
# get zemberek from https://github.com/ahmetaa/zemberek-nlp

from os.path import join

from jpype import JClass, JString, getDefaultJVMPath, startJVM

ZEMBEREK_PATH: str = join('zemberek', 'bin', 'zemberek-full.jar')

startJVM(
    getDefaultJVMPath(),
    '-ea',
    f'-Djava.class.path={ZEMBEREK_PATH}',
    convertStrings=False
)

TurkishMorphology: JClass = JClass('zemberek.morphology.TurkishMorphology')
TurkishSentenceNormalizer: JClass = JClass(
    'zemberek.normalization.TurkishSentenceNormalizer'
)
Paths: JClass = JClass('java.nio.file.Paths')

normalizer = TurkishSentenceNormalizer(
    TurkishMorphology.createWithDefaults(),
    Paths.get(
        join('zemberek', 'data', 'normalization')
    ),
    Paths.get(
        join('zemberek', 'data', 'lm', 'lm.2gram.slm')
    )
)


def normalize(text):
    return str(normalizer.normalize(JString(text)))


def normalize_df(df, text_col):
    df[text_col] = df[text_col].apply(normalize)
    return df
--------------------------------------------------------------------------------
/trials/0.0_30.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/trials/0.0_30.png
--------------------------------------------------------------------------------
/trials/0.1_60.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/trials/0.1_60.png
--------------------------------------------------------------------------------
/trials/hm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/trials/hm.png
--------------------------------------------------------------------------------
/wc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/wc.png
--------------------------------------------------------------------------------