├── .gitignore
├── LICENSE
├── Presentation.pdf
├── Presentation.pptx
├── README.md
├── ami.png
├── clusterUsersUniversalSentenceEncoder.py
├── demo.ipynb
├── ed.png
├── methodology_diagram.png
├── src
│   ├── AR_STOPWORDS.pkl
│   ├── __pycache__
│   │   ├── clustering.cpython-36.pyc
│   │   ├── encoder.cpython-36.pyc
│   │   ├── preprocessing.cpython-36.pyc
│   │   ├── projection.cpython-36.pyc
│   │   └── top_terms.cpython-36.pyc
│   ├── clustering.py
│   ├── encoder.py
│   ├── mutual_information.py
│   ├── preprocessing.py
│   ├── projection.py
│   ├── top_terms.py
│   └── turkish_normalizer.py
├── trials
│   ├── 0.0_30.png
│   ├── 0.1_60.png
│   └── hm.png
└── wc.png
/.gitignore:
--------------------------------------------------------------------------------
1 | .idea/
2 | .ipynb_checkpoints/
3 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2025 Ammar Rashed
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/Presentation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/Presentation.pdf
--------------------------------------------------------------------------------
/Presentation.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/Presentation.pptx
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Embeddings-Based Unsupervised Stance Detection
2 |
3 | This repository contains the implementation of an unsupervised method for target-specific stance detection using embeddings-based clustering, as presented in our ICWSM 2021 paper.
4 |
5 | ## Publications
6 |
7 | - **Paper (ICWSM'21)**: [Embeddings-Based Clustering for Target Specific Stances: The Case of a Polarized Turkey](https://ojs.aaai.org/index.php/ICWSM/article/view/18082)
8 | - **Paper Presentation**: [PaperTalk ICWSM'21](https://papertalk.org/papertalks/31537)
9 | - **Thesis (MSc August 2020)**: [Embeddings-Based Clustering For Target Specific Stances](https://tez.yok.gov.tr/UlusalTezMerkezi/TezGoster?key=fl0Kw4p1rmMDotyKRdYv1AZv-bsnninllPXAXfoe9S1sXEDBPXspE5WeUtqcCjlk)
10 |
11 | ## Overview
12 |
13 | We propose an unsupervised method for stance detection that can capture fine-grained divergences across various topics in polarized communities. Our approach overcomes the limitations of previous methods by:
14 |
15 | - Not requiring platform-specific features (like retweets)
16 | - Working effectively with limited data
17 | - Supporting hierarchical clustering without specifying the number of clusters
18 | - Using pre-trained language models to handle morphologically rich languages
19 |
20 | ## Methodology
21 |
22 | The method consists of five main steps:
23 |
24 | 1. **Data Collection**: Collect tweets related to specific topics or targets
25 | 2. **Feature Extraction**: Encode tweets using pre-trained universal sentence encoders
26 | 3. **User Representation**: Average tweet vectors per user to create user embeddings
27 | 4. **Projection**: Project user vectors to lower dimensional space using UMAP
28 | 5. **Clustering**: Cluster the projected vectors using HDBSCAN
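
Steps 2 and 3 can be sketched without any of the heavy dependencies. The snippet below is a minimal illustration, not the repository's implementation: `toy_encoder` is a hypothetical stand-in for a real sentence encoder, and `user_embeddings` is an illustrative name. The `min_tweets=3` threshold mirrors the default of `cluster_users`.

```python
from collections import defaultdict


def toy_encoder(texts):
    """Hypothetical stand-in for a sentence encoder: [length, vowel share] per text."""
    vectors = []
    for text in texts:
        vowels = sum(ch in "aeiou" for ch in text.lower())
        vectors.append([float(len(text)), vowels / max(len(text), 1)])
    return vectors


def user_embeddings(rows, encoder, min_tweets=3):
    """Average each user's tweet vectors; skip users with fewer than min_tweets tweets."""
    tweets_by_user = defaultdict(list)
    for user, tweet in rows:
        tweets_by_user[user].append(tweet)

    embeddings = {}
    for user, tweets in tweets_by_user.items():
        if len(tweets) < min_tweets:
            continue
        vectors = encoder(tweets)
        dim = len(vectors[0])
        # Component-wise mean over all tweet vectors of this user
        embeddings[user] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return embeddings


rows = [("u1", "aa"), ("u1", "bbbb"), ("u1", "ab"), ("u2", "only one tweet")]
embeddings = user_embeddings(rows, toy_encoder)
print(embeddings)  # only u1 has enough tweets to get an embedding
```

In the actual pipeline, these per-user vectors are what UMAP projects and HDBSCAN then clusters.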
29 |
30 | 
31 |
32 | ## Key Features
33 |
34 | ### Fine-grained Stance Detection
35 |
36 | Our method can automatically detect stances down to the party-affiliation level in a completely unsupervised manner, outperforming previous approaches.
37 |
38 | 
39 |
40 | ### Cross-Topic Mutual Information
41 |
42 | Using our clustering method, we can analyze the correlations between user stances across different topics, allowing for deeper insight into the structure of polarization.
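
One way to quantify such a correlation is the mutual information between two cluster assignments of the same users. The sketch below computes plain (unadjusted) mutual information in nats. The function names are illustrative, and the repository's own `src/mutual_information.py` may use a different variant, for example an adjusted one. Users labelled `-1` (HDBSCAN noise) are dropped, in line with the `clusters >= 0` filter used by the plotting code.

```python
import math
from collections import Counter


def shared_user_labels(clusters_a, clusters_b):
    """Align two {user: cluster} dicts on users that are clustered (label >= 0) in both."""
    users = sorted(
        u for u in clusters_a
        if u in clusters_b and clusters_a[u] >= 0 and clusters_b[u] >= 0
    )
    return [clusters_a[u] for u in users], [clusters_b[u] for u in users]


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same users")
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


topic_a = {"u1": 0, "u2": 0, "u3": 1, "u4": 1, "u5": -1}
topic_b = {"u1": 1, "u2": 1, "u3": 0, "u4": 0, "u5": 0}
labels_a, labels_b = shared_user_labels(topic_a, topic_b)
mi = mutual_information(labels_a, labels_b)
print(mi)  # stances on the two topics are perfectly aligned, so MI equals log(2)
```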
43 |
44 | 
45 |
46 | ### Semantic Analysis Between Clusters
47 |
48 | We identify the most prominent terms in each cluster to show how different groups talk about the same issues in different contexts, revealing semantic divergences between polarized groups.
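
A simple way to surface cluster-specific vocabulary is to rank each term by the share of its occurrences that fall inside a cluster. This scoring rule is an assumption made for illustration only; the repository's `src/top_terms.py` may rank terms differently, and `top_terms` below is a hypothetical helper.

```python
from collections import Counter


def top_terms(tweets_by_cluster, k=3, min_count=2):
    """Rank terms per cluster by the fraction of their occurrences found in that cluster."""
    cluster_counts = {
        cluster: Counter(word for tweet in tweets for word in tweet.lower().split())
        for cluster, tweets in tweets_by_cluster.items()
    }
    total = Counter()
    for counts in cluster_counts.values():
        total.update(counts)

    result = {}
    for cluster, counts in cluster_counts.items():
        scored = [
            (counts[word] / total[word], counts[word], word)
            for word in counts
            if counts[word] >= min_count  # ignore terms that are too rare in this cluster
        ]
        # Highest share first, then most frequent, then alphabetical for determinism
        scored.sort(key=lambda item: (-item[0], -item[1], item[2]))
        result[cluster] = [word for _, _, word in scored[:k]]
    return result


tweets_by_cluster = {
    0: ["economy jobs economy", "jobs reform vote"],
    1: ["vote security borders", "security vote borders"],
}
result = top_terms(tweets_by_cluster)
print(result)
```

Terms shared by both clusters (here `vote`) score lower than terms used by only one group, which is the kind of contrast the word clouds aim to show.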
49 |
50 | 
51 |
52 | ## Performance
53 |
54 | Our method achieves:
55 | - 90% precision in identifying user stances
56 | - Over 80% recall
57 | - Competitive performance with supervised methods, while being completely unsupervised
58 | - Ability to detect fine-grained sub-groups that previous methods couldn't identify
59 |
60 | ## Installation
61 |
62 | ```bash
63 | # Clone this repository
64 | git clone https://github.com/AmmarRashed/UnsupervisedStanceDetection.git
65 | cd UnsupervisedStanceDetection
66 |
67 | # Create and activate a virtual environment (recommended)
68 | python -m venv venv
69 | source venv/bin/activate # On Windows: venv\Scripts\activate
70 |
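71 | # Install dependencies (no requirements.txt is listed in the repository tree; versions follow the Requirements list below)
72 | pip install "umap-learn==0.3.10" "hdbscan==0.8.26" "tensorflow-hub==0.8.0" "tensorflow-text==2.2.1" matplotlib numpy pandas tqdm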
73 | ```
74 |
75 | ### Requirements
76 |
77 | > **Note**: This work was tested using specific versions of packages. Newer versions might not work as expected.
78 |
79 | - [umap-learn 0.3.x](https://pypi.org/project/umap-learn/0.3.10/)
80 | - [hdbscan 0.8.x](https://pypi.org/project/hdbscan/0.8.26/)
81 | - [tensorflow-hub 0.8.x](https://pypi.org/project/tensorflow-hub/0.8.0/)
82 | - [tensorflow-text 2.2.x](https://pypi.org/project/tensorflow-text/2.2.1/)
83 | - matplotlib
84 | - numpy
85 | - pandas
86 | - tqdm
87 |
88 | ## Usage
89 |
90 | ```bash
91 | # Basic usage example
92 | python clusterUsersUniversalSentenceEncoder.py your_data.tsv
93 | ```
94 |
95 | The input file should be a tab-separated file with:
96 | - First column: UserIDs
97 | - Second column: Tweets
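
To check an input file before running the full pipeline, the format can be parsed with the standard library alone. This is a hedged sketch: `read_user_tweets` is a hypothetical helper, not part of the repository. It strips whitespace and skips malformed lines, and it applies the same `min_tweets=3` default that `cluster_users` uses.

```python
import csv
import io
from collections import defaultdict


def read_user_tweets(handle, min_tweets=3):
    """Read a headerless TSV (user ID, tweet) and keep users with at least min_tweets tweets."""
    tweets = defaultdict(list)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip lines without a tab separator
        user, text = row[0].strip(), row[1].strip()
        if user and text:
            tweets[user].append(text)
    return {user: texts for user, texts in tweets.items() if len(texts) >= min_tweets}


sample = (
    "alice\tfirst tweet\n"
    "alice\tsecond tweet \n"
    "alice\tthird tweet\n"
    "bob\tlonely tweet\n"
    "broken line without a tab\n"
)
users = read_user_tweets(io.StringIO(sample))
print(users)  # bob is dropped because he has fewer than three tweets
```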
98 |
99 | ### Code Sample
100 |
101 | ```python
102 | from clusterUsersUniversalSentenceEncoder import cluster_users, plot_clusters_no_labels
103 | import tensorflow_hub as hub
104 | import pandas as pd
105 |
106 | # Load the universal sentence encoder
107 | embed = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')
108 |
109 | # Load and prepare your data
110 | df_text = pd.read_csv('your_data.tsv', header=None, usecols=[0, 1], sep='\t')
111 | df_text.columns = ['User', 'Text']
112 | df_text = df_text.apply(lambda s: s.str.strip())
113 |
114 | # Cluster users based on their tweets
115 | cluster_users(df_text, embed, user_col='User', tweet_col='Text', save_at='results.npz')
116 |
117 | # Visualize the clusters
118 | plot_clusters_no_labels('results.npz.cluster')
119 | ```
120 |
121 | ## Customization Options
122 |
123 | The method can be customized with different parameters:
124 |
125 | - **Sentence Encoder**: Different pre-trained models can be used (multilingual, transformer-based, etc.)
126 | - **UMAP Parameters**: Adjust `min_dist` and `n_neighbors` to control projection characteristics
127 | - **HDBSCAN Parameters**: Modify `min_cluster_size` and `min_samples` to control clustering sensitivity
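
When `min_cluster_size` and `min_samples` are not given, `cluster_embeddings` derives them from the number of users: one hundredth and one thousandth of the users respectively, each with a floor of 10. The helper below (the name `default_hdbscan_params` is illustrative) reproduces that rule so the effective values can be inspected before clustering. For UMAP, `cluster_users` defaults to `min_dist=0.0` and `n_neighbors=90`.

```python
def default_hdbscan_params(n_users, min_cluster_size_div=100, min_samples_div=1000):
    """Mirror the fallback rule in cluster_embeddings for a given number of users."""
    return {
        "min_cluster_size": max(10, n_users // min_cluster_size_div),
        "min_samples": max(10, n_users // min_samples_div),
    }


for n_users in (500, 5000, 20000):
    print(n_users, default_hdbscan_params(n_users))
```

With small datasets both values stay at the floor of 10, so tuning them explicitly matters most when only a few hundred users are available.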
128 |
129 | ## Applications
130 |
131 | This method has been successfully applied to:
132 | - Political polarization analysis
133 | - Election stance detection
134 | - Sports fan sentiment analysis
135 | - Cross-cultural stance detection
136 |
137 | ## Citation
138 |
139 | If you use this code in your research, please cite our paper:
140 |
141 | ```
142 | Rashed, A., Kutlu, M., Darwish, K., Elsayed, T., & Bayrak, C. (2021). Embeddings-Based Clustering for Target Specific Stances: The Case of a Polarized Turkey. Proceedings of the International AAAI Conference on Web and Social Media, 15(1), 537-548. https://doi.org/10.1609/icwsm.v15i1.18082
143 | ```
144 |
145 | BibTeX format:
146 |
147 | ```bibtex
148 | @article{rashed2021embeddings,
149 | title={Embeddings-Based Clustering for Target Specific Stances: The Case of a Polarized Turkey},
150 | author={Rashed, Ammar and Kutlu, Mucahid and Darwish, Kareem and Elsayed, Tamer and Bayrak, Cansın},
151 | journal={Proceedings of the International AAAI Conference on Web and Social Media},
152 | volume={15},
153 | number={1},
154 | pages={537--548},
155 | year={2021},
156 | doi={10.1609/icwsm.v15i1.18082}
157 | }
158 | ```
159 |
160 | ## Contributing
161 |
162 | Contributions are welcome! Please feel free to submit a Pull Request.
163 |
164 | ## License
165 |
166 | This project is licensed under the MIT License - see the LICENSE file for details.
167 |
168 | ## Contact
169 |
170 | - Ammar Rashed (ammar.rasid@ozu.edu.tr)
171 | - Kareem Darwish (kdarwish@hbku.edu.qa)
172 |
--------------------------------------------------------------------------------
/ami.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/ami.png
--------------------------------------------------------------------------------
/clusterUsersUniversalSentenceEncoder.py:
--------------------------------------------------------------------------------
1 | ###############################################################################
2 | # Code written by Ammar Rashid (Özyeğin University)
3 | # ammar.rasid@ozu.edu.tr
4 | # and modified by Kareem Darwish (Qatar Computing Research Institute)
5 | # kdarwish@hbku.edu.qa
6 | # The code is provided for research purposes ONLY
7 | ###############################################################################
8 |
9 | ###############################################################################
10 | # sys.argv[1] is a tab separated file with first column containing UserIDs
11 | # and second column containing tweets
12 | ###############################################################################
13 | # there are many options for the universal sentence encoder including multilingual
14 | # models, Transformer model (slow), and CNN model (fast)
15 | # check out: https://tfhub.dev/google/universal-sentence-encoder/1
16 | # for options
17 | ###############################################################################
18 |
19 | import ntpath
20 | import sys
21 | from typing import Callable
22 |
23 | import matplotlib.pyplot as plt
24 | import numpy as np
25 | import pandas as pd
26 | import tensorflow_hub as hub
27 | from hdbscan import HDBSCAN
28 | from tqdm import tqdm
29 | from umap import UMAP
30 |
31 |
32 | def cluster_users(df, encoder: Callable, min_tweets=3, user_col="username",
33 |                   tweet_col="norm_tweet", save_at="temp.npz",
34 |                   min_dist=0.0, n_neighbors=90, **kwargs):
35 |     gs = df.groupby(user_col)
36 |     users = list()
37 |     vectors = list()
38 |     for user, frame in tqdm(gs):
39 |         if len(frame) < min_tweets:
40 |             continue
41 |         try:
42 |             tweets = frame[tweet_col]
43 |             vec = np.mean(np.array(encoder(tweets.tolist())), axis=0)
44 |             users.append(user)
45 |             vectors.append(vec)
46 |         except Exception as e:
47 |             print(f"ERROR at:{user}")
48 |             print(e)
49 |             print()
50 |
51 |     users = np.array(users)
52 |     vectors = np.array(vectors)
53 |
54 |     standard_embeddings = UMAP(
55 |         random_state=42,
56 |         n_components=2,
57 |         n_neighbors=n_neighbors,
58 |         min_dist=min_dist,
59 |         metric='cosine'
60 |     ).fit_transform(vectors)
61 |     print("Projection complete")
62 |
63 |     # **kwargs (e.g. min_cluster_size, min_samples) are HDBSCAN options; UMAP raises TypeError on unknown keywords
64 |     clusterer = cluster_embeddings(standard_embeddings, **kwargs)
65 |
66 |     # Write through a file handle so the file keeps the exact '.cluster' name
67 |     # (np.savez would append '.npz' to a plain path) and is closed afterwards
68 |     with open(save_at + '.cluster', 'wb') as cluster_file:
69 |         np.savez(cluster_file, users=users, vectors=vectors,
70 |                  umap=np.array(standard_embeddings), clusters=np.array(clusterer.labels_))
71 |
72 |     with open(save_at + '.clusters.txt', mode='w') as output_file:
73 |         for user, label in zip(users, clusterer.labels_):
74 |             output_file.write(str(user) + '\t' + str(label) + '\n')
75 |
76 |
77 | def plot_clusters_no_labels(embeddings_path, clusters_col="clusters", green_label="pro", red_label='anti', align=False,
78 |                             title=None, include_ratio=True, labeled_only=False):
79 |     if title is None:
80 |         title = ntpath.basename(embeddings_path).split('.')[0]
81 |     f = np.load(embeddings_path, allow_pickle=True)
82 |     users = f["users"]
83 |     clusters = f[clusters_col]
84 |     cluster_ratio = round(sum(clusters >= 0) * 100 / len(clusters), 2)
85 |     em = f["umap"]
86 |
87 |     ind = clusters >= 0
88 |     users = users[ind]
89 |     clusters = clusters[ind]
90 |     em = em[ind, :]
91 |     c = ['red', 'blue', 'green', 'black', 'orange', 'teal']
92 |     if align:
93 |         # NOTE: align_clusters_with_labels groups by a "label" column, which this frame lacks,
94 |         # so align=True raises KeyError until gold labels are merged in here
95 |         d = align_clusters_with_labels(
96 |             pd.DataFrame({"username": users, "clusters": clusters})
97 |         )
98 |         c = ['red', 'blue', 'green', 'black', 'orange', 'teal', 'olive', 'yellow']
99 |
100 |     # Cycle through the palette so that having more clusters than colours
101 |     # does not raise IndexError
102 |     cmap = [c[label % len(c)] for label in clusters]
103 |
104 |     fig = plt.figure()
105 |     ax = fig.add_subplot(111)
106 |     # Colours are given by name, so no matplotlib colormap is needed
107 |     scatter = plt.scatter(em[:, 0], em[:, 1], c=cmap, s=0.5)
108 |     ax.set_title(title, fontsize=22)
109 |     plt.show()
110 |     return scatter
111 |
112 |
113 | def align_clusters_with_labels(df, allow_multiple_clusters=True):
114 |     df = df[df.clusters >= 0]
115 |     g = df.groupby(["label", "clusters"]).count().sort_values("username", ascending=False)
116 |
117 |     d = {}
118 |     while len(g) > 0:
119 |         label, cluster = g.index[0]
120 |         d[cluster] = label
121 |         g = g.reset_index()
122 |         g = g[(g.label != label) & (g.clusters != cluster)] \
123 |             .set_index(["label", "clusters"]) \
124 |             .sort_values("username", ascending=False)
125 |     unlabeled_clusters = set(df.clusters) - set(d.keys())
126 |     if allow_multiple_clusters and len(unlabeled_clusters) > 0:
127 |         g = df.groupby(["label", "clusters"]).count().sort_values("username", ascending=False).reset_index()
128 |         for c in unlabeled_clusters:
129 |             l = g.set_index("clusters").loc[c].label
130 |             if isinstance(l, pd.Series):
131 |                 l = l.iloc[0]
132 |             d[c] = l
133 |
134 |             g = g[g.clusters != c]
135 |
136 |     return d
137 |
138 |
139 | def cluster_embeddings(standard_embedding,
140 |                        min_cluster_size=None,
141 |                        min_samples=None,
142 |                        plot_tree=False,
143 |                        min_samples_div=1000,
144 |                        min_cluster_size_div=100,
145 |                        **kwargs):
146 |     if min_cluster_size is None:
147 |         min_cluster_size = max(10, len(standard_embedding) // min_cluster_size_div)
148 |     if min_samples is None:
149 |         min_samples = max(10, len(standard_embedding) // min_samples_div)
150 |     clusterer = HDBSCAN(
151 |         min_samples=min_samples,
152 |         min_cluster_size=min_cluster_size, **kwargs
153 |     ).fit(standard_embedding)
154 |     if plot_tree:
155 |         clusterer.condensed_tree_.plot()
156 |     # return clusterer.labels_, clusterer.condensed_tree_
157 |     return clusterer
158 |
159 |
160 | if __name__ == "__main__":
161 |     embed = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')  # You can use different encoders here
162 |     inputFile = sys.argv[1]  # ex. trump.tsv
163 |     # error_bad_lines exists only in older pandas; pandas 2.0 removed it in favour of on_bad_lines='skip'
164 |     df_text = pd.read_csv(inputFile, header=None, usecols=[0, 1], error_bad_lines=False, sep='\t')
165 |     df_text.columns = ['User', 'Text']
166 |     df_text = df_text.apply(lambda s: s.str.strip())
167 |     cluster_users(df_text, embed, user_col='User', tweet_col='Text', save_at=inputFile + '.npz')
168 |     plot_clusters_no_labels(inputFile + '.npz.cluster')
169 |
--------------------------------------------------------------------------------
/demo.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Preprocessing\n",
8 | "- Remove URLs and Mentions\n",
9 | "- Separate composite camel case words (e.g. BlackLives --> black lives)\n",
10 | "- Remove non-alphanumeric characters\n",
11 | "- Replace numbers with the token \"_number_\"\n",
12 | "- Lowercase everything"
13 | ]
14 | },
15 | {
16 | "cell_type": "code",
17 | "execution_count": 1,
18 | "metadata": {
19 | "ExecuteTime": {
20 | "end_time": "2020-06-07T23:45:10.994786Z",
21 | "start_time": "2020-06-07T23:45:10.990361Z"
22 | }
23 | },
24 | "outputs": [],
25 | "source": [
26 | "from src.preprocessing import clean"
27 | ]
28 | },
29 | {
30 | "cell_type": "code",
31 | "execution_count": 3,
32 | "metadata": {
33 | "ExecuteTime": {
34 | "end_time": "2020-06-07T23:46:12.708646Z",
35 | "start_time": "2020-06-07T23:46:12.680332Z"
36 | }
37 | },
38 | "outputs": [
39 | {
40 | "data": {
41 | "text/plain": [
42 | "'black lives matter'"
43 | ]
44 | },
45 | "execution_count": 3,
46 | "metadata": {},
47 | "output_type": "execute_result"
48 | }
49 | ],
50 | "source": [
51 | "clean(\"#BlackLivesMatter https://www.google.com/ @realDonaldTrump\")"
52 | ]
53 | },
54 | {
55 | "cell_type": "markdown",
56 | "metadata": {},
57 | "source": [
58 | "# Encoding\n",
59 | "## Universal Sentence Encoder"
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": 1,
65 | "metadata": {
66 | "ExecuteTime": {
67 | "end_time": "2020-06-07T23:55:34.962132Z",
68 | "start_time": "2020-06-07T23:55:33.625304Z"
69 | }
70 | },
71 | "outputs": [],
72 | "source": [
73 | "from src.encoder import Encoder"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 2,
79 | "metadata": {
80 | "ExecuteTime": {
81 | "end_time": "2020-06-07T23:55:40.034473Z",
82 | "start_time": "2020-06-07T23:55:37.473873Z"
83 | }
84 | },
85 | "outputs": [],
86 | "source": [
87 | "# default encoder is USE for English only.\n",
88 | "# But you can use multilingual as well, like ...\n",
89 | "encoder = Encoder(model_url=\"https://tfhub.dev/google/universal-sentence-encoder-multilingual/3\")"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": 5,
95 | "metadata": {
96 | "ExecuteTime": {
97 | "end_time": "2020-06-07T23:55:57.481777Z",
98 | "start_time": "2020-06-07T23:55:57.448134Z"
99 | }
100 | },
101 | "outputs": [
102 | {
103 | "data": {
104 | "text/plain": [
105 | "array([[ 9.02929604e-02, 2.53139641e-02, -8.63993599e-04,\n",
106 | " 3.37017924e-02, -6.26476333e-02, -4.42366041e-02,\n",
107 | " 2.18325537e-02, 5.37963435e-02, -8.38939548e-02,\n",
108 | " -9.51755140e-03, -3.12121455e-02, -5.35302460e-02,\n",
109 | " -4.03429270e-02, -6.45988435e-02, -4.22783829e-02,\n",
110 | " 6.87545631e-03, 2.68412735e-02, 1.69232395e-02,\n",
111 | " 4.52055521e-02, -7.21441209e-02, 7.80028552e-02,\n",
112 | " 7.60580525e-02, -4.91863601e-02, -3.33283916e-02,\n",
113 | " -6.48475764e-03, 5.31073436e-02, 5.94470128e-02,\n",
114 | " 4.97598015e-02, -5.83836809e-02, 4.62118129e-04,\n",
115 | " -2.54417248e-02, -4.07968946e-02, 2.24085082e-03,\n",
116 | " -5.71764819e-02, 3.96157652e-02, -5.56416325e-02,\n",
117 | " 1.06351763e-01, -2.11038422e-02, -4.97004427e-02,\n",
118 | " 1.37671484e-02, 2.52124630e-02, 6.93862326e-03,\n",
119 | " -8.78239796e-03, -4.25839275e-02, -7.41932988e-02,\n",
120 | " 3.93395983e-02, -5.14756478e-02, -4.80900072e-02,\n",
121 | " 2.03796737e-02, 4.60575111e-02, -5.39578963e-03,\n",
122 | " 5.13799861e-02, 4.98849079e-02, -1.53071098e-02,\n",
123 | " -2.55209878e-02, -9.37783793e-02, 6.80431351e-02,\n",
124 | " 5.42037040e-02, 5.88915544e-03, 3.77027579e-02,\n",
125 | " 2.97001610e-03, -2.73854788e-02, -2.25164257e-02,\n",
126 | " 2.94775404e-02, -4.49141003e-02, -1.22707179e-02,\n",
127 | " 3.51123661e-02, -9.62661114e-03, -4.74585360e-03,\n",
128 | " -6.34014159e-02, 2.71070562e-02, -8.06257129e-03,\n",
129 | " -6.70747459e-02, -5.07746078e-02, -4.76036221e-02,\n",
130 | " 7.18707684e-03, 3.65909301e-02, -3.67699936e-02,\n",
131 | " 3.70868184e-02, 1.12690397e-01, -1.10753492e-01,\n",
132 | " 1.88780073e-02, 6.59464002e-02, 4.53360453e-02,\n",
133 | " 5.63650019e-02, 4.43356484e-02, 1.30171627e-02,\n",
134 | " -2.71456428e-02, -2.89043244e-02, 1.64611451e-02,\n",
135 | " -1.92087479e-02, -5.28771989e-02, -6.49906620e-02,\n",
136 | " 3.41170616e-02, -1.70326754e-02, -3.64219025e-02,\n",
137 | " -6.40340447e-02, -4.20621075e-02, 5.49546070e-02,\n",
138 | " -2.71569863e-02, 1.50894774e-02, -8.34458917e-02,\n",
139 | " 8.04520994e-02, -1.96695887e-02, -1.00700237e-01,\n",
140 | " -4.36882209e-03, 3.58170122e-02, -6.73646182e-02,\n",
141 | " -3.49581279e-02, -3.14205624e-02, -3.03178281e-02,\n",
142 | " 4.55205292e-02, -4.74548228e-02, -3.40684615e-02,\n",
143 | " 1.98236890e-02, 2.36651860e-02, -2.66083088e-02,\n",
144 | " -7.57225677e-02, -1.35216266e-02, -1.82724686e-03,\n",
145 | " 2.93097384e-02, -3.48166339e-02, 2.47215275e-02,\n",
146 | " 3.22892033e-02, 2.67713480e-02, -5.79769351e-02,\n",
147 | " -2.56844629e-02, 8.45318958e-02, -2.28709877e-02,\n",
148 | " -9.08638537e-03, 1.39732165e-02, 2.45238561e-02,\n",
149 | " 3.46098132e-02, 5.28965704e-02, -7.04566389e-02,\n",
150 | " -4.86870706e-02, 2.34722588e-02, 2.37552300e-02,\n",
151 | " -5.92066869e-02, 8.27051178e-02, -4.00973856e-03,\n",
152 | " -3.28391939e-02, -6.41322583e-02, 5.77677647e-03,\n",
153 | " -8.03356245e-02, 2.29892787e-02, -9.79695190e-03,\n",
154 | " -6.78975321e-03, -8.75867438e-03, -4.90265042e-02,\n",
155 | " 2.29285266e-02, -4.98827323e-02, -1.09793276e-01,\n",
156 | " -3.85776274e-02, 4.46549704e-04, 1.88573524e-02,\n",
157 | " -5.65416738e-02, 4.27371226e-02, -7.73091055e-03,\n",
158 | " 9.66695976e-03, -8.21083859e-02, 6.33048778e-03,\n",
159 | " -3.02886646e-02, 1.40163992e-02, -5.77669404e-02,\n",
160 | " 1.12527721e-01, -1.80120803e-02, 2.36992892e-02,\n",
161 | " -1.32897524e-02, -2.57620830e-02, 6.66455459e-03,\n",
162 | " 4.19999138e-02, -4.12883908e-02, -9.73793678e-03,\n",
163 | " 5.12045547e-02, -4.39418741e-02, -2.89999340e-02,\n",
164 | " -3.28261144e-02, 4.97053796e-03, -3.94377932e-02,\n",
165 | " 7.66094103e-02, -1.74061339e-02, -4.58508246e-02,\n",
166 | " 2.20106803e-02, 1.01143029e-02, 3.49179357e-02,\n",
167 | " -2.76056118e-02, 4.00061607e-02, -3.07031441e-02,\n",
168 | " -9.87935532e-03, -3.51552591e-02, 6.12977035e-02,\n",
169 | " -1.34940445e-02, 1.69758976e-03, -9.62444022e-03,\n",
170 | " -1.15804393e-02, 3.89520489e-02, 8.71613845e-02,\n",
171 | " 6.13522753e-02, -3.57693098e-02, 5.38780093e-02,\n",
172 | " -9.76623152e-04, -3.23415212e-02, 6.76710904e-02,\n",
173 | " 9.33619123e-03, -8.11213255e-03, 6.82704076e-02,\n",
174 | " 5.05042709e-02, -8.47424865e-02, -5.89879490e-02,\n",
175 | " 7.25789368e-02, -2.18459424e-02, 4.00722250e-02,\n",
176 | " -7.63654150e-03, 1.03146099e-02, 5.40494919e-02,\n",
177 | " 1.61888842e-02, 4.32131365e-02, 5.60503006e-02,\n",
178 | " -8.37420970e-02, -1.66589953e-02, -1.09322891e-02,\n",
179 | " 3.10896002e-02, 2.87623964e-02, 7.79771879e-02,\n",
180 | " -3.36286873e-02, 7.37195835e-02, -1.33916633e-02,\n",
181 | " 4.63935211e-02, 6.50074799e-03, 3.98444422e-02,\n",
182 | " -3.78602743e-02, 4.35293913e-02, -2.90157180e-02,\n",
183 | " 1.25429835e-02, 3.68853062e-02, -1.06367087e-02,\n",
184 | " -7.20745325e-02, -2.14768406e-02, -5.44496030e-02,\n",
185 | " -6.13776930e-02, 1.05972447e-01, 1.43837687e-02,\n",
186 | " 1.63943495e-03, -7.05509707e-02, -2.42533088e-02,\n",
187 | " 3.51534374e-02, 1.08488565e-02, -3.42009105e-02,\n",
188 | " -6.08277731e-02, 9.20248553e-02, 2.36441251e-02,\n",
189 | " 6.30925670e-02, 6.67787269e-02, -6.49841651e-02,\n",
190 | " 3.18379910e-03, 1.96745917e-02, -1.01103224e-02,\n",
191 | " 1.94480140e-02, -8.43841955e-02, -8.75772089e-02,\n",
192 | " -3.86252701e-02, 1.45352371e-02, 9.57477372e-03,\n",
193 | " -4.48818831e-03, -3.39164175e-02, -3.36552039e-02,\n",
194 | " -1.33386850e-02, -1.90982111e-02, 4.97365110e-02,\n",
195 | " 2.88681649e-02, 5.77684073e-03, -7.05776513e-02,\n",
196 | " 6.44142255e-02, 6.41829073e-02, -5.80542684e-02,\n",
197 | " -2.47256700e-02, -8.52649808e-02, 1.60062127e-02,\n",
198 | " -1.22919763e-02, 2.45415065e-02, 1.95066840e-03,\n",
199 | " -1.10592833e-02, 1.55704357e-02, 1.52007127e-02,\n",
200 | " -4.29791175e-02, -3.13911252e-02, -2.85093561e-02,\n",
201 | " -3.46784219e-02, 1.07909925e-02, -5.69052845e-02,\n",
202 | " -6.56142086e-02, -2.42444035e-03, -2.36847496e-04,\n",
203 | " -2.00943090e-02, -1.87727269e-02, 2.44390406e-02,\n",
204 | " -3.73762026e-02, -4.07696702e-02, 6.48761019e-02,\n",
205 | " 3.38231586e-02, 6.30460605e-02, 5.82951354e-03,\n",
206 | " -2.62612291e-02, 6.19867910e-03, -2.10380126e-02,\n",
207 | " 1.71352222e-04, 1.86081007e-02, 4.34052311e-02,\n",
208 | " -4.80737984e-02, 6.99277669e-02, 4.66579907e-02,\n",
209 | " -1.48551473e-02, 3.29916701e-02, -9.36777145e-03,\n",
210 | " 7.43718967e-02, -5.12492396e-02, -6.02555387e-02,\n",
211 | " -6.58557117e-02, -3.25691588e-02, -9.58766192e-02,\n",
212 | " 5.89718446e-02, -9.34590474e-02, -2.11967360e-02,\n",
213 | " -5.53228594e-02, 7.27902120e-03, -5.82117960e-03,\n",
214 | " 1.51520390e-02, 3.10048033e-02, 2.35684924e-02,\n",
215 | " -1.24157164e-02, -3.03980522e-02, -1.04722142e-01,\n",
216 | " 1.43642910e-02, 4.62585362e-03, -7.37912394e-03,\n",
217 | " -5.35621382e-02, 3.15730758e-02, -8.77389833e-02,\n",
218 | " 5.22329099e-02, -9.48735885e-03, 4.54171449e-02,\n",
219 | " -8.38277936e-02, -3.25404741e-02, 7.16998801e-03,\n",
220 | " 6.80265725e-02, -2.07673144e-02, -4.05646153e-02,\n",
221 | " 1.34903835e-02, 3.22747529e-02, -4.12309058e-02,\n",
222 | " 2.79887812e-03, 7.98721611e-03, 6.17843941e-02,\n",
223 | " 4.60151024e-03, 9.92045365e-03, 5.00864871e-02,\n",
224 | " -5.63305654e-02, -3.88379730e-02, 3.02622397e-03,\n",
225 | " 2.20519323e-02, 1.54148676e-02, 4.85269316e-02,\n",
226 | " -5.63364588e-02, -3.73017862e-02, -3.11127473e-02,\n",
227 | " 1.61838830e-02, -6.77759647e-02, -9.11579654e-02,\n",
228 | " 4.67085131e-02, -4.00679782e-02, -3.72959077e-02,\n",
229 | " -3.94075289e-02, -1.12072146e-02, 1.26367714e-02,\n",
230 | " 4.40460369e-02, -7.77020901e-02, -2.46636514e-02,\n",
231 | " -1.49408458e-02, 5.86274220e-03, -7.11899400e-02,\n",
232 | " -1.15099251e-02, 3.33920382e-02, 5.09477453e-03,\n",
233 | " 1.51081178e-02, 1.04949502e-02, -5.80682866e-02,\n",
234 | " -3.40924747e-02, -3.48201320e-02, 2.49468200e-02,\n",
235 | " 4.42005768e-02, -2.37165336e-02, 3.79255484e-03,\n",
236 | " -8.86938721e-02, -1.56422518e-02, 4.10543345e-02,\n",
237 | " 4.47053164e-02, 5.43537475e-02, 5.49245300e-03,\n",
238 | " 7.09640309e-02, 1.93180814e-02, -3.05815432e-02,\n",
239 | " -6.89341733e-03, -3.62095423e-02, -3.08503956e-03,\n",
240 | " 6.30579367e-02, 4.35884781e-02, 1.84933823e-02,\n",
241 | " -7.83578958e-03, 2.59191096e-02, -5.52807143e-03,\n",
242 | " -4.72284009e-05, 2.06883010e-02, -1.38790896e-02,\n",
243 | " 5.72590455e-02, 3.44927758e-02, 2.15114728e-02,\n",
244 | " 2.95498725e-02, -5.41498102e-02, -8.79013725e-03,\n",
245 | " 7.38454312e-02, 1.96587350e-02, 1.34385750e-02,\n",
246 | " -5.90348169e-02, -5.32622188e-02, 3.93599793e-02,\n",
247 | " -4.86550853e-02, -3.91548872e-02, 4.74032760e-02,\n",
248 | " 1.50756622e-02, 6.87927082e-02, -3.02066337e-02,\n",
249 | " -2.66485778e-03, -2.01581307e-02, 5.31393997e-02,\n",
250 | " 1.00522246e-02, -1.83966588e-02, -4.26581167e-02,\n",
251 | " -2.71374499e-03, 9.05769784e-03, -4.29850779e-02,\n",
252 | " -1.37065900e-02, -6.19315952e-02, -6.49061725e-02,\n",
253 | " 4.96972874e-02, -9.45900939e-03, 7.37345219e-02,\n",
254 | " 5.60122356e-02, -2.91699544e-02, -1.58697236e-02,\n",
255 | " -6.30429089e-02, 4.82642651e-02, 5.28050645e-04,\n",
256 | " 3.94601114e-02, -7.31267557e-02, 3.35745700e-02,\n",
257 | " 2.48057507e-02, -4.80459072e-02, -1.18432520e-02,\n",
258 | " -4.43868563e-02, 3.98386568e-02, -5.34126982e-02,\n",
259 | " 5.74409105e-02, 1.32571915e-02, -2.18527261e-02,\n",
260 | " 1.10984361e-02, -1.16096223e-02, 6.81838691e-02,\n",
261 | " 3.42932194e-02, -8.47309604e-02, -4.01029214e-02,\n",
262 | " -3.77797745e-02, -6.41229227e-02, 3.81232128e-02,\n",
263 | " 2.52712779e-02, 1.10559305e-02, 9.84640513e-03,\n",
264 | " -1.20055480e-02, -9.66666546e-03, -5.53334281e-02,\n",
265 | " 2.41286773e-02, 1.00961186e-01, -1.27077922e-02,\n",
266 | " -4.23806421e-02, -1.07549950e-02, -3.54763754e-02,\n",
267 | " 5.58016337e-02, -7.87500739e-02, -3.64025608e-02,\n",
268 | " -2.90403571e-02, 5.37508540e-02, -2.00727507e-02,\n",
269 | " 3.00442167e-02, 3.45369503e-02, -3.68632935e-02,\n",
270 | " 1.22389954e-03, -6.67770281e-02, 2.16749627e-02,\n",
271 | " 3.61376889e-02, -4.56607640e-02, 2.02212632e-02,\n",
272 | " 4.63767387e-02, -4.86524552e-02, 5.23989350e-02,\n",
273 | " 1.38630597e-02, 5.03290556e-02, 5.27634881e-02,\n",
274 | " -3.88095379e-02, -1.34635530e-02, -7.79085681e-02,\n",
275 | " 1.63281877e-02, -7.12259766e-03]], dtype=float32)"
276 | ]
277 | },
278 | "execution_count": 5,
279 | "metadata": {},
280 | "output_type": "execute_result"
281 | }
282 | ],
283 | "source": [
284 | "encoder.encode(\"hello world\")"
285 | ]
286 | },
287 | {
288 | "cell_type": "code",
289 | "execution_count": 10,
290 | "metadata": {
291 | "ExecuteTime": {
292 | "end_time": "2020-06-07T23:58:33.082587Z",
293 | "start_time": "2020-06-07T23:58:33.053011Z"
294 | }
295 | },
296 | "outputs": [
297 | {
298 | "data": {
347 | "text/plain": [
348 | " username text\n",
349 | "0 user1 hello world\n",
350 | "1 user1 merhaba dunya\n",
351 | "2 user2 Bonjour le monde\n",
352 | "3 user2 مرحبا بالعالم"
353 | ]
354 | },
355 | "execution_count": 10,
356 | "metadata": {},
357 | "output_type": "execute_result"
358 | }
359 | ],
360 | "source": [
361 | "import pandas as pd\n",
362 | "\n",
363 | "df = pd.DataFrame({\n",
364 | " \"username\":[\"user1\", \"user1\", \"user2\", \"user2\"],\n",
365 | " \"text\": [\"hello world\", \"merhaba dunya\", \"Bonjour le monde\", \"مرحبا بالعالم\"]\n",
366 | "})\n",
367 | "df"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": 11,
373 | "metadata": {
374 | "ExecuteTime": {
375 | "end_time": "2020-06-07T23:59:26.926654Z",
376 | "start_time": "2020-06-07T23:59:26.886003Z"
377 | }
378 | },
379 | "outputs": [
380 | {
381 | "name": "stderr",
382 | "output_type": "stream",
383 | "text": [
384 | "100%|██████████| 2/2 [00:00<00:00, 95.93it/s]\n"
385 | ]
386 | }
387 | ],
388 | "source": [
389 | "encoder.encode_df(df, user_col=\"username\", text_col=\"text\", out_path=\"demo.npz\")"
390 | ]
391 | },
392 | {
393 | "cell_type": "code",
394 | "execution_count": 12,
395 | "metadata": {
396 | "ExecuteTime": {
397 | "end_time": "2020-06-07T23:59:38.423912Z",
398 | "start_time": "2020-06-07T23:59:38.418479Z"
399 | }
400 | },
401 | "outputs": [
402 | {
403 | "data": {
404 | "text/plain": [
405 | "array(['user1', 'user2'], dtype='"
615 | ]
616 | },
617 | "execution_count": 34,
618 | "metadata": {},
619 | "output_type": "execute_result"
620 | },
621 | {
622 | "data": {
623 | "image/png": "[base64-encoded PNG data omitted: condensed-tree plot produced by clusterer.plot_tree]",
624 | "text/plain": [
625 | ""
626 | ]
627 | },
628 | "metadata": {},
629 | "output_type": "display_data"
630 | }
631 | ],
632 | "source": [
633 | "clusterer.cluster(tree_path=\"tree.pkl\")\n",
634 | "\n",
635 | "clusterer.plot_tree(path=\"tree.pkl\")"
636 | ]
637 | },
638 | {
639 | "cell_type": "markdown",
640 | "metadata": {},
641 | "source": [
642 | "## Injecting labels"
643 | ]
644 | },
645 | {
646 | "cell_type": "code",
647 | "execution_count": 36,
648 | "metadata": {
649 | "ExecuteTime": {
650 | "end_time": "2020-06-08T01:01:34.529613Z",
651 | "start_time": "2020-06-08T01:01:34.509232Z"
652 | }
653 | },
654 | "outputs": [
655 | {
656 | "data": {
657 | "text/html": [
658 | "[HTML rendering of the labels DataFrame omitted; the text/plain output below shows the same two rows]"
694 | ],
695 | "text/plain": [
696 | " username label\n",
697 | "0 user1 pro\n",
698 | "1 user2 anti"
699 | ]
700 | },
701 | "execution_count": 36,
702 | "metadata": {},
703 | "output_type": "execute_result"
704 | }
705 | ],
706 | "source": [
707 | "# Illustrative labels only; the real users' labels are withheld here.\n",
708 | "labels = pd.DataFrame({\"username\":[\"user1\", \"user2\"], \"label\":[\"pro\", \"anti\"]})\n",
709 | "labels"
710 | ]
711 | },
712 | {
713 | "cell_type": "code",
714 | "execution_count": 38,
715 | "metadata": {
716 | "ExecuteTime": {
717 | "end_time": "2020-06-08T01:21:09.279291Z",
718 | "start_time": "2020-06-08T01:21:09.274152Z"
719 | }
720 | },
721 | "outputs": [],
722 | "source": [
723 | "clusterer.inject_labels(users=labels.username, labels=labels.label)\n",
724 | "\n",
725 | "clusterer.align_clusters_with_labels(\n",
726 | " # this means multiple clusters can be assigned the same label\n",
727 | " allow_multiple_clusters=True\n",
728 | ")"
729 | ]
730 | },
731 | {
732 | "cell_type": "markdown",
733 | "metadata": {},
734 | "source": [
735 | "## Example on Turkish Election dataset"
736 | ]
737 | },
738 | {
739 | "cell_type": "code",
740 | "execution_count": 39,
741 | "metadata": {
742 | "ExecuteTime": {
743 | "end_time": "2020-06-08T01:23:28.426960Z",
744 | "start_time": "2020-06-08T01:23:28.419726Z"
745 | }
746 | },
747 | "outputs": [],
748 | "source": [
749 | "clusterer.plot()"
750 | ]
751 | },
752 | {
753 | "cell_type": "markdown",
754 | "metadata": {},
755 | "source": [
756 | "[Figure: image markup lost in extraction]"
757 | ]
758 | },
759 | {
760 | "cell_type": "markdown",
761 | "metadata": {},
762 | "source": [
763 | "## Example on Trump dataset"
764 | ]
765 | },
766 | {
767 | "cell_type": "code",
768 | "execution_count": 37,
769 | "metadata": {
770 | "ExecuteTime": {
771 | "end_time": "2020-06-08T01:05:41.558112Z",
772 | "start_time": "2020-06-08T01:05:41.554222Z"
773 | }
774 | },
775 | "outputs": [],
776 | "source": [
777 | "# Compute the micro-F1 score for every UMAP configuration in the grid search,\n",
778 | "# plot the projection for each configuration,\n",
779 | "# and return the results matrix together with a heatmap of it.\n",
780 | "results, hm = cluster_projection_grid_search(\n",
781 | " \"trials\", users=labels.username, labels=labels.label,\n",
782 | " # this means multiple clusters can be assigned the same label\n",
783 | " allow_multiple_clusters=True\n",
784 | ")"
785 | ]
786 | },
787 | {
788 | "cell_type": "markdown",
789 | "metadata": {},
790 | "source": [
791 | "Example of plotted projections and grid search heatmap\n",
792 | "[Figures: three images (plotted projections and the grid search heatmap); markup lost in extraction]"
795 | ]
796 | }
797 | ],
798 | "metadata": {
799 | "kernelspec": {
800 | "display_name": "Python 3",
801 | "language": "python",
802 | "name": "python3"
803 | },
804 | "language_info": {
805 | "codemirror_mode": {
806 | "name": "ipython",
807 | "version": 3
808 | },
809 | "file_extension": ".py",
810 | "mimetype": "text/x-python",
811 | "name": "python",
812 | "nbconvert_exporter": "python",
813 | "pygments_lexer": "ipython3",
814 | "version": "3.6.9"
815 | },
816 | "toc": {
817 | "base_numbering": 1,
818 | "nav_menu": {},
819 | "number_sections": true,
820 | "sideBar": true,
821 | "skip_h1_title": false,
822 | "title_cell": "Table of Contents",
823 | "title_sidebar": "Contents",
824 | "toc_cell": false,
825 | "toc_position": {},
826 | "toc_section_display": true,
827 | "toc_window_display": false
828 | },
829 | "varInspector": {
830 | "cols": {
831 | "lenName": 16,
832 | "lenType": 16,
833 | "lenVar": 40
834 | },
835 | "kernels_config": {
836 | "python": {
837 | "delete_cmd_postfix": "",
838 | "delete_cmd_prefix": "del ",
839 | "library": "var_list.py",
840 | "varRefreshCmd": "print(var_dic_list())"
841 | },
842 | "r": {
843 | "delete_cmd_postfix": ") ",
844 | "delete_cmd_prefix": "rm(",
845 | "library": "var_list.r",
846 | "varRefreshCmd": "cat(var_dic_list()) "
847 | }
848 | },
849 | "types_to_exclude": [
850 | "module",
851 | "function",
852 | "builtin_function_or_method",
853 | "instance",
854 | "_Feature"
855 | ],
856 | "window_display": false
857 | }
858 | },
859 | "nbformat": 4,
860 | "nbformat_minor": 4
861 | }
862 |
--------------------------------------------------------------------------------
/ed.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/ed.png
--------------------------------------------------------------------------------
/methodology_diagram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/methodology_diagram.png
--------------------------------------------------------------------------------
/src/AR_STOPWORDS.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/src/AR_STOPWORDS.pkl
--------------------------------------------------------------------------------
/src/__pycache__/clustering.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/src/__pycache__/clustering.cpython-36.pyc
--------------------------------------------------------------------------------
/src/__pycache__/encoder.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/src/__pycache__/encoder.cpython-36.pyc
--------------------------------------------------------------------------------
/src/__pycache__/preprocessing.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/src/__pycache__/preprocessing.cpython-36.pyc
--------------------------------------------------------------------------------
/src/__pycache__/projection.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/src/__pycache__/projection.cpython-36.pyc
--------------------------------------------------------------------------------
/src/__pycache__/top_terms.cpython-36.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/src/__pycache__/top_terms.cpython-36.pyc
--------------------------------------------------------------------------------
/src/clustering.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | from typing import Optional
4 |
5 | import hdbscan
6 | import matplotlib.pyplot as plt
7 | import numpy as np
8 | import pandas as pd
9 | import seaborn as sns
10 | from sklearn.metrics import classification_report, f1_score
11 | from tqdm import tqdm
12 |
13 | from projection import Projector
14 |
15 |
16 | class Clusterer:
17 |
18 | def __init__(self, projection_path):
19 | self.projection_path = projection_path
20 | self._params = self._load_standard_embeddings()
21 | self.N: int = len(self._params["users"])
22 |
23 | def _load_standard_embeddings(self):
24 | file = np.load(self.projection_path, allow_pickle=True)
25 | params = dict()
26 | for k in file.keys():
27 | params[k] = file[k]
28 | return params
29 |
30 | @staticmethod
31 | def _cluster(standard_embeddings, **kwargs):
32 | return hdbscan.HDBSCAN(**kwargs).fit(standard_embeddings)
33 |
34 | def cluster(self, min_samples: Optional[int] = None, min_cluster_size: Optional[int] = None,
35 | min_samples_divisor: int = 1000, min_cluster_size_divisor: int = 100,
36 | tree_path=None,
37 | **kwargs):
38 | if min_samples is None:
39 | kwargs["min_samples"] = max(10, self.N // min_samples_divisor)
40 | if min_cluster_size is None:
41 | kwargs["min_cluster_size"] = max(10, self.N // min_cluster_size_divisor)
42 |
43 | model = self._cluster(standard_embeddings=self._params["umap"],
44 | **kwargs
45 | )
46 |
47 | self._params["clusters"] = model.labels_
48 | np.savez(open(self.projection_path, 'wb'), **self._params)
49 | if tree_path is not None:
50 | pickle.dump(model.condensed_tree_, open(tree_path, 'wb'), protocol=3)
51 |
52 | @staticmethod
53 | def plot_tree(path):
54 | sns.set(context='notebook', style='white', rc={'figure.figsize': (15, 10)})
55 | return pickle.load(open(path, 'rb')).plot()
56 |
57 | def plot(self, labels_col="clusters"):
58 | return Projector.plot(embeddings=self._params["umap"], labels=self._params[labels_col])
59 |
60 | def inject_labels(self, users, labels):
61 | labels_dict = dict(zip(users, labels))
62 | self._params["labels"] = np.array(
63 | [labels_dict[u] if u in labels_dict else 'unk' for u in self._params["users"]]
64 | )
65 |
66 | def align_clusters_with_labels(self, allow_multiple_clusters=True):
67 | labels = self._params["labels"]
68 | ind = labels != 'unk'
69 | users = self._params["users"][ind]
70 | labels = labels[ind]
71 |
72 | df = pd.DataFrame(
73 | {"username": users, "labels": labels}
74 | ).merge(
75 | pd.DataFrame({"username": self._params["users"], "clusters": self._params["clusters"]})
76 | )
77 |
78 | g = df.groupby(["label", "clusters"]).count().sort_values("username", ascending=False)
79 |
80 | d = {}
81 | while len(g) > 0:
82 | label, cluster = g.index[0]
83 | d[cluster] = label
84 | g = g.reset_index()
85 | g = g[(g.label != label) & (g.clusters != cluster)].set_index(["label", "clusters"]).sort_values("username",
86 | ascending=False)
87 | unlabeled_clusters = set(df.clusters) - set(d.keys())
88 | if allow_multiple_clusters and len(unlabeled_clusters) > 0:
89 | g = df.groupby(["label", "clusters"]).count().sort_values("username", ascending=False).reset_index()
90 | for c in unlabeled_clusters:
91 | l = g.set_index("clusters").loc[c].label
92 | if isinstance(l, pd.Series):
93 | l = l.iloc[0]
94 | d[c] = l
95 |
96 | g = g[g.clusters != c]
97 |
98 | self._params["predictions"] = np.array([d[x] if x in d else 'unk' for x in self._params['clusters']])
99 |
100 | def evaluate(self, metric=f1_score, report=True):
101 | if "predictions" not in self._params:
102 | raise Exception("No labels aligned with clusters")
103 |
104 | y = self._params["labels"]
105 | p = self._params["predictions"]
106 |
107 | ind = y != 'unk'
108 | y = y[ind]
109 | p = p[ind]
110 |
111 | s = set(y)
112 | if report:
113 | return pd.DataFrame(classification_report(y, p, labels=s, output_dict=True))
114 |
115 | return metric(y, p, labels=s, average='micro')
116 |
117 | @staticmethod
118 | def cluster_projection_grid_search(trials_dir, users=None, labels=None, allow_multiple_clusters=True):
119 | results = dict()
120 | for fn in tqdm(os.listdir(trials_dir)):
121 | if not fn.endswith("npz"):
122 | continue
123 | min_dist, n_neighbors = fn.replace(".npz", '').split("_")
124 | projection_path = os.path.join(trials_dir, fn)
125 | c = Clusterer(projection_path)
126 | c.cluster()
127 | # title = f"min_dist:{min_dist}\tn_neighbors:{n_neighbors}".expandtabs()
128 | plot_path = os.path.join(trials_dir, f"{min_dist}_{n_neighbors}.png")
129 | c.inject_labels(users=users, labels=labels)
130 | c.align_clusters_with_labels(allow_multiple_clusters=allow_multiple_clusters)
131 | fig = c.plot()
132 | plt.savefig(plot_path, bbox_inches='tight')
133 | plt.close()
134 |
135 | score = c.evaluate(report=False)  # scalar score per trial, so results can be plotted as a heatmap
136 | results.setdefault(min_dist, dict())
137 | results[min_dist][n_neighbors] = score
138 | return results
139 |
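The greedy matching performed by the `while` loop in `align_clusters_with_labels` can be illustrated without pandas. This is a minimal sketch under assumptions, not the repository's implementation: `greedy_align` is a hypothetical helper, and it covers only the one-to-one pass, not the `allow_multiple_clusters` fallback.

```python
from collections import Counter


def greedy_align(labels, clusters):
    """Map each cluster to one label, most frequent (label, cluster) pair first.

    Hypothetical helper: once a label or a cluster has been assigned,
    every remaining pair that involves either of them is skipped.
    """
    pair_counts = Counter(zip(labels, clusters))
    mapping = {}
    used_labels = set()
    for (label, cluster), _count in pair_counts.most_common():
        if label in used_labels or cluster in mapping:
            continue
        mapping[cluster] = label
        used_labels.add(label)
    return mapping


# Toy data: cluster 0 is mostly "pro", cluster 1 is mostly "anti".
labels = ["pro"] * 5 + ["anti"] * 4 + ["pro"]
clusters = [0] * 5 + [1] * 4 + [1]
mapping = greedy_align(labels, clusters)
```

Ties between equally frequent pairs are broken by insertion order here, which may differ from the sort order pandas produces.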
--------------------------------------------------------------------------------
/src/encoder.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import tensorflow_hub as hub
4 | # noinspection PyUnresolvedReferences
5 | import tensorflow_text
6 | from tqdm import tqdm
7 |
8 |
9 | class Encoder:
10 | DEFAULT_MODEL = "https://tfhub.dev/google/universal-sentence-encoder/4"
11 |
12 | def __init__(self, model_url: str = None):
13 | """
14 | Args:
15 | model_url: str, url to the Universal Sentence Encoder model; falls back to the class's DEFAULT_MODEL when omitted
16 | Default is English USE >> https://tfhub.dev/google/universal-sentence-encoder/4
17 | for the multilingual version, use: https://tfhub.dev/google/universal-sentence-encoder-multilingual/3
18 | more models are available at: https://tfhub.dev/google/collections/universal-sentence-encoder/1
19 | """
20 | self.model_url = model_url or self.DEFAULT_MODEL  # resolved at call time so a subclass's DEFAULT_MODEL is honored
21 | self.encoder = self._load_model()
22 |
23 | def _load_model(self):
24 | return hub.load(self.model_url)
25 |
26 | def encode(self, text):
27 | return np.array(self.encoder(text))
28 |
29 | def encode_df(self, df: pd.DataFrame, out_path: str, user_col: str = "username", text_col: str = "text"):
30 | users = list()
31 | vectors = list()
32 | counts = list()
33 |
34 | for user, tweets in tqdm(df.groupby(user_col)[text_col]):
35 | try:
36 | vs = np.array(self.encoder(tweets.tolist()))
37 | users.append(user)
38 | vectors.append(np.mean(vs, axis=0))
39 | counts.append(len(tweets))
40 | except Exception as e:
41 | print(user)
42 | print(e)
43 |
44 | np.savez(out_path, users=np.array(users), vectors=np.array(vectors), counts=np.array(counts),
45 | allow_pickle=True)
46 |
47 |
48 | class EncoderBERT(Encoder):
49 | DEFAULT_MODEL = "roberta-base-nli-stsb-mean-tokens"
50 |
51 | def _load_model(self):
52 | from sentence_transformers import SentenceTransformer
53 | return SentenceTransformer(self.model_url)
54 |
--------------------------------------------------------------------------------
/src/mutual_information.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | import numpy as np
4 | import pandas as pd
5 | import seaborn as sns
6 | from sklearn.metrics import adjusted_mutual_info_score as ami
7 | from tqdm import tqdm
8 |
9 |
10 | def correlate_clustering(df1, df2, metric_func, clusters_col="labels", user_col="users", **kwargs):
11 | merged = pd.merge(df1[df1[clusters_col] >= 0], df2[df2[clusters_col] >= 0], on=user_col)
12 | y1, y2 = merged[f"{clusters_col}_x"], merged[f"{clusters_col}_y"]  # column names match the frames built in mutual_information
13 | return metric_func(y1, y2, **kwargs)
14 |
15 |
16 | def calculate_alignment_matrix(dfs, metric_func, **kwargs):
17 | matrix = np.zeros((len(dfs), len(dfs)))
18 | for i, df1 in tqdm(enumerate(dfs)):
19 | for j, df2 in enumerate(dfs):
20 | matrix[i][j] = correlate_clustering(df1, df2, metric_func, **kwargs)
21 | return matrix
22 |
23 |
24 | def plot_heatmap(frames, topics, func=ami):
25 | hm = calculate_alignment_matrix(frames, func)
26 | hm = pd.DataFrame(hm, columns=topics, index=topics).loc[reversed(topics)]
27 | fig = sns.heatmap(
28 | hm.round(2),
29 | annot=True,
30 | cmap="Blues",
31 | annot_kws={"size": 30},
32 | # yticklabels=[i.title() for i in topics]
33 | )
34 | fig.set_yticklabels(labels=reversed(topics), rotation=45)
35 | fig.set_xticklabels(labels=topics, rotation=45)
36 | n = min(len(topics) * 2, 18)
37 | sns.set(context='notebook', style='white', rc={'figure.figsize': (n, n)}, font_scale=3.5)
38 | return fig
39 |
40 |
41 | def mutual_information(topics, root="topicals"):
42 | frames = list()
43 | for topic in tqdm(topics):
44 | f = np.load(os.path.join(root, f"{topic}.npz"))  # no leading slash: an absolute component would discard root
45 | users = f["users"]
46 | clusters = f["clusters"]
47 | frames.append(pd.DataFrame({"users": users, "labels": clusters}))
48 |
49 | return plot_heatmap(frames, topics)
--------------------------------------------------------------------------------
/src/preprocessing.py:
--------------------------------------------------------------------------------
1 | import re
2 |
3 | import preprocessor as p
4 |
5 | p.set_options(p.OPT.URL, p.OPT.MENTION)
6 |
7 |
8 | def camel_case_split(identifier):
9 | matches = re.finditer('.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
10 | return [m.group(0) for m in matches]
11 |
12 |
13 | def clean(text):
14 | text = p.clean(text)
15 | text = re.sub(r'^RT ', '', text)
16 | text = ' '.join(camel_case_split(text))
17 | text = re.sub(r'\W+', ' ', text)
18 | text = re.sub(r"\d+", "number", text)
19 | if len(text.strip().split()) < 3:
20 | return None
21 | return text.lower().strip()
22 |
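The cleaning steps in `clean` can be tried out without the `tweet-preprocessor` dependency. This is a minimal sketch under assumptions: `clean_sketch` is a hypothetical name, and its first regex is only a rough stand-in for `p.clean` with the URL and MENTION options, so its output may differ from the library's on edge cases.

```python
import re


def camel_case_split(identifier):
    # Split at lower->Upper boundaries and before the last capital of an acronym.
    matches = re.finditer(
        r'.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]


def clean_sketch(text, min_tokens=3):
    # Assumption: approximate removal of URLs and @mentions (stand-in for p.clean).
    text = re.sub(r'https?://\S+|@\w+', '', text)
    text = re.sub(r'^RT ', '', text)           # drop the retweet prefix
    text = ' '.join(camel_case_split(text))     # "LiverpoolFC" -> "Liverpool FC"
    text = re.sub(r'\W+', ' ', text)            # collapse punctuation and whitespace
    text = re.sub(r'\d+', 'number', text)       # replace digit runs with a placeholder
    if len(text.strip().split()) < min_tokens:  # too short to be informative
        return None
    return text.lower().strip()


example = clean_sketch("RT @user: Go #LiverpoolFC 2 wins https://t.co/x")
```

Texts with fewer than three tokens after cleaning come back as `None`, so callers need to drop those rows before encoding.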
--------------------------------------------------------------------------------
/src/projection.py:
--------------------------------------------------------------------------------
1 | import os
2 | from itertools import product
3 |
4 | import matplotlib.pyplot as plt
5 | import numpy as np
6 | import pandas as pd
7 | import seaborn as sns
8 | from tqdm import tqdm
9 | from umap import UMAP
10 |
11 |
12 | class Projector:
13 | DEFAULT_UMAP_PARAMS = dict(
14 | n_components=2,
15 | min_dist=0,
16 | n_neighbors=90,
17 | metric="cosine",
18 | random_state=42
19 | )
20 | DEFAULT_DIST_RANGE = [0.0, 0.1, 0.25, 0.5, 0.75, 0.8, 0.9, 0.99]
21 | DEFAULT_NEIGHBORS_RANGE = [20, 30, 40, 50, 60, 70, 80, 90, 100]
22 |
23 | def __init__(self, vectors_path):
24 | self.vectors_path = vectors_path
25 | self.users, self.vectors, self.counts = self._load_vectors(vectors_path)
26 |
27 | @staticmethod
28 | def _load_vectors(vectors_path):
29 | file = np.load(vectors_path)
30 | users: np.ndarray = file['users']
31 | vectors: np.ndarray = file['vectors']
32 | counts: np.ndarray = file['counts']
33 | return users, vectors, counts
34 |
35 | @staticmethod
36 | def _project(vectors, **kwargs):
37 | return UMAP(**kwargs).fit_transform(vectors)
38 |
39 | def project(self, out_path, min_counts=3, **kwargs):
40 | params = self.DEFAULT_UMAP_PARAMS.copy()
41 | params.update(kwargs)
42 |
43 | ind = self.counts >= min_counts
44 | users = self.users[ind]
45 | vectors = self.vectors[ind]
46 |
47 | standard_embeddings = self._project(
48 | vectors=vectors,
49 | **params
50 | )
51 | np.savez(open(out_path, 'wb'),
52 | umap=standard_embeddings, users=users)
53 |
54 | @staticmethod
55 | def plot_grid_search_heatmap(results, heatmap_destination="temp.png"):
56 | hm = pd.DataFrame(results)
57 | hm.index = hm.index.astype(int)
58 | hm = hm.sort_index(ascending=False)
59 | x = sorted(hm.columns)
60 | hm.index.name = "n_neighbors"
61 | hm.columns.name = "min_dist"
62 |
63 | sns.set(context='notebook', style='white', rc={'figure.figsize': (len(hm) * 2, len(hm.columns) * 2)},
64 | font_scale=2.5)
65 |
66 | sns.heatmap(hm[x], annot=True, cmap="Blues", annot_kws={"size": 30}, vmin=0.3, vmax=1, cbar=False)
67 | plt.savefig(heatmap_destination, bbox_inches='tight')
68 |
69 | @staticmethod
70 | def plot(embeddings, labels):
71 | fig = plt.figure()
72 | ax = fig.add_subplot(111)
73 | scatter = plt.scatter(embeddings[:, 0], embeddings[:, 1],
74 | c=labels, s=0.1, cmap='Spectral')
75 | return scatter
76 |
77 | def grid_search(self, trials_dir, min_dists_range=DEFAULT_DIST_RANGE, n_neighbors_range=DEFAULT_NEIGHBORS_RANGE,
78 | n_components=2,
79 | metric="cosine", min_counts=3,
80 | skip_existing=True, verbose=False):
81 | ind = self.counts >= min_counts
82 | users = self.users[ind]
83 | vectors = self.vectors[ind]
84 |
85 | umap_params = list(product(min_dists_range, n_neighbors_range))
86 | for min_dist, n in tqdm(umap_params, desc="UMAP"):
87 | if verbose:
88 | print(f"{min_dist}_{n}")
89 | out_path = os.path.join(trials_dir, f"{min_dist}_{n}.npz")
90 | if os.path.isfile(out_path) and skip_existing:
91 | continue
92 |
93 | standard_embeddings = self._project(
94 | vectors=vectors,
95 | random_state=42,
96 | n_components=n_components,
97 | n_neighbors=n,
98 | min_dist=min_dist,
99 | metric=metric
100 | )
101 | np.savez(open(out_path, 'wb'),
102 | umap=standard_embeddings, users=users)
103 |
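`grid_search` writes one `{min_dist}_{n_neighbors}.npz` file per parameter pair, and `Clusterer.cluster_projection_grid_search` later recovers the parameters by splitting the filename. A small sketch of that naming round trip, assuming the default ranges above; `trial_filename` and `parse_trial_filename` are hypothetical helpers, and the conversion to numbers is an addition (the repository keeps the parsed values as strings).

```python
from itertools import product

# Same values as Projector.DEFAULT_DIST_RANGE / DEFAULT_NEIGHBORS_RANGE.
DIST_RANGE = [0.0, 0.1, 0.25, 0.5, 0.75, 0.8, 0.9, 0.99]
NEIGHBORS_RANGE = [20, 30, 40, 50, 60, 70, 80, 90, 100]


def trial_filename(min_dist, n_neighbors):
    """Build the file name used for one UMAP trial."""
    return f"{min_dist}_{n_neighbors}.npz"


def parse_trial_filename(fn):
    """Recover (min_dist, n_neighbors) from a trial file name."""
    min_dist, n_neighbors = fn.replace(".npz", "").split("_")
    return float(min_dist), int(n_neighbors)


grid = list(product(DIST_RANGE, NEIGHBORS_RANGE))
names = [trial_filename(d, n) for d, n in grid]
round_trip = [parse_trial_filename(name) for name in names]
```

Because the scheme splits on `_`, it relies on neither parameter containing an underscore when formatted.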
--------------------------------------------------------------------------------
/src/top_terms.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 | from collections import Counter
4 |
5 | import matplotlib.pyplot as plt
6 | import numpy as np
7 | import pandas as pd
8 | from PIL import Image
9 | from ar_wordcloud import ArabicWordCloud
10 | from joblib import Parallel, delayed
11 | from tqdm.notebook import tqdm
12 | from wordcloud import STOPWORDS, WordCloud
13 |
14 |
15 | def get_word_counts(file, text_col=None):
16 | res = {}
17 | tfg = 0
18 | pbar = tqdm(desc=file.split("/")[-1])
19 | with open(file) as f:
20 | for i, l in enumerate(f, 1):
21 | l = l.replace('.', '').strip().lower()
22 | if text_col is not None:
23 | l = l.split('\t')[text_col]
24 | for w in l.split():
25 | if len(w) <= 2:
26 | continue
27 | res.setdefault(w, 0)
28 | res[w] += 1
29 | tfg += 1
30 | if i % 10_000 == 0:
31 | pbar.update(10_000)  # tqdm.update takes an increment, not the running total
32 | return res, tfg
33 |
34 |
35 | def count_words_csv(text_series):
36 | counter = Counter()
37 | text_series.apply(lambda x: counter.update(x.lower().strip().split()))
38 | return dict(counter), sum(counter.values())
39 |
40 |
41 | def valence_step(tfe1, tfg1, tfe2, tfg2, out, e):
42 | a = tfe1 / tfg1
43 | b = tfe2 / tfg2
44 | v1 = 2 * (a / (a + b)) - 1
45 | if v1 >= 0.8:
46 | out.write(f"{v1 * np.log(tfe1)}\t{e}\t{v1}\t{tfe1}\n")
47 |
48 |
49 | def sort_scores(file):
50 | pd.read_csv(
51 | file, sep='\t', names=["score", "term", "valence", "frequency"]
52 | ).sort_values(
53 | "score", ascending=False
54 | ).to_csv(
55 | file.replace("txt", "tsv"), sep='\t', index=None
56 | )
57 | os.remove(file)
58 |
59 |
60 | def valence(tf1, tfg1, tf2, tfg2, out):
61 | with open(out, 'w') as o:
62 | Parallel(n_jobs=-1, backend='threading')(
63 | delayed(valence_step)(
64 | tfe, tfg1, 0 if e not in tf2 else tf2[e], tfg2, o, e
65 | ) for e, tfe in tf1.items() if len(e) > 2
66 | )
67 | print("Sorting terms")
68 | sort_scores(out)
69 |
70 |
71 | def pipeline(df1, df2, out1, out2=None, text_col='text'):
72 | print("Counting terms...")
73 | (tf1, tfg1), (tf2, tfg2) = Parallel(n_jobs=2, backend='threading')(
74 | delayed(count_words_csv)(df[text_col]) for df in [df1, df2])
75 | del df1, df2
76 | print("Calculating valence for group 1 ...")
77 | valence(tf1, tfg1, tf2, tfg2, out1)
78 | if out2 is not None:
79 | print("Calculating valence for group 2 ...")
80 | valence(tf2, tfg2, tf1, tfg1, out2)
81 |
82 |
83 | def plot_worcloud(file, mask_path=None, arabic=False):
84 | params = dict(width=800, height=800,
85 | background_color='white',
86 | min_font_size=10)
87 | if mask_path is not None:
88 | params["mask"] = np.array(Image.open(mask_path))
89 |
90 | scores = pd.read_csv(file, sep='\t').dropna()
91 | is_en = lambda x: bool(re.search('[a-z]', x.lower()))
92 | if arabic:
93 | import pickle
94 | params['stopwords'] = pickle.load(open('AR_STOPWORDS.pkl', 'rb'))
95 | scores = scores[~scores.term.apply(is_en)]
96 | else:
97 | params['stopwords'] = set(STOPWORDS)
98 | scores = scores[scores.term.apply(is_en)]
99 | scores = scores[:500].set_index("term").to_dict()["score"]
100 | if arabic:
101 | wordcloud = ArabicWordCloud(**params)
102 | fig = wordcloud.from_dict(scores)
103 | else:
104 | wordcloud = WordCloud(**params)
105 | fig = wordcloud.generate_from_frequencies(scores)
106 | plt.figure(figsize=(8, 8), facecolor=None)
107 | plt.imshow(wordcloud)
108 | plt.axis("off")
109 | plt.tight_layout(pad=0)
110 | plt.savefig(f"{file}.png")
111 |
112 |
113 | def calculate_top_terms(clusters_path, tweets_path, prefix, user_col, text_col, use_clusters=True, mask_path=None):
114 | enf = np.load(clusters_path)
115 | df = pd.read_pickle(tweets_path)
116 | users, clusters = enf["users"], enf["clusters"]
117 | if use_clusters:
118 | labels = dict(zip(users, clusters))
119 | ind = clusters >= 0
120 | else:
121 | y = np.array(
122 | [1 if re.search("(lfc)|(liverpool)", x.lower()) else 0 if re.search("(cfc)|(chelsea)", x.lower()) else -1
123 | for x in enf["users"]])
124 | ind = y >= 0
125 | labels = dict(zip(users, y))
126 | df = df[df[user_col].apply(lambda x: x in labels)]
127 | df = df.assign(label=df[user_col].apply(lambda x: labels[x]))
128 |
129 | o1 = os.path.join("terms", f"{prefix}.0.txt")
130 | o2 = os.path.join("terms", f"{prefix}.1.txt")
131 | pipeline(df[df.label == 0], df[df.label == 1],
132 | out1=o1,
133 | out2=o2,
134 | text_col=text_col
135 | )
136 |
137 | for o in (o1, o2):
138 | plot_worcloud(o.replace("txt", "tsv"), mask_path=mask_path)  # sort_scores rewrites each .txt file as .tsv
139 |
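The score computed in `valence_step` can be written as a pure function, which makes it easier to check by hand. This is a sketch, not the repository's code path: `valence_score` is a hypothetical name, and it returns the values instead of writing a line to a file. With relative frequencies a = tfe1 / tfg1 and b = tfe2 / tfg2, the valence is v = 2 * a / (a + b) - 1, and a term is kept only when v >= 0.8, scored as v * log(tfe1).

```python
import math


def valence_score(tfe1, tfg1, tfe2, tfg2, threshold=0.8):
    """Return (score, valence) for a term, or None if it is below the threshold.

    tfe1/tfe2: the term's frequency in group 1 / group 2 (tfe1 must be > 0).
    tfg1/tfg2: total token count of group 1 / group 2.
    """
    a = tfe1 / tfg1
    b = tfe2 / tfg2
    v = 2 * (a / (a + b)) - 1  # ranges from -1 (only group 2) to 1 (only group 1)
    if v < threshold:
        return None
    return v * math.log(tfe1), v


exclusive = valence_score(50, 1000, 0, 1000)   # term appears only in group 1
balanced = valence_score(10, 100, 10, 100)     # equally frequent in both groups
```

A term used only by group 1 gets valence 1, so its score reduces to the log of its frequency; a term used equally by both groups has valence 0 and is dropped.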
--------------------------------------------------------------------------------
/src/turkish_normalizer.py:
--------------------------------------------------------------------------------
1 | # get zemberek from https://github.com/ahmetaa/zemberek-nlp
2 |
3 | from os.path import join
4 |
5 | from jpype import JClass, JString, getDefaultJVMPath, startJVM
6 |
7 | ZEMBEREK_PATH: str = join('zemberek', 'bin', 'zemberek-full.jar')
8 |
9 | startJVM(
10 | getDefaultJVMPath(),
11 | '-ea',
12 | f'-Djava.class.path={ZEMBEREK_PATH}',
13 | convertStrings=False
14 | )
15 |
16 | TurkishMorphology: JClass = JClass('zemberek.morphology.TurkishMorphology')
17 | TurkishSentenceNormalizer: JClass = JClass(
18 | 'zemberek.normalization.TurkishSentenceNormalizer'
19 | )
20 | Paths: JClass = JClass('java.nio.file.Paths')
21 |
22 | normalizer = TurkishSentenceNormalizer(
23 | TurkishMorphology.createWithDefaults(),
24 | Paths.get(
25 | join('zemberek', 'data', 'normalization')
26 | ),
27 | Paths.get(
28 | join('zemberek', 'data', 'lm', 'lm.2gram.slm')
29 | )
30 | )
31 |
32 |
33 | def normalize(text):
34 | return str(normalizer.normalize(JString(text)))
35 |
36 |
37 | def normalize_df(df, text_col):
38 | df[text_col] = df[text_col].apply(normalize)
39 | return df
40 |
--------------------------------------------------------------------------------
/trials/0.0_30.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/trials/0.0_30.png
--------------------------------------------------------------------------------
/trials/0.1_60.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/trials/0.1_60.png
--------------------------------------------------------------------------------
/trials/hm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/trials/hm.png
--------------------------------------------------------------------------------
/wc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AmmarRashed/UnsupervisedStanceDetection/a8c75bcc31928c58da1485455d0252f16615574f/wc.png
--------------------------------------------------------------------------------