├── .gitignore
├── LICENSE
├── README.md
├── config
│   ├── glove_sample_config.yml
│   ├── lsa_sample_config.yml
│   └── word2vec_sample_config.yml
├── data
│   ├── fake_data.txt
│   └── text8.zip
├── main.py
├── matrix
│   ├── PIP_loss_calculator.py
│   ├── __init__.py
│   ├── glove_matrix.py
│   ├── ppmi_lsa_matrix.py
│   ├── signal_matrix.py
│   ├── signal_matrix_factory.py
│   └── word2vec_matrix.py
├── requirements.txt
├── test
│   ├── __init__.py
│   └── test_tokenizer.py
└── utils
    ├── __init__.py
    ├── reader.py
    └── tokenizer.py

/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2018 Zi
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Word Embedding Dimensionality Selection
2 | 
3 | This repo implements the dimensionality selection procedure for word embeddings. The procedure is proposed
4 | in the following papers, based on the notion of Pairwise Inner Product (PIP) loss. No longer pick 300 as your word embedding dimensionality!
5 | 
6 | - Paper:
7 |   * Conference Version: https://nips.cc/Conferences/2018/Schedule?showEvent=12567
8 |   * arXiv: https://arxiv.org/abs/1812.04224
9 | - Slides: https://www.dropbox.com/s/9tix9l4h39k4agn/main.pdf?dl=0
10 | - Video of the NeurIPS talk: https://www.facebook.com/nipsfoundation/videos/vb.375737692517476/745243882514297/?type=2&theater
11 | 
12 | ```
13 | @inproceedings{yin2018dimensionality,
14 |   title={On the Dimensionality of Word Embedding},
15 |   author={Yin, Zi and Shen, Yuanyuan},
16 |   booktitle={Advances in Neural Information Processing Systems},
17 |   year={2018}
18 | }
19 | ```
20 | and
21 | ```
22 | @article{yin2018pairwise,
23 |   title={Understand Functionality and Dimensionality of Vector Embeddings: the Distributional Hypothesis, the Pairwise Inner Product Loss and Its Bias-Variance Trade-off},
24 |   author={Yin, Zi},
25 |   journal={arXiv preprint arXiv:1803.00502},
26 |   year={2018}
27 | }
28 | ```
29 | 
30 | Currently, we implement the dimensionality selection procedure for the following algorithms:
31 | - Word2Vec (skip-gram)
32 | - GloVe
33 | - Latent Semantic Analysis (LSA)
34 | 
35 | ## How to use the tool
36 | The tool computes the optimal dimensionality for an algorithm on a corpus. For example, you can use it to
37 | obtain the dimensionality for your Word2Vec embedding on the Text8 corpus.
38 | You need to provide the following:
39 | - A corpus (--file [path to corpus])
40 | - A config file (yaml) for algorithm-specific parameters (--config_file [path to config file])
41 | - The name of the algorithm (--algorithm [algorithm_name])
42 | 
43 | Run from the root directory as a package, e.g.:
44 | 
45 | `python -m main --file data/text8.zip --config_file config/word2vec_sample_config.yml --algorithm word2vec`
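46 | 
47 | The script prints the estimated optimal dimensionality and saves its intermediate estimates
48 | (`estimates.yml`, singular-value and PIP-loss pickles) and a PIP-loss plot under `params/[SignalMatrix subclass]/`.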
49 | 
50 | ## Implement your own
51 | You can extend the implementation if you have another embedding algorithm that is based on matrix factorization.
52 | The only thing to do is to implement your matrix estimator as a subclass of SignalMatrix, as sketched below.
53 | 
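54 | For example, a minimal estimator that factorizes a plain co-occurrence count matrix could look like
55 | the sketch below (`CountMatrix` is a hypothetical example, not part of this repo):
56 | 
57 | ```python
58 | import collections
59 | 
60 | import numpy as np
61 | 
62 | from matrix.signal_matrix import SignalMatrix
63 | 
64 | class CountMatrix(SignalMatrix):
65 |     def inject_params(self, kwargs):
66 |         self._params = kwargs
67 | 
68 |     def construct_matrix(self, data):
69 |         # data is the indexed corpus: a list of word ids in {0, ..., vocabulary_size - 1}
70 |         counts = collections.Counter(zip(data[:-1], data[1:]))
71 |         M = np.zeros([self.vocabulary_size, self.vocabulary_size])
72 |         for (i, j), c in counts.items():
73 |             M[i, j] = c  # raw bigram counts; real estimators apply a PMI-style transform
74 |         return M
75 | ```
76 | 
77 | To make the new estimator selectable from the command line, add a branch for it in
78 | `SignalMatrixFactory.produce`.
79 | 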
--------------------------------------------------------------------------------
/config/glove_sample_config.yml:
--------------------------------------------------------------------------------
1 | skip_window: 5
2 | vocabulary_size: 10000
3 | min_count: 100
4 | 
--------------------------------------------------------------------------------
/config/lsa_sample_config.yml:
--------------------------------------------------------------------------------
1 | skip_window: 5
2 | vocabulary_size: 10000
3 | min_count: 100
4 | 
--------------------------------------------------------------------------------
/config/word2vec_sample_config.yml:
--------------------------------------------------------------------------------
1 | skip_window: 5
2 | # Negative sampling draws from the unigram distribution raised to the 3/4 power rather than the
3 | # uniform one, so the SPMI approximation gets worse as neg_samples grows. Usually neg_samples <= 4
4 | # gives accurate results; values between 1 and 5 are suggested (Mikolov et al.).
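5 | # (Editorial note: SPMI here is the shifted PMI, log(p(i,j)/(p(i)p(j))) - log(neg_samples), which
6 | #  skip-gram with negative sampling implicitly factorizes; cf. Levy & Goldberg, 2014.)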
7 | neg_samples: 1
8 | vocabulary_size: 10000
9 | min_count: 100
10 | 
--------------------------------------------------------------------------------
/data/fake_data.txt:
--------------------------------------------------------------------------------
1 | a b d c a s asdf asd a a d f i
2 | 
--------------------------------------------------------------------------------
/data/text8.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ziyin-dl/word-embedding-dimensionality-selection/cec3fb03deec9fb7117f1fb6a29d725ee8682414/data/text8.zip
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | 
5 | import argparse
6 | import collections
7 | import yaml
8 | 
9 | from matrix.signal_matrix_factory import SignalMatrixFactory
10 | from matrix.PIP_loss_calculator import MonteCarloEstimator
11 | from utils.tokenizer import SimpleTokenizer
12 | from utils.reader import ReaderFactory
13 | 
14 | if __name__ == "__main__":
15 |     parser = argparse.ArgumentParser(description='select the embedding dimensionality that minimizes the PIP loss')
16 |     parser.add_argument('--algorithm', required=True, type=str, help='embedding algorithm')
17 |     parser.add_argument('--file', required=True, type=str, help='corpus file')
18 |     parser.add_argument('--config_file', required=True, type=str, help='config file for the algorithm containing parameter settings')
19 |     args = parser.parse_args()
20 | 
21 |     config_file = args.config_file
22 | 
23 |     with open(config_file, "r") as f:
24 |         cfg = yaml.safe_load(f)
25 | 
26 |     reader = ReaderFactory.produce(args.file[-3:])
27 |     data = reader.read_data(args.file)
28 |     tokenizer = SimpleTokenizer()
29 |     indexed_corpus = tokenizer.do_index_data(data,
30 |                                              n_words=cfg.get('vocabulary_size'),
31 |                                              min_count=cfg.get('min_count'))
32 |     factory = SignalMatrixFactory(indexed_corpus)
33 | 
34 |     signal_matrix = factory.produce(args.algorithm.lower())
35 |     path = signal_matrix.param_dir
36 |     signal_matrix.inject_params(cfg)
37 |     signal_matrix.estimate_signal()
38 |     signal_matrix.estimate_noise()
39 |     signal_matrix.export_estimates()
40 | 
41 |     pip_calculator = MonteCarloEstimator()
42 |     pip_calculator.get_param_file(path, "estimates.yml")
43 |     pip_calculator.estimate_signal()
44 |     pip_calculator.estimate_pip_loss()
45 |     pip_calculator.plot_pip_loss()
46 | 
--------------------------------------------------------------------------------
/matrix/PIP_loss_calculator.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | 
5 | try:
6 |     import cPickle as pickle
7 | except ImportError:
8 |     import pickle
9 | import os
10 | import matplotlib
11 | matplotlib.use('Agg')
12 | import matplotlib.pyplot as plt
13 | import numpy as np
14 | import random
15 | import time
16 | import yaml
17 | 
18 | 
19 | class MonteCarloEstimator():
20 |     def __init__(self):
21 |         pass
22 | 
23 |     def _soft_threshold(self, x, tau):
24 |         if x > tau:
25 |             return x - tau
26 |         else:
27 |             return 0
28 | 
29 |     def _generate_random_orthogonal_matrix(self, shape):
30 |         assert len(shape) == 2
31 |         assert shape[0] >= shape[1]
32 |         X = np.random.normal(0, 1, shape)
33 |         U, _, _ = np.linalg.svd(X, full_matrices=False)
34 |         return U
35 | 
36 | 
37 |     def get_param_file(self, param_path, filename):
38 |         param_file = os.path.join(param_path, filename)
39 |         with open(param_file, "r") as f:
40 |             cfg = yaml.safe_load(f)
41 |         self.param_path = param_path
42 |         self.alpha = float(cfg["alpha"])
43 |         self.estimated_sigma = float(cfg["sigma"])
44 |         self.lambda_filename = cfg["lambda"]
45 |         with open(os.path.join(param_path, self.lambda_filename), 'rb') as f:
46 |             self.empirical_signal = pickle.load(f)
47 | 
48 |     def estimate_signal(self):
49 |         self.estimated_signal = list(map(lambda x: self._soft_threshold(x, 2 * self.estimated_sigma * np.sqrt(len(self.empirical_signal))), self.empirical_signal))
50 |         rank = len(self.estimated_signal)
51 |         for i in range(len(self.estimated_signal)):
52 |             if self.estimated_signal[i] == 0:
53 |                 rank = i
54 |                 break
55 |         self.rank = rank
56 | 
57 |     def estimate_pip_loss(self):
58 |         D = self.estimated_signal
59 |         rank = self.rank
60 |         n = len(self.estimated_signal)
61 |         sigma = self.estimated_sigma
62 |         shape = (n, n)
63 |         alpha = self.alpha
64 |         print("n={}, rank={}, sigma={}".format(n, rank, sigma))
65 | 
66 |         D_gen = D[:rank]
67 |         U_gen = self._generate_random_orthogonal_matrix((n, rank))
68 |         V_gen = self._generate_random_orthogonal_matrix((n, rank))
69 |         true_dims = range(rank)
70 | 
71 |         X = (U_gen * D_gen).dot(V_gen.T)
72 | 
73 |         E = np.random.normal(0, sigma, size=shape)
74 |         estimation_noise_E = E
75 | 
76 |         Y = X + estimation_noise_E
77 | 
78 |         U, D, V = np.linalg.svd(X)
79 |         U1, D1, V1 = np.linalg.svd(Y)
80 | 
81 |         embed_gt = U[:,true_dims] * (D[true_dims] ** alpha)
82 |         sim_gt = embed_gt.dot(embed_gt.T)
83 | 
84 |         spectrum = D ** alpha
85 |         spectrum_est = D1 ** alpha
86 |         embed = U * spectrum
87 |         embed_est = U1 * spectrum_est
88 | 
89 |         sim_est = None
90 | 
91 |         """ the "dumb method" does every step as is:
92 |             a) loop through every dimensionality k
93 |             b) for every k, calculate the dim-k estimated embedding
94 |             c) compare it with the oracle ((g)round (t)ruth) embedding
95 |             d) record the PIP loss
96 |             e) select the dimensionality k that minimizes the PIP loss
97 | 
98 |             Now, the "smart method" computes exactly the same losses, but an order of
99 |             magnitude faster, using a simple linear algebra trick. Readers can verify that the
100 |             two methods give the same results.
101 |         """
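102 |         # Why the two methods agree (a sketch, editorial note): with orthonormal singular
103 |         # vectors, ||E_k E_k^T - G G^T||_F^2 = ||G G^T||_F^2 + sum_{i<=k} (d_i^4 - 2 ||e_i^T G||^2),
104 |         # where e_i is the i-th column of embed_est, d_i = spectrum_est[i-1], E_k the first k
105 |         # columns, and G = embed_gt; so each kept dimension adds one increment, computed below.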
106 |         dumb_method = False
107 |         if dumb_method:
108 |             time_add = 0.0
109 |             time_norm = 0.0
110 |             frobenius_list_est_to_gt = []
111 |             for keep_dim in range(1, rank + 1):
112 |                 t0 = time.time()
113 |                 if sim_est is None:
114 |                     sim_est = embed_est[:,:keep_dim].dot(embed_est[:,:keep_dim].T)
115 |                 else:
116 |                     sim_est += np.outer(embed_est[:,keep_dim-1], embed_est[:,keep_dim-1])
117 |                 time_add += time.time() - t0
118 |                 t0 = time.time()
119 |                 sim_diff_est_to_gt = np.linalg.norm(sim_est - sim_gt, 'fro')
120 |                 time_norm += time.time() - t0
121 |                 frobenius_list_est_to_gt.append(sim_diff_est_to_gt)
122 |             self.estimated_pip_loss = frobenius_list_est_to_gt
123 |         else:
124 |             time_norm = 0.0
125 |             frobenius_list_est_to_gt = [np.linalg.norm(spectrum ** 2) ** 2]  # ||G G^T||_F^2, the squared PIP loss at k = 0
126 |             for keep_dim in range(1, rank + 1):
127 |                 t0 = time.time()
128 |                 diff = frobenius_list_est_to_gt[keep_dim-1] + spectrum_est[keep_dim-1] ** 4 - 2 * (
129 |                     np.linalg.norm(embed_est[:, keep_dim-1].T.dot(embed_gt)) ** 2)
130 |                 time_norm += time.time() - t0
131 |                 frobenius_list_est_to_gt.append(diff)
132 |             self.estimated_pip_loss = list(map(np.sqrt, frobenius_list_est_to_gt[1:]))
133 |         with open(os.path.join(self.param_path, "pip_loss_{}.pkl".format(self.alpha)), 'wb') as f:
134 |             pickle.dump(self.estimated_pip_loss, f)
135 | 
136 | 
137 |     def plot_pip_loss(self):
138 |         with open(os.path.join(self.param_path, "pip_loss_{}.pkl".format(self.alpha)), 'rb') as f:
139 |             frobenius_list_est_to_gt = pickle.load(f)
140 |         print("optimal dimensionality is {}".format(np.argmin(frobenius_list_est_to_gt) + 1))  # index i holds the loss for dimensionality i + 1
141 |         fig = plt.figure()
142 |         ax = fig.add_subplot(111)
143 |         ax.plot(range(1, len(frobenius_list_est_to_gt) + 1), frobenius_list_est_to_gt, color='aqua', label=r'PIP loss')
144 |         lgd = ax.legend(loc='upper right')
145 |         plt.title(r'PIP Loss')
146 |         fig_path = '{}/pip_{}.pdf'.format(self.param_path, self.alpha)
147 |         fig.savefig(fig_path, bbox_extra_artists=(lgd,), bbox_inches='tight')
148 |         print("a plot of the loss is saved at {}".format(fig_path))
149 |         plt.close()
150 | 
151 | 
152 | 
--------------------------------------------------------------------------------
/matrix/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ziyin-dl/word-embedding-dimensionality-selection/cec3fb03deec9fb7117f1fb6a29d725ee8682414/matrix/__init__.py
--------------------------------------------------------------------------------
/matrix/glove_matrix.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | 
5 | import collections
6 | import numpy as np
7 | import warnings
8 | 
9 | from matrix.signal_matrix import SignalMatrix
10 | 
11 | class GloVeMatrix(SignalMatrix):
12 | 
13 |     def inject_params(self, kwargs):
14 |         self._params = kwargs
15 |         if "skip_window" not in self._params:
16 |             self._params["skip_window"] = 5
17 |         self.check_params()
18 | 
19 |     def check_params(self):
20 |         if isinstance(self._params["skip_window"], int) and self._params["skip_window"] > 0:
21 |             pass
22 |         else:
23 |             raise ValueError("skip_window must be a positive integer")
24 | 
25 |     def build_cooccurance_dict(self, data):
26 |         skip_window = self._params["skip_window"]
27 |         vocabulary_size = self.vocabulary_size
28 |         cooccurance_count = collections.defaultdict(collections.Counter)
29 |         for idx, center_word_id in enumerate(data):
30 |             if center_word_id + 1 > vocabulary_size:
31 |                 vocabulary_size = center_word_id + 1
32 |             for i in range(max(idx - skip_window, 0), min(idx + skip_window + 1, len(data))):
33 |                 cooccurance_count[center_word_id][data[i]] += 1
34 |             cooccurance_count[center_word_id][center_word_id] -= 1  # remove the self-co-occurrence counted above
35 |         return cooccurance_count, vocabulary_size
36 | 
37 | 
38 |     def construct_matrix(self, data):
39 |         cooccur, vocabulary_size = self.build_cooccurance_dict(data)
40 | 
41 |         Nij = np.ones([vocabulary_size, vocabulary_size])
42 |         for i in range(vocabulary_size):
43 |             for j in range(vocabulary_size):
44 |                 Nij[i,j] += cooccur[i][j]
45 |         with warnings.catch_warnings():
46 |             """Nij is initialized to ones (add-one smoothing), so log(0) cannot occur here."""
47 |             # c.f. Pennington et al., GloVe: Global Vectors for Word Representation, 2014
48 |             log_count = np.log(Nij)
49 |         return log_count
50 | 
51 | 
--------------------------------------------------------------------------------
/matrix/ppmi_lsa_matrix.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | 
5 | import collections
6 | import numpy as np
7 | import warnings
8 | 
9 | from six.moves import xrange  # pylint: disable=redefined-builtin
10 | 
11 | from matrix.signal_matrix import SignalMatrix
12 | 
13 | class LSAMatrix(SignalMatrix):
14 | 
15 |     def inject_params(self, kwargs):
16 |         self._params = kwargs
17 |         if "skip_window" not in self._params:
18 |             self._params["skip_window"] = 5
19 |         self.check_params()
20 | 
21 |     def check_params(self):
22 |         if isinstance(self._params["skip_window"], int) and self._params["skip_window"] > 0:
23 |             pass
24 |         else:
25 |             raise ValueError("skip_window must be a positive integer")
26 | 
27 |     def build_cooccurance_dict(self, data):
28 |         skip_window = self._params["skip_window"]
29 |         vocabulary_size = self.vocabulary_size
30 |         cooccurance_count = collections.defaultdict(collections.Counter)
31 |         for idx, center_word_id in enumerate(data):
32 |             if center_word_id + 1 > vocabulary_size:
33 |                 vocabulary_size = center_word_id + 1
34 |             for i in range(max(idx - skip_window, 0), min(idx + skip_window + 1, len(data))):
35 |                 cooccurance_count[center_word_id][data[i]] += 1
36 |             cooccurance_count[center_word_id][center_word_id] -= 1  # remove the self-co-occurrence counted above
37 |         return cooccurance_count, vocabulary_size
38 | 
39 | 
40 |     def construct_matrix(self, data):
41 |         cooccur, vocabulary_size = self.build_cooccurance_dict(data)
42 | 
43 |         Nij = np.zeros([vocabulary_size, vocabulary_size])
44 |         for i in range(vocabulary_size):
45 |             for j in range(vocabulary_size):
46 |                 Nij[i,j] += cooccur[i][j]
47 |         Ni = np.sum(Nij, axis=1)
48 |         tot = np.sum(Nij)
49 |         with warnings.catch_warnings():
50 |             """log(0) is going to throw warnings, but we will deal with it."""
51 |             warnings.filterwarnings("ignore")
52 |             Pij = Nij / tot
53 |             Pi = Ni / np.sum(Ni)
54 |             PMI = np.log(Pij) - np.log(np.outer(Pi, Pi))
55 |             PMI[np.isinf(PMI)] = 0  # pairs that never co-occur give log(0) = -inf; treat them as PMI 0
56 |             PMI[np.isnan(PMI)] = 0
57 |             PPMI = PMI.copy()
58 |             PPMI[PPMI < 0] = 0  # positive PMI: clip negative entries to zero
59 |         return PPMI
60 | 
61 | 
--------------------------------------------------------------------------------
/matrix/signal_matrix.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | 
5 | try:
6 |     import cPickle as pickle
7 | except ImportError:
8 |     import pickle
9 | import matplotlib
10 | matplotlib.use('Agg')
11 | import matplotlib.pyplot as plt
12 | import numpy as np
13 | import os
14 | 
15 | from six.moves import xrange  # pylint: disable=redefined-builtin
16 | 
17 | 
18 | class SignalMatrix():
19 |     def __init__(self, corpus=None):
20 |         self.corpus = corpus
21 |         self._param_dir = "params/{}".format(self.__class__.__name__)
22 |         if not os.path.exists(self._param_dir):
23 |             os.makedirs(self._param_dir)
24 |         self._get_vocab_size()
25 | 
26 |     def _get_vocab_size(self):
27 |         """ words are {0, 1, ..., n_words - 1}"""
28 |         vocabulary_size = 1
29 |         for idx, center_word_id in enumerate(self.corpus):
30 |             if center_word_id + 1 > vocabulary_size:
31 |                 vocabulary_size = center_word_id + 1
32 |         self.vocabulary_size = vocabulary_size
33 |         print("vocabulary_size={}".format(self.vocabulary_size))
34 | 
35 |     @property
36 |     def param_dir(self):
37 |         return self._param_dir
38 | 
39 |     def estimate_signal(self, enable_plot=False):
40 |         matrix = self.construct_matrix(self.corpus)
41 |         self.matrix = matrix
42 |         U, D, V = np.linalg.svd(matrix)
43 |         if enable_plot:
44 |             plt.plot(D)
45 |             plt.savefig('{}/sv.pdf'.format(self._param_dir))
46 |             plt.close()
47 |         self.spectrum = D
48 |         with open("{}/sv.pkl".format(self._param_dir), "wb") as f:
49 |             pickle.dump(self.spectrum, f)
50 | 
51 |     def estimate_noise(self):
52 |         data_len = len(self.corpus)
53 |         data_1 = self.corpus[:data_len // 2]
54 |         data_2 = self.corpus[data_len // 2:]
55 |         matrix_1 = self.construct_matrix(data_1)
56 |         matrix_2 = self.construct_matrix(data_2)
57 |         diff = matrix_1 - matrix_2
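58 |         # A sketch of the scaling (editorial note): each half-corpus estimate carries sqrt(2)
59 |         # times the noise std of the full-corpus estimate, and differencing two independent
60 |         # estimates doubles the variance again, so sigma is approximately std(diff) / 2.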
61 |         self.noise = np.std(diff) * 0.5
62 | 
63 |     def export_estimates(self):
64 |         with open("{}/estimates.yml".format(self._param_dir), "w") as f:
65 |             f.write("lambda: {}\n".format("sv.pkl"))
66 |             f.write("sigma: {}\n".format(self.noise))
67 |             f.write("alpha: {}\n".format(0.5))  # symmetric factorization: each factor takes D ** 0.5
68 | 
69 | 
70 |     def construct_matrix(self, data):
71 |         raise NotImplementedError
72 | 
73 | 
74 | 
--------------------------------------------------------------------------------
/matrix/signal_matrix_factory.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | 
5 | from matrix.word2vec_matrix import Word2VecMatrix
6 | from matrix.glove_matrix import GloVeMatrix
7 | from matrix.ppmi_lsa_matrix import LSAMatrix
8 | 
9 | class SignalMatrixFactory():
10 |     def __init__(self, corpus):
11 |         self.corpus = corpus
12 | 
13 |     def produce(self, algo):
14 |         if algo == "word2vec":
15 |             return Word2VecMatrix(self.corpus)
16 |         elif algo == "glove":
17 |             return GloVeMatrix(self.corpus)
18 |         elif algo == "lsa":
19 |             return LSAMatrix(self.corpus)
20 |         else:
21 |             raise NotImplementedError("unknown algorithm: {}".format(algo))
22 | 
23 | 
24 | 
--------------------------------------------------------------------------------
/matrix/word2vec_matrix.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | 
5 | import collections
6 | import numpy as np
7 | import warnings
8 | 
9 | from matrix.signal_matrix import SignalMatrix
10 | 
11 | class Word2VecMatrix(SignalMatrix):
12 | 
13 |     def inject_params(self, kwargs):
14 |         self._params = kwargs
15 |         if "skip_window" not in self._params:
16 |             self._params["skip_window"] = 5
17 |         if "neg_samples" not in self._params:
18 |             self._params["neg_samples"] = 1
19 |         self.check_params()
20 | 
21 |     def check_params(self):
22 |         if isinstance(self._params["skip_window"], int) and self._params["skip_window"] > 0:
23 |             pass
24 |         else:
25 |             raise ValueError("skip_window must be a positive integer")
26 |         if isinstance(self._params["neg_samples"], int) and self._params["neg_samples"] > 0:
27 |             pass
28 |         else:
29 |             raise ValueError("neg_samples must be a positive integer")
30 | 
31 |     def build_cooccurance_dict(self, data):
32 |         skip_window = self._params["skip_window"]
33 |         vocabulary_size = self.vocabulary_size
34 |         cooccurance_count = collections.defaultdict(collections.Counter)
35 |         for idx, center_word_id in enumerate(data):
36 |             if center_word_id + 1 > vocabulary_size:
37 |                 vocabulary_size = center_word_id + 1
38 |             for i in range(max(idx - skip_window, 0), min(idx + skip_window + 1, len(data))):
39 |                 cooccurance_count[center_word_id][data[i]] += 1
40 |             cooccurance_count[center_word_id][center_word_id] -= 1  # remove the self-co-occurrence counted above
41 |         return cooccurance_count, vocabulary_size
42 | 
43 | 
44 |     def construct_matrix(self, data):
45 |         cooccur, vocabulary_size = self.build_cooccurance_dict(data)
46 |         k = self._params["neg_samples"]
47 | 
48 |         Nij = np.zeros([vocabulary_size, vocabulary_size])
49 |         for i in range(vocabulary_size):
50 |             for j in range(vocabulary_size):
51 |                 Nij[i,j] += cooccur[i][j]
52 |         Ni = np.sum(Nij, axis=1)
53 |         tot = np.sum(Nij)
54 |         with warnings.catch_warnings():
55 |             """log(0) is going to throw warnings, but we will deal with it."""
56 |             warnings.filterwarnings("ignore")
57 |             Pij = Nij / tot
58 |             Pi = Ni / np.sum(Ni)
59 |             # c.f. Neural Word Embedding as Implicit Matrix Factorization, Levy & Goldberg, 2014
60 |             PMI = np.log(Pij) - np.log(np.outer(Pi, Pi)) - np.log(k)  # shifted PMI (SPMI)
61 |             PMI[np.isinf(PMI)] = 0
62 |             PMI[np.isnan(PMI)] = 0
63 |         return PMI
64 | 
65 | 
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | # Requirements automatically generated by pigar.
2 | # https://github.com/damnever/pigar
3 | 
4 | # main.py: 7
5 | # matrix/PIP_loss_calculator.py: 13
6 | PyYAML == 3.13
7 | 
8 | # matrix/PIP_loss_calculator.py: 7,9
9 | # matrix/signal_matrix.py: 6,8
10 | matplotlib == 2.1.2
11 | 
12 | # main.py: 9,10
13 | # matrix/glove_matrix.py: 9
14 | # matrix/ppmi_lsa_matrix.py: 11
15 | # matrix/signal_matrix_factory.py: 5,6,7
16 | # matrix/word2vec_matrix.py: 9
17 | # NOTE: the "matrix" imports above are this repo's local package, not a PyPI distribution; do not pip-install "matrix".
18 | 
19 | # test/test_tokenizer.py: 5
20 | mock == 2.0.0
21 | 
22 | # matrix/PIP_loss_calculator.py: 10
23 | # matrix/glove_matrix.py: 6
24 | # matrix/ppmi_lsa_matrix.py: 6
25 | # matrix/signal_matrix.py: 9
26 | # matrix/word2vec_matrix.py: 6
27 | numpy == 1.14.2
28 | 
29 | # matrix/ppmi_lsa_matrix.py: 9
30 | # matrix/signal_matrix.py: 12
31 | six == 1.10.0
32 | 
33 | # utils/reader.py: 7
34 | tensorflow == 1.5.0
--------------------------------------------------------------------------------
/test/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ziyin-dl/word-embedding-dimensionality-selection/cec3fb03deec9fb7117f1fb6a29d725ee8682414/test/__init__.py
--------------------------------------------------------------------------------
/test/test_tokenizer.py:
--------------------------------------------------------------------------------
1 | import unittest
2 | import utils.tokenizer as tokenizer
3 | import utils.reader as reader
4 | 
5 | from mock import patch, mock_open
6 | 
7 | class TestRawTextReader(unittest.TestCase):
8 |     def setUp(self, string="a b C d Ef G"):
9 |         self._string = string
10 |         self._reader = reader.RawTextReader()
11 | 
12 |     def test_read(self):
13 |         with patch('utils.reader.open', mock_open(read_data=self._string), create=True):
14 |             data = self._reader.read_data('fake_test.txt')
15 |         self.assertEqual(self._string, data)
16 | 
17 | class TestTokenizer(unittest.TestCase):
18 |     def setUp(self, string="a b C d Ef G"):
19 |         self._expected_tokens = ["a", "b", "c", "d", "ef", "g"]
20 |         self._expected_counts = {"a": 1, "b": 1, "c": 1, "d": 1, "ef": 1, "g": 1}
21 |         self._string = string
22 |         self._tokenizer = tokenizer.SimpleTokenizer()
23 | 
24 |     def test_tokenizer(self):
25 |         ret = self._tokenizer.tokenize(self._string)
26 |         self.assertEqual(self._expected_tokens, ret)
27 | 
28 |     def test_index(self):
29 |         tokens = self._tokenizer.tokenize(self._string)
30 |         dic, rev_dic = self._tokenizer.frequency_count(tokens, 10000, 1)  # min_count=1 keeps every token
31 |         ret = self._tokenizer.index(tokens, dic)
32 |         expected_indices = [dic[w] for w in tokens]
33 |         self.assertEqual(expected_indices, ret)
34 | 
35 | if __name__ == "__main__":
36 |     unittest.main()
37 | 
--------------------------------------------------------------------------------
/utils/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ziyin-dl/word-embedding-dimensionality-selection/cec3fb03deec9fb7117f1fb6a29d725ee8682414/utils/__init__.py
--------------------------------------------------------------------------------
/utils/reader.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | 
5 | import zipfile
6 | try:
7 |     import cPickle as pickle
8 | except ImportError:
9 |     import pickle
10 | import tensorflow as tf
11 | 
12 | class ZipFileReader():
13 |     def __init__(self):
14 |         pass
15 | 
16 |     def read_data(self, filename):
17 |         """Extract the first file enclosed in a zip file as a string."""
18 |         with zipfile.ZipFile(filename) as f:
19 |             data = tf.compat.as_str(f.read(f.namelist()[0]))
20 |         return data
21 | 
22 | class RawTextReader():
23 |     def __init__(self):
24 |         pass
25 | 
26 |     def read_data(self, filename):
27 |         with open(filename, 'r') as f:
28 |             data = f.read()
29 |         return data
30 | 
31 | class ReaderFactory():
32 |     def __init__(self):
33 |         pass
34 | 
35 |     @classmethod
36 |     def produce(cls, filetype):
37 |         if filetype == "zip":
38 |             return ZipFileReader()
39 |         else:
40 |             return RawTextReader()
41 | 
42 | 
--------------------------------------------------------------------------------
/utils/tokenizer.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | from __future__ import division
3 | from __future__ import print_function
4 | 
5 | import collections
6 | try:
7 |     import cPickle as pickle
8 | except ImportError:
9 |     import pickle
10 | from multiprocessing import Pool
11 | 
12 | def _lower(s):
13 |     return s.lower()
14 | 
15 | class SimpleTokenizer():
16 |     def __init__(self):
17 |         pass
18 | 
19 |     def tokenize(self, data):
20 |         """data: str"""
21 |         data = data.replace('\n', ' ').replace('\r', '')
22 |         splitted = data.split(' ')
23 |         pool = Pool()
24 |         tokenized = pool.map(_lower, splitted)
25 |         return tokenized
26 | 
27 |     # min_count and n_words together determine the vocabulary, and whether an UNK token is needed
28 |     def frequency_count(self, tokenized_data, n_words, min_count):
29 |         count = [['UNK', -1]]
30 |         num_above_threshold = 0
31 |         counter = collections.Counter(tokenized_data)
32 |         for k, v in counter.items():
33 |             if v >= min_count:
34 |                 num_above_threshold += 1
35 |         n_words = min(n_words, num_above_threshold)
36 |         # if more tokens than needed, map the rest to UNK
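37 |         # (editorial note) n_words was capped at num_above_threshold above, so the first branch
38 |         # keeps only words with count >= min_count; the else-branch fires only when every word
39 |         # clears the threshold and fits in the budget, in which case no UNK token is needed.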
40 |         if len(counter) > n_words:
41 |             count.extend(counter.most_common(n_words - 1))
42 |         else:
43 |             count = counter.most_common(n_words)
44 |         dictionary = dict()
45 |         for word, _ in count:
46 |             dictionary[word] = len(dictionary)
47 |         reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
48 |         return dictionary, reversed_dictionary
49 | 
50 |     def index(self, tokenized_data, dictionary):
51 |         """map tokens to ids; out-of-vocabulary words map to the UNK id"""
52 | 
53 |         def _index(word):
54 |             if word in dictionary:
55 |                 index = dictionary[word]
56 |             else:
57 |                 index = dictionary['UNK']
58 |             return index
59 | 
60 |         data = [_index(word) for word in tokenized_data]
61 |         return data
62 | 
63 |     def do_index_data(self, data, n_words=10000, min_count=100):
64 |         """transform a corpus string into a list of token ids in {0, 1, ..., n_words - 1}"""
65 |         self.tokenized = self.tokenize(data)
66 |         self.dictionary, self.reversed_dictionary = self.frequency_count(self.tokenized, n_words, min_count)
67 |         self.indexed = self.index(self.tokenized, self.dictionary)
68 |         return self.indexed
69 | 
--------------------------------------------------------------------------------