├── .gitignore ├── requirements.txt ├── run.sh ├── README.md ├── run_discover.py ├── discover_utils.py └── discoverer.py /.gitignore: -------------------------------------------------------------------------------- 1 | .idea 2 | __pycache__ 3 | lab.py 4 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pandas 2 | numpy 3 | jieba 4 | scipy 5 | nltk 6 | -------------------------------------------------------------------------------- /run.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | python run_discover.py "G:\Documents\Exp Data\CCF_sogou_2016\sogouu8.txt" "G:\Documents\Exp Data\CCF_sogou_2016\reports" --latin 50 0 0 0 --bigram 20 80 0 1.5 --unigram_2 20 40 0 1 --unigram_3 20 41 0 1 --iteration 2 --verbose 2 -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # New Words Discovery 2 | 3 | ## Setup 4 | 5 | Requirements: 6 | 7 | ``` 8 | pandas 9 | numpy 10 | jieba 11 | scipy 12 | nltk 13 | ``` 14 | 15 | This implementation tries to discover four types of new words based on four parameters. 16 | 17 | Four types of new words: 18 | 19 | 1. Latin words, including: 20 | 21 | 1. pure digits (2333, 12315, 12306) 22 | 23 | 2. pure letters (iphone, vivo) 24 | 25 | 3. a mixture of both (iphone7, mate9) 26 | 27 | 2. 2-Chinese-character unigrams (unigrams are defined as the elements produced by the segmenter): 28 | 29 | (马蓉,优酷,杨洋) 30 | 31 | 3. 3-Chinese-character unigrams: 32 | 33 | (李易峰,张一山,井柏然) 34 | 35 | 4. bigrams, which are composed of two unigrams: 36 | 37 | (图片大全,英雄联盟,公交车路线,穿越火线) 38 | 39 | 40 | Four parameters: 41 | 42 | 1. term frequency (tf): The number of occurrences of a word. A larger `tf` gives higher confidence in the following three parameters. 43 | 44 | 2. aggregation coefficient: A larger `agg_coef` indicates that the two words are more likely to co-occur (i.e., their co-occurrence is less likely to be coincidental). 45 | 46 | `agg_coef(w_1, w_2) = P(w_1, w_2) / (P(w_1) * P(w_2)) = (C(w_1, w_2) / N) / ((C(w_1) / N) * (C(w_2) / N))` 47 | 48 | where `C(w_1, w_2)` indicates the count of the pattern in which `w_1` is immediately followed by `w_2`. 49 | 50 | `C(w_1)` and `C(w_2)` indicate the counts of `w_1` and `w_2` respectively, and `N` is the total number of bigrams in the corpus (the normalizing constant). 51 | 52 | 3. minimum neighboring entropy 53 | 54 | 4. maximum neighboring entropy 55 | 56 | The minimum and maximum neighboring entropy are the minimum and maximum of the left neighboring entropy and the right neighboring entropy respectively. 57 | 58 | A larger neighboring entropy of a word `w` indicates that `w` collocates with more possible words, which in turn indicates that `w` is an independent word. For instance, "我是" has a large `tf` and a large `agg_coef` but a small minimum neighboring entropy, so it is not a word. 59 | 60 | left entropy: 61 | 62 | `left_entropy(w) = - Σ_{w_l ∈ W_l} P(w_l | w) * log2 P(w_l | w)` 63 | 64 | where `W_l` is the set of unigrams that appear to the left of the word `w`, and `P(w_l | w)` is the probability that `w_l` appears immediately to the left of `w`. The same formula also applies to the right neighboring entropy.
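To make these statistics concrete, here is a minimal, self-contained sketch (not part of this package; the names `docs`, `agg_coef` and `neighboring_entropy` are illustrative only) that computes the aggregation coefficient and the left/right neighboring entropy on a toy segmented corpus, mirroring the bigram formulas used in `discoverer.py`:

```python
from collections import Counter
from math import log2

import nltk

# A toy segmented corpus: each document is a list of unigrams.
docs = [['我', '是', '学生'], ['我', '是', '老师'], ['他', '是', '学生'], ['英雄', '联盟', '图片', '大全']]

unigram_counter = Counter(u for doc in docs for u in doc)
bigram_counter = Counter(b for doc in docs for b in nltk.bigrams(doc))
N = sum(bigram_counter.values())  # normalizing constant

def agg_coef(w1, w2):
    """P(w1, w2) / (P(w1) * P(w2)), estimated from counts."""
    return (bigram_counter[(w1, w2)] / N) / ((unigram_counter[w1] / N) * (unigram_counter[w2] / N))

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts)

def neighboring_entropy(w):
    """Left and right neighboring entropy of `w`, computed from the bigram counts."""
    left = [count for (a, b), count in bigram_counter.items() if b == w]
    right = [count for (a, b), count in bigram_counter.items() if a == w]
    return entropy(left), entropy(right)

print(agg_coef('英雄', '联盟'))    # large: in this toy corpus the two unigrams only ever occur together
print(neighboring_entropy('是'))   # non-zero on both sides: '是' collocates with several different neighbors
```

On a real corpus the same quantities are computed over jieba's segmentation output, and a candidate is reported as a new word only if all four thresholds are met.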
65 | 66 | ## Usage 67 | 68 | An example invocation (note that the double quotes cannot be omitted if the path you provide contains spaces): 69 | 70 | ``` 71 | python run_discover.py "G:\Documents\Exp Data\CCF_sogou_2016\sogouu8.txt" "G:\Documents\Exp Data\CCF_sogou_2016\reports" --latin 50 0 0 0 --bigram 20 80 0 1.5 --unigram_2 20 40 0 1 --unigram_3 20 41 0 1 --iteration 2 --verbose 2 72 | ``` 73 | 74 | Run 75 | 76 | ``` 77 | python run_discover.py --help 78 | ``` 79 | 80 | for further information and help. 81 | 82 | Each iteration includes the following 11 steps: 83 | 84 | 1. cutting (word segmentation) 85 | 2. counting characters 86 | 3. counting unigrams 87 | 4. counting bigrams 88 | 5. counting trigrams 89 | 6. calculating aggregation coefficients (for unigrams) 90 | 7. counting neighboring words (for unigrams) 91 | 8. calculating boundary entropy (for unigrams) 92 | 9. calculating aggregation coefficients (for bigrams) 93 | 10. counting neighboring words (for bigrams) 94 | 11. calculating boundary entropy (for bigrams) 95 | 96 | After each iteration, you will get four files reporting the new words of type latin, 2-Chinese-character unigram, 3-Chinese-character unigram and bigram respectively. After the program exits, you will additionally get four files, each of which merges one type of new words across all iterations. A programmatic equivalent of this discovery loop is sketched at the end of this README. 97 | 98 | If you encounter any problems, feel free to open an issue or contact me (rayarrow@qq.com). 99 | 100 | 101 | ====================================分隔线================================ 102 | 103 | # 新词发现 104 | 105 | 根据四个参数发现四种类型的新词。 106 | 107 | 四种类型的新词: 108 | 109 | 1. 拉丁词,包括: 110 | 111 | 1. 纯数字 (2333, 12315, 12306) 112 | 113 | 2. 纯字母 (iphone, vivo) 114 | 115 | 3. 数字字母混合 (iphone7, mate9) 116 | 117 | 2. 两个中文字符的unigram (unigrams被定义为分词器产生的元素): 118 | 119 | (马蓉,优酷,杨洋) 120 | 121 | 3. 三个中文字符的unigram: 122 | 123 | (李易峰,张一山,井柏然) 124 | 125 | 4. bigrams, 每个bigram由两个unigram组成: 126 | 127 | (图片大全,英雄联盟,公交车路线,穿越火线) 128 | 129 | 130 | 四个参数: 131 | 132 | 1. 词频 (tf): 一个词出现的次数。词频越大,表明下面三个参数的置信度越高。 133 | 134 | 2. 凝聚系数: 凝聚系数越大表明两个(字)词共同出现的概率越大(越不是偶然)。 135 | 136 | `agg_coef(w_1, w_2) = P(w_1, w_2) / (P(w_1) * P(w_2)) = (C(w_1, w_2) / N) / ((C(w_1) / N) * (C(w_2) / N))` 137 | 138 | 其中`C(w_1, w_2)`是词`w_1`后面紧跟着`w_2`出现的次数。 139 | 140 | `C(w_1)`和`C(w_2)`是词`w_1`和`w_2`分别出现的次数,`N`是语料中bigram的总数(归一化常数)。 141 | 142 | 3. 最小边界信息熵 143 | 144 | 4. 最大边界信息熵 145 | 146 | 最小和最大边界信息熵分别是左边界信息熵和右边界信息熵二者的最小值和最大值。 147 | 148 | 边界信息熵越大,表明一个词越能和更多词搭配,进而表明一个词是一个独立词。比如"我是"拥有大词频和大凝聚系数但是最小边界信息熵却很小,说明它不是一个词。 149 | 150 | 左边界信息熵: 151 | 152 | `left_entropy(w) = - Σ_{w_l ∈ W_l} P(w_l | w) * log2 P(w_l | w)` 153 | 154 | 其中`W_l`是出现在`w`左边的所有unigram组成的集合,`P(w_l | w)`是`w_l`紧邻出现在`w`左边的条件概率。上面的公式同样适用于右边界信息熵的计算。 155 | 156 | ## How-to 157 | 158 | 其中一个运行示例(注意如果路径中有空格那么两端的双引号不可省略): 159 | 160 | ``` 161 | python run_discover.py "G:\Documents\Exp Data\CCF_sogou_2016\sogouu8.txt" "G:\Documents\Exp Data\CCF_sogou_2016\reports" --latin 50 0 0 0 --bigram 20 80 0 1.5 --unigram_2 20 40 0 1 --unigram_3 20 41 0 1 --iteration 2 --verbose 2 162 | ``` 163 | 164 | 运行 165 | 166 | ``` 167 | python run_discover.py --help 168 | ``` 169 | 170 | 来获取更多帮助。 171 | 172 | 每次迭代包含以下11个步骤: 173 | 174 | 1. cutting (word segmentation) 175 | 2. counting characters 176 | 3. counting unigrams 177 | 4. counting bigrams 178 | 5. counting trigrams 179 | 6. calculating aggregation coefficients (for unigrams) 180 | 7. counting neighboring words (for unigrams) 181 | 8. calculating boundary entropy (for unigrams) 182 | 9. calculating aggregation coefficients (for bigrams) 183 | 10. counting neighboring words (for bigrams) 184 | 11. calculating boundary entropy (for bigrams) 185 | 186 | 每次迭代之后会产生4个文件,分别报告拉丁新词、两个中文字符的unigram新词、三个中文字符的unigram新词和bigram新词。程序运行结束后,你会额外得到4个文件,每个文件是一个类型的新词,由之前每次迭代的结果综合而成。 187 | 188 | 如果遇到任何问题,欢迎提出issue或者联系我 (rayarrow@qq.com)。
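For readers who prefer to drive the discovery loop from their own script instead of `run_discover.py`, the core of one run looks roughly like the sketch below. It is a condensed, illustrative version of what `run_discover.py` does; the corpus path, the output directory, the iteration count and the threshold values are placeholders taken from the example command above and should be replaced with your own.

```python
import os
from collections import namedtuple

import jieba

from discover_utils import generate_report, load_dictionary, load_lines_of_documents
from discoverer import Discoverer

# Placeholder paths -- point these at your own corpus, dictionary and report directory.
documents, corpus_name = load_lines_of_documents('corpus.txt')
dictionary = load_dictionary(os.path.join(os.path.dirname(jieba.__file__), 'dict.txt'))
output_home = 'reports'
os.makedirs(output_home, exist_ok=True)

# Thresholds: (tf, agg_coef, max_entropy, min_entropy) for each word type.
threshold_parameter = namedtuple('threshold_parameter', ['tf', 'agg_coef', 'max_entropy', 'min_entropy'])
thresholds = {
    'latin': threshold_parameter(50, 0, 0, 0),
    'bigram': threshold_parameter(20, 80, 0, 1.5),
    2: threshold_parameter(20, 40, 0, 1),
    3: threshold_parameter(20, 41, 0, 1),
}

discoverer = Discoverer(save_segmentation=False)
for iteration in range(2):
    # Count grams and compute the statistics, then drop words already in the dictionary.
    discoverer.fit(documents, corpus_name + ' [{}]'.format(iteration + 1))
    discoverer.get_new_unigrams(dictionary)
    # Apply the thresholds, write the per-iteration reports and collect the surviving words.
    new_words, stats = generate_report(output_home, discoverer.new_unigram_stats, discoverer.bigram_stats,
                                       thresholds, corpus_name=corpus_name, iteration=iteration + 1, verbose=2)
    # Feed the new words back into the dictionary and the segmenter for the next iteration.
    dictionary += new_words
    for word in new_words:
        jieba.add_word(word)
```

Note that this sketch skips the overall-report concatenation that `run_discover.py` performs after the loop.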
189 | -------------------------------------------------------------------------------- /run_discover.py: -------------------------------------------------------------------------------- 1 | # Created by Zhao Xinwei. 2 | # 2017.05.??. 3 | # Used to load the corpora and execute the new word discovery algorithm. 4 | 5 | from argparse import ArgumentParser 6 | from collections import namedtuple 7 | 8 | import jieba 9 | 10 | from discover_utils import * 11 | from discoverer import Discoverer 12 | 13 | 14 | # Default thresholds for a corpus of about 4,000 lines. 15 | default_latin = [10, 0, 0, 0] 16 | default_bigram = [10, 50, 0, 1] 17 | default_unigram2 = [10, 2, 0, 1] 18 | default_unigram3 = [10, 2, 0, 1] 19 | default_iteration = 2 20 | default_verbose = 0 21 | 22 | arg_parser = ArgumentParser('New Words Discovery', 23 | description='Discover new words from a corpus according to term frequency, aggregation coefficient, min neighboring entropy and max neighboring entropy.') 24 | 25 | arg_parser.add_argument('input_path', 26 | help='The path to the corpus. It should be a plain text file or a dir containing only plain text files.') 27 | arg_parser.add_argument('output_path', help='The path to generate the reports.') 28 | arg_parser.add_argument('--dictionary_path', default=os.path.join(os.path.dirname(jieba.__file__), 'dict.txt'), 29 | help='The path to the dictionary (text), each line of which contains item, POS-tag and frequency, separated by spaces. Terms that satisfy the filter conditions but are already in the dictionary are not considered new words.') 30 | arg_parser.add_argument('--latin', nargs=4, default=default_latin, type=int, 31 | help='The four values are the thresholds for term frequency, aggregation coefficient, max neighboring entropy and min neighboring entropy, which also applies to --bigram, --unigram_2 and --unigram_3. This argument sets the thresholds for latin words, including pure digits, pure letters and combinations of letters and digits such as "iphone7".') 32 | arg_parser.add_argument('--bigram', nargs=4, default=default_bigram, type=float, 33 | help='Bigrams are defined as words that are composed of two unigram terms. Refer to --latin for further help.') 34 | arg_parser.add_argument('--unigram_2', nargs=4, default=default_unigram2, type=float, 35 | help='A term which is composed of two Chinese characters and cannot be divided into other words. Refer to --latin for further help.') 36 | arg_parser.add_argument('--unigram_3', nargs=4, default=default_unigram3, type=float, 37 | help='A term which is composed of three Chinese characters and cannot be divided into other words. Refer to --latin for further help.') 38 | arg_parser.add_argument('--iteration', default=default_iteration, type=int, 39 | help='The number of iterations to run. Each iteration uses the original dictionary plus the new words discovered in the previous iterations as its dictionary.') 40 | arg_parser.add_argument('--verbose', default=default_verbose, choices=[0, 1, 2], type=int, 41 | help="Determines the verbosity of the reports. *** 0: only new word items and their term frequency. *** 1: min neighboring entropy and max neighboring entropy are supplemented.
*** 2: left and right neighboring entropy are added.") 42 | args = arg_parser.parse_args() 43 | 44 | documents, corpus_name = load_lines_of_documents(args.input_path) 45 | 46 | output_home = join(args.output_path, corpus_name) 47 | if not os.path.exists(output_home): 48 | logger.info('Output path does not exist; creating it.') 49 | os.makedirs(output_home) 50 | 51 | threshold_parameter = namedtuple('threshold_parameter', ['tf', 'agg_coef', 'max_entropy', 'min_entropy']) 52 | threshold_parameters = dict() 53 | 54 | threshold_parameters['bigram'] = threshold_parameter(*args.bigram) 55 | threshold_parameters['latin'] = threshold_parameter(*args.latin) 56 | threshold_parameters[2] = threshold_parameter(*args.unigram_2) 57 | threshold_parameters[3] = threshold_parameter(*args.unigram_3) 58 | 59 | dictionary = load_dictionary(args.dictionary_path) 60 | 61 | discoverer = Discoverer(save_segmentation=False) 62 | 63 | # Used to store stats generated in each iteration. 64 | stats_ind = list() 65 | 66 | import time 67 | 68 | for iteration in range(args.iteration): 69 | time.sleep(1) 70 | logger.info(""" 71 | ********************************************************************** 72 | 73 | commencing iteration {}... 74 | 75 | ********************************************************************** 76 | """.format(iteration + 1)) 77 | discoverer.fit(documents, corpus_name + ' [{}]'.format(iteration + 1)) 78 | discoverer.get_new_unigrams(dictionary) 79 | 80 | # Add new words to the `dictionary`. 81 | new_words, current_stats = generate_report(output_home, discoverer.new_unigram_stats, discoverer.bigram_stats, 82 | threshold_parameters, corpus_name=corpus_name, iteration=iteration + 1, 83 | verbose=args.verbose) 84 | dictionary += new_words 85 | stats_ind.append(current_stats) 86 | for each_new_word in new_words: 87 | jieba.add_word(each_new_word) 88 | 89 | # Output complete reports with the results of each iteration concatenated. 90 | by = 'tf' 91 | overall_latin_new_unigram_stats = pd.concat( 92 | [each_stats['latin'] for each_stats in stats_ind]).sort_values(by=by, ascending=False) 93 | overall_new_bigrams_stats = pd.concat( 94 | [each_stats['bigram'] for each_stats in stats_ind]).sort_values(by=by, ascending=False) 95 | output_stats(join(output_home, 'overall_latin.csv'), overall_latin_new_unigram_stats) 96 | output_stats(join(output_home, 'overall_bigrams.csv'), overall_new_bigrams_stats) 97 | 98 | for each_length in stats_ind[0]['chinese_unigram']: 99 | # ==================================================================================================== 100 | # ==================================================================================================== 101 | 102 | overall_chinese_sub_unigrams_verbose = pd.concat( 103 | [each_stats['chinese_unigram'][each_length] for each_stats in stats_ind]).sort_values(by=by, 104 | ascending=False) 105 | output_stats(join(output_home, 'overall_chinese_unigrams@{}.csv'.format(each_length)), 106 | overall_chinese_sub_unigrams_verbose) -------------------------------------------------------------------------------- /discover_utils.py: -------------------------------------------------------------------------------- 1 | # Created by Zhao Xinwei. 2 | # 2017.05.04. 3 | # Auxiliary functions are implemented here to facilitate loading, filtering, logging and report output.
4 | 5 | import logging 6 | import os 7 | import re 8 | import sys 9 | from ast import literal_eval 10 | from collections import defaultdict 11 | from os.path import join, splitext 12 | 13 | import numpy as np 14 | import pandas as pd 15 | 16 | # Default thresholds for stats columns. 17 | # threshold_parameter = namedtuple('threshold_parameter', ['tf', 'agg_coef', 'max_entropy', 'min_entropy']) 18 | # threshold_parameters = dict() 19 | # threshold_parameters[0] = threshold_parameter(100, 2500, 0, 3) 20 | # threshold_parameters[2] = threshold_parameter(100, 60, 0, 2) 21 | # threshold_parameters[3] = threshold_parameter(100, 1000, 0, 2) 22 | 23 | # Match strings that contain at least one Chinese character. 24 | chinese_pattern = re.compile(r'[\u4e00-\u9fa5]') 25 | 26 | # Match strings in which all characters are Chinese. 27 | chinese_string_pattern = re.compile(r'^[\u4e00-\u9fa5]+$') 28 | 29 | # Characters considered to be punctuation. 30 | punctuations = set(',。!?"!、.: ?') 31 | 32 | # Configure the logger. 33 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO, stream=sys.stdout) 34 | logger = logging.getLogger('New Word Discovery') 35 | 36 | 37 | def load_dictionary(path): 38 | logger.info('Loading the dictionary...') 39 | with open(path, 'r', encoding='utf8') as f: 40 | return [line.split()[0] for line in f] 41 | 42 | 43 | def load_lines_of_documents(path): 44 | documents = [] 45 | if not os.path.isdir(path): 46 | with open(path, 'r', encoding='utf8') as f: 47 | documents = [line.strip() for line in f] 48 | else: 49 | for each_file in os.listdir(path): 50 | with open(os.path.join(path, each_file), 'r', encoding='utf8') as f: 51 | documents.extend(line.strip() for line in f) 52 | 53 | return list(set(documents)), os.path.basename(path) 54 | 55 | 56 | def output_ordered_dict(path, an_ordered_dict, encoding='utf8'): 57 | """ 58 | Save an `ordered dict` as a two-column table to `path`. 59 | """ 60 | with open(path, 'w', encoding=encoding) as f: 61 | for each_unigram, count in an_ordered_dict.items(): 62 | f.write('{} \t\t {}\n'.format(each_unigram, count)) 63 | 64 | 65 | def load_stats(path): 66 | """ 67 | Read in a `stats` of type `DataFrame` with `encoding`, `index` and `header` specified. 68 | """ 69 | stats = pd.read_csv(path, sep='\t', encoding='utf8', index_col=0, header=0, na_values=['']) 70 | logger.info(r'The stats {} are successfully loaded'.format(path)) 71 | # If the index entries are not unigrams, convert the `str`-form grams to `tuple` form. 72 | if re.match(r'(?:\(\'.*?\', )+\'.*?\'\)', stats.index[0]): 73 | logger.info(r'The index of {} are not unigrams. Commence the normalization process.'.format(path)) 74 | stats.index = stats.index.map(literal_eval) 75 | logger.info(r'the index of {} normalized.'.format(path)) 76 | return stats 77 | 78 | 79 | def modify_stats_path(path, stats): 80 | """ 81 | Add the specified threshold parameters to the file name. 82 | """ 83 | if stats.index.name is not None: 84 | return stats.index.name.join(splitext(path)) 85 | else: 86 | return path 87 | 88 | 89 | def output_stats(path, stats, preserve_grams=True): 90 | """ 91 | This function does two things on top of the basic `DataFrame.to_csv()` method: 92 | 1. Specify the `float_format` and `encoding` parameters. 93 | 2. If `preserve_grams` is set to `False`, then the x-grams will be concatenated to a complete string. 94 | Example. If set to `False`, then `('王八', '蛋')` will be converted to `'王八蛋'`.
95 | """ 96 | if not preserve_grams and stats.shape[0] and isinstance(stats.index[0], tuple): 97 | stats.index, stats.index.name = stats.index.map(lambda x: ''.join(x)), stats.index.name 98 | stats.to_csv(path, sep='\t', float_format='%.5f', encoding='utf8') 99 | logger.info(r'Writing to `{}` succeed.'.format(path)) 100 | 101 | 102 | # !!!!!!!!!!!!!!!! Note that the entry in a 1_gram is taken as the unigram itself, not the characters that compose it. 103 | def contain_punc(x_gram): 104 | """ 105 | Determine if at least one of the entries in the given x-gram are punctuations. 106 | :return: 107 | """ 108 | return any(map(lambda x: x in punctuations, tuple(x_gram))) 109 | 110 | 111 | def contain_non_chinese(x_gram): 112 | """ 113 | If at least one grams in the `x_gram` contains non-Chinese character, return True. 114 | :return: 115 | """ 116 | return any(map(lambda x: not chinese_string_pattern.match(x), tuple(x_gram))) 117 | 118 | 119 | def no_chinese_at_all(x_gram): 120 | """ 121 | If every entry in the `x_gram` contains no Chinese characters, return True. 122 | :return: 123 | """ 124 | return not any(map(lambda x: chinese_pattern.match(x), tuple(x_gram))) 125 | 126 | 127 | def verbose_logging(content, idx, length, verbose, *other_para): 128 | """ 129 | A helper function to logging. 130 | :param content: the contents to be formatted and logged to the console. 131 | :param idx: the location of the being processed entry, i.e., the progress of the running function. 132 | :param length: the number of entries to be processed. 133 | :param verbose: This field controls the frequency of the logger. The logger log to the console when the process 134 | reaches k * `verbose` quantile. In other words, the logger will log 1/`verbose` times in total. 135 | # Example. When `verbose`=0.02, the logger logs when the currently working function reaches 2%, 4%, 6%, etc. 136 | :param other_para: variables to be printed except for the progress-related variables (idx, length). 137 | :return: 138 | """ 139 | checkpoint = int(length * verbose) 140 | # Prevent division by zero. 141 | if checkpoint == 0: 142 | checkpoint = 1 143 | if not idx % checkpoint: 144 | logger.info(content.format(*other_para, idx, length)) 145 | 146 | 147 | def infer_counter_type(counter): 148 | """ 149 | Given a counter, infer the type of its entries. 150 | Example. The type of entries in `discoverer.unigram_counter` is `unigram`. 151 | """ 152 | counter_type = {1: 'unigram', 153 | 2: 'bigram', 154 | 3: 'trigram'} 155 | if not counter: 156 | return 'Unknown' 157 | else: 158 | return counter_type[len(next(iter(counter)))] 159 | 160 | 161 | def filter_stats(stats, tf=1, agg=0, max_entropy=0, min_entropy=0, verbose=2, by='tf'): 162 | """ 163 | Return a `stats` preserving only the words of which attributes reach the thresholds. 164 | """ 165 | stats = stats.sort_values(by=by, ascending=False) 166 | stats = stats[ 167 | (stats.tf >= tf) & (stats.agg_coef >= agg) & (stats.max_entropy >= max_entropy) & (stats.min_entropy >= min_entropy)] 168 | if verbose == 0: 169 | stats = stats[['tf']] 170 | elif verbose == 1: 171 | stats = stats[['tf', 'agg_coef', 'max_entropy', 'min_entropy']] 172 | elif verbose == 2: 173 | stats = stats[['tf', 'agg_coef', 'max_entropy', 'min_entropy', 'left_entropy', 'right_entropy']] 174 | else: 175 | raise Exception('Invalid `verbose`.') 176 | 177 | # Store the config to its index name field. 
(`pd.DataFrame` has no `name` field) 178 | stats.index.name = '{} # {} # {} # {} # {}'.format(tf, agg, max_entropy, min_entropy, verbose) 179 | return stats 180 | 181 | 182 | def purify_stats(stats, length=2, pattern=r'[.a-zA-Z\u4e00-\u9fa5]', returning_non_pure=False): 183 | """ 184 | Select the rows whose corresponding terms consist of reasonable characters. Refer to the `pure_index` variable below. 185 | On top of that, `NULL` entries are removed here. 186 | :param returning_non_pure: If True, the stats of the unreasonable terms are also returned. 187 | """ 188 | if not stats.shape[0]: 189 | logger.info(r'Empty stats. Nothing done.') 190 | return (stats, stats) if returning_non_pure else stats 191 | # Remove `NULL` entries. 192 | stats = stats[pd.notnull(stats.index)] 193 | 194 | index = stats.index 195 | # If the index entries are not unigrams, concatenate the x-grams into a str. 196 | if not isinstance(index[0], str): 197 | index = index.map(lambda x: ''.join(x)) 198 | 199 | pure_index = (index.str.contains(pattern)) & (index.str.len() >= length) 200 | 201 | if returning_non_pure: 202 | return stats[pure_index], stats[~pure_index] 203 | else: 204 | return stats[pure_index] 205 | 206 | 207 | def decompose_stats(stats): 208 | """ 209 | Decompose the stats into Chinese words and Latin words. 210 | (The stats of Chinese words are further divided into several blocks based on word length in `generate_report()`.) 211 | """ 212 | agg_inf_index = (stats.agg_coef == np.inf) 213 | latin_pure_new_unigram_stats = stats[agg_inf_index] 214 | chinese_pure_new_unigram_stats = stats[~agg_inf_index] 215 | return chinese_pure_new_unigram_stats, latin_pure_new_unigram_stats 216 | 217 | 218 | def generate_report_file_path(output_home, corpus_name, iteration, stats_type): 219 | """ 220 | Compose a human-readable path from a series of parameters. 221 | """ 222 | return join(output_home, 'report_{} [{}]_{}.csv'.format(corpus_name, iteration, stats_type)) 223 | 224 | 225 | def generate_report(output_home, new_unigram_stats, bigram_stats, threshold_parameters, preserve_grams=False, 226 | corpus_name='default_corpus', unigram_max_len=3, verbose=0, iteration=1): 227 | """ 228 | Select the new words based on the given `threshold_parameters`, which are in turn used to generate the reports and 229 | update the dictionary. (The new words are returned so that the caller can update the dictionary outside this function.) 230 | """ 231 | new_words = list() 232 | pure_new_unigram_stats, messy_new_unigram_stats = purify_stats(new_unigram_stats, returning_non_pure=True) 233 | 234 | # Output messy new unigrams. 235 | # messy_new_unigram_stats_verbose_2 = filter_stats(messy_new_unigram_stats) 236 | # output_stats('./output/messy_new_unigram_verbose_2.csv', messy_new_unigram_stats_verbose_2) 237 | 238 | chinese_pure_new_unigram_stats, latin_pure_new_unigram_stats = decompose_stats(pure_new_unigram_stats) 239 | p = threshold_parameters['latin'] 240 | latin_pure_new_unigram_stats = filter_stats(latin_pure_new_unigram_stats, tf=p.tf, agg=p.agg_coef, 241 | max_entropy=p.max_entropy, min_entropy=p.min_entropy, 242 | verbose=verbose) 243 | output_stats(generate_report_file_path(output_home, corpus_name, iteration, 'latin'), 244 | latin_pure_new_unigram_stats) 245 | new_words.extend(list(latin_pure_new_unigram_stats.index)) 246 | 247 | # Generate the reports for Chinese unigrams, grouped by length.
248 | chinese_pure_new_unigram_stats_by_len = chinese_pure_new_unigram_stats.groupby(len) 249 | 250 | chinese_sub_stats_s = defaultdict(lambda: None) 251 | for each_length in sorted(set(chinese_pure_new_unigram_stats.index.map( 252 | lambda x: len(x) if len(x) < unigram_max_len else unigram_max_len))): 253 | p = threshold_parameters[each_length] 254 | chinese_sub_stats = chinese_pure_new_unigram_stats_by_len.get_group(each_length) 255 | chinese_sub_stats = filter_stats(chinese_sub_stats, tf=p.tf, agg=p.agg_coef, max_entropy=p.max_entropy, 256 | min_entropy=p.min_entropy, verbose=verbose) 257 | output_stats( 258 | generate_report_file_path(output_home, corpus_name, iteration, 'chinese_unigrams@{}'.format(each_length)), 259 | chinese_sub_stats) 260 | chinese_sub_stats_s[each_length] = chinese_sub_stats 261 | new_words.extend(list(chinese_sub_stats.index)) 262 | 263 | # Genereate the report for bigrams. 264 | p = threshold_parameters['bigram'] 265 | bigram_stats = filter_stats(bigram_stats, tf=p.tf, agg=p.agg_coef, max_entropy=p.max_entropy, 266 | min_entropy=p.min_entropy, verbose=verbose) 267 | output_stats(generate_report_file_path(output_home, corpus_name, iteration, 'bigram'), bigram_stats, 268 | preserve_grams=preserve_grams) 269 | new_words.extend(list(bigram_stats.index.map(lambda x: ''.join(x)))) 270 | 271 | # return the reports of each invocation of `generate_report()` to comprise a complete report with the result of each 272 | # iteration merged. 273 | return new_words, {'latin': latin_pure_new_unigram_stats, 'chinese_unigram': chinese_sub_stats_s, 274 | 'bigram': bigram_stats} 275 | -------------------------------------------------------------------------------- /discoverer.py: -------------------------------------------------------------------------------- 1 | # Created by Zhao Xinwei. 2 | # 2017 04 27. 3 | import json 4 | from collections import Counter, OrderedDict 5 | 6 | import jieba 7 | import nltk 8 | from scipy.stats import entropy 9 | 10 | from discover_utils import * 11 | 12 | 13 | # TO DO: 14 | # Intended to facilitate the access to the counter name, the parent and child of one specific counter. 15 | # By now, approaches to each field of a counter was substituted by functions, including `__parent()` 16 | # `__infer_counter_type()`, which inconvenience the following calculation. This may come in handy in future. 17 | # gram_counter = namedtuple('gram_counter', ['name', 'counter', 'parent', 'child', 'gram']) 18 | 19 | 20 | class Discoverer(object): 21 | def __init__(self, verbose=0.01, cache_path='./preprocessed', save_segmentation=True): 22 | # When this field is set to `False`, access to the boundary-calculation functions prior to the finish of 23 | # `fit()` will throw an exception. After `fit()` is done, this field will be set to `True`. 24 | self.is_fitted = False 25 | 26 | # This field controls the frequency of the logger. The logger log to the console when the process reaches 27 | # k * `verbose` quantile. In other words, the logger will log 1/`verbose` times in total. 28 | # Example. When `verbose`=0.02, the logger logs when the progress of the currently working function reaches 29 | # 2%, 4%, 6%, etc. 30 | self.verbose = verbose 31 | 32 | # `corpus_name` is intended to prevent confusion and overwritten when you forget to specified the output filenames. 33 | # This field identifies the corpus that this `discoverer` is working on. 34 | # In addition, this `field` will be included in the file name when, say, a `.csv` file is generated. 
35 | self.corpus_name = None 36 | 37 | # The location to save and load the cached segmented documents. 38 | self.cache_path = cache_path 39 | if not os.path.exists(self.cache_path): 40 | os.mkdir(self.cache_path) 41 | 42 | # A flag variable determining whether to cache the segmented documents or not. 43 | # If this flag is set to `False`, the above defined cache_path will be of no use. 44 | self.save_segmentation = save_segmentation 45 | 46 | # A list containing raw documents, with each element as a document. 47 | self.documents = None 48 | self.nr_of_documents = None 49 | 50 | # segmented docs. Identical to `self.documents` except that the documents are converted to a list of words 51 | # comprising them. 52 | self.unigram_docs = None 53 | 54 | # The counters. 55 | # counter entry 56 | # ======= ===== 57 | # char character including Chinese characters, letters and punctuations. 58 | # unigram entries in a `unigram_doc`. The minimum consecutive character blocks identified by the segmenter. 59 | # bigram composed of two consecutive unigram (A tuple containing two items). 60 | # trigram ... 61 | self.char_counter = self.unigram_counter = self.bigram_counter = self.trigram_counter = None 62 | self.nr_of_chars = self.nr_of_unigrams = self.nr_of_bigrams = self.nr_of_trigrams = None 63 | 64 | # 8 columns (excluding the index column): `tf`, `agg`, `max_entropy`, `min_entropy`, `left_entropy` 65 | # `right_entropy`, `left_wc`, 'right_wc`. 66 | self.unigram_stats = self.bigram_stats = None 67 | 68 | # Same to `self.unigram_stats` except that the words that already in the given dictionary are removed. 69 | self.new_unigram_stats = None 70 | 71 | def fit(self, documents, corpus_name='default'): 72 | """ 73 | Everything before generating `.csv` files is done here. For the purpose of each field, refer to `__init()__`. 74 | Once `fit()`, `get_new_unigrams()` and `purify()` are called, all the remaining works are handed over to 75 | `discoverer_utils`. 76 | """ 77 | self.is_fitted = True 78 | self.corpus_name = corpus_name 79 | self.documents = documents 80 | self.nr_of_documents = len(documents) 81 | 82 | # Segment every document. Each element inside the `unigram_docs` is a list of words compromising the 83 | # corresponding document. 84 | self.unigram_docs = self._segment(documents) 85 | 86 | # Count the occurrence of each char. 87 | self.char_counter = self._get_chars(self.documents) 88 | self.nr_of_chars = sum(self.char_counter.values()) 89 | 90 | # Count the occurrence of each word. 91 | self.unigram_counter = self._get_unigrams(self.unigram_docs) 92 | self.nr_of_unigrams = sum(self.unigram_counter.values()) 93 | 94 | # Count the occurrence of each bigram. 95 | self.bigram_counter = self._get_xgrams(self.unigram_docs) 96 | self.nr_of_bigrams = sum(self.bigram_counter.values()) 97 | 98 | # Count the occurrence of each trigram. 99 | self.trigram_counter = self._get_xgrams(self.unigram_docs, tokenizer=nltk.trigrams) 100 | self.nr_of_trigrams = sum(self.trigram_counter.values()) 101 | 102 | self.unigram_stats = self._get_stats(self.unigram_counter) 103 | self.bigram_stats = self._get_stats(self.bigram_counter) 104 | 105 | # # Remove characters, unigrams and bigrams that contain punctuations. 106 | # self._denoise(self.unigram_counter) 107 | # self._denoise(self.bigram_counter) 108 | 109 | def _load_json(self, obj_name): 110 | """ 111 | A helper function that facilitates cache loading process. This function will examine the existence of caches 112 | corresponding to the given corpus, load, and log. 
113 | """ 114 | # The cache file names are of the form "/{root dir}/{obj_name} @# {corpus_name}.json" 115 | ## The former `join()` is `os.path.join()` and the latter one is `str.join()`. 116 | json_path = join(self.cache_path, ' @# '.join([obj_name, self.corpus_name, '.json'])) 117 | if os.path.exists(json_path): 118 | logger.info('`{}` exists. Loading ...'.format(' @# '.join([obj_name, self.corpus_name]))) 119 | with open(json_path, encoding='utf8') as f: 120 | obj = json.load(f) 121 | logger.info('`{}` loaded.'.format(' @# '.join([obj_name, self.corpus_name]))) 122 | return obj 123 | else: 124 | logger.info('`{}` preprocessed does not exist. Get ready to train'.format(obj_name)) 125 | 126 | def _dump_json(self, save_flag, obj, obj_name): 127 | """ 128 | A helper function to facilitates the caching process. 129 | :param save_flag: determine whether to cache or not. 130 | :param obj: the `object` (counter here) to be cached. 131 | :param obj_name: the name of the `obj`. 132 | """ 133 | json_path = join(self.cache_path, ' @# '.join([obj_name, self.corpus_name, '.json'])) 134 | if save_flag and not os.path.exists(json_path): 135 | with open(json_path, 'w', encoding='utf8') as f: 136 | json.dump(obj, f) 137 | logger.info('`{}` dumped'.format(obj_name)) 138 | elif os.path.exists(json_path): 139 | logger.info('`{}` exists. No need to dump.'.format(obj_name)) 140 | 141 | def _segment(self, documents): 142 | """ 143 | Invoke `jieba.lcut()` and return segemented documents 144 | """ 145 | unigram_docs = self._load_json('unigram_docs') 146 | if unigram_docs: 147 | return unigram_docs 148 | unigram_docs = list() 149 | for idx, each_doc in enumerate(documents): 150 | verbose_logging('cutting ... {} / {}', idx, self.nr_of_documents, self.verbose) 151 | unigram_docs.append(jieba.lcut(each_doc)) 152 | self._dump_json(self.save_segmentation, unigram_docs, 'unigram_docs') 153 | return unigram_docs 154 | 155 | def _get_chars(self, documents): 156 | """ 157 | `collections.Counter()` is a subclass of `dict`, which takes an iterable or a `collections.Counter()` and update 158 | itself. 159 | """ 160 | char_counter = Counter() 161 | for idx, each_doc in enumerate(documents): 162 | verbose_logging('counting characters ... {} / {}', idx, self.nr_of_documents, self.verbose) 163 | char_counter.update(each_doc) 164 | return OrderedDict(sorted(char_counter.items(), key=lambda x: x[1], reverse=True)) 165 | 166 | def _get_unigrams(self, unigram_docs): 167 | """ 168 | Return a dict counting the occurrence of each unigram. 169 | :param unigram_docs: 170 | :return: 171 | """ 172 | unigram_counter = Counter() 173 | for idx, each_unigram_doc in enumerate(unigram_docs): 174 | verbose_logging('counting unigrams ... {} / {}', idx, self.nr_of_documents, self.verbose) 175 | unigram_counter.update(each_unigram_doc) 176 | return OrderedDict(sorted(unigram_counter.items(), key=lambda x: x[1], reverse=True)) 177 | 178 | def _get_xgrams(self, unigram_docs, tokenizer=nltk.bigrams): 179 | """ 180 | Return a dict counting the occurrence of each x_gram. When the process is being logged to the console. 181 | The "x" in the `x_gram` is inferred from the `tokenizer`. 182 | """ 183 | x_gram_counter = Counter() 184 | for idx, each_unigram_doc in enumerate(unigram_docs): 185 | verbose_logging('counting {} ... 
{} / {}', idx, self.nr_of_documents, self.verbose, tokenizer.__name__) 186 | x_gram_counter.update(tokenizer((each_unigram_doc))) 187 | return OrderedDict(sorted(x_gram_counter.items(), key=lambda x: x[1], reverse=True)) 188 | 189 | def purify(self, purify_new_unigrams=True, purify_unigrams=False, purify_bigrams=False): 190 | """ 191 | Denoise the `stats`. The detailed denoising rules are specified `purify_stats()`. 192 | I.e., remove numbers ('2016', '4399'), letters ('a', 'b'), special characters ('@') and unreadable characters. 193 | """ 194 | if purify_new_unigrams: 195 | self.new_unigram_stats = purify_stats(self.new_unigram_stats) 196 | logger.info('New unigrams purified. Refer to `new_unigram_stats`.') 197 | # if purify_unigrams: 198 | # self.unigram_stats = purify_unigram_stats(self.unigram_stats) 199 | # logger.info('Unigrams purified. Refer to `unigram_stats`.') 200 | # if purify_bigrams: 201 | # logger.info('Bigrams purified. Refer to `bigram_stats`.') 202 | # self.bigram_stats = purify_unigram_stats(self.bigram_stats) 203 | 204 | # def _denoise(self, counter, filters=(contain_punc, no_chinese_at_all)): 205 | # """ 206 | # Remove characters, unigrams and bigrams that contain punctuations. 207 | # :return: 208 | # """ 209 | # for each_filter in filters: 210 | # x_grams_to_be_removed = list() 211 | # for idx, each_x_grams in enumerate(counter): # `bigrams_counter` is to be replaced. 212 | # verbose_logging('Denoising {} using {} ... {} / {}', idx, len(counter), self.verbose, 213 | # infer_counter_type(counter), each_filter.__name__) 214 | # if each_filter(each_x_grams): 215 | # x_grams_to_be_removed.append(each_x_grams) 216 | # 217 | # for each_x_grams in x_grams_to_be_removed: 218 | # counter.pop(each_x_grams, 'None') 219 | 220 | def _get_stats(self, counter, by='tf'): 221 | """ 222 | Compose a `stats` of type `pandas.DataFrame`, with 8 columns as followed: 223 | `tf`, `aggregation coefficient`, `max_entropy`, `min_entropy`, `left_entropy`, `right_entropy`, 224 | `left_wc`, `right_wc` 225 | """ 226 | counter_aggregation = self._aggregation_coef(counter) 227 | # Convert `counter_aggregation` to a `Series` to facilitate the following concatenation process. 228 | counter_aggregation = pd.Series(counter_aggregation, name='agg_coef') 229 | 230 | # Calculate the boundary entropy. 231 | boundary_entropy = self._get_boundary_stats(counter) 232 | 233 | # Convert the `counter` to a `Series` to facilitate the following concatenation process. 234 | counts = pd.Series(counter, name='tf') 235 | return pd.concat([counts, counter_aggregation, boundary_entropy], axis=1).sort_values(by=by, ascending=False) 236 | 237 | def get_new_unigrams(self, dictionary): 238 | """ 239 | Initialize `new_unigram_counter` and `new_unigram_stats` attributes. 240 | The words already in the dictionary will be filtered. 241 | :param dictionary: An iterable containing words. 242 | """ 243 | logger.info('Getting new words...') 244 | new_words = set(self.unigram_counter) - set(dictionary) 245 | self.new_unigram_counter = OrderedDict( 246 | [(word, self.unigram_counter[word]) for word in self.unigram_counter if word in new_words]) 247 | 248 | self.new_unigram_stats = self.unigram_stats.loc[new_words].sort_values(by='agg_coef', ascending=False) 249 | logger.info('New unigrams gotten. Please refer to `new_unigram_counter and `new_unigram_stats`.') 250 | 251 | def __parent(self, counter): 252 | """ 253 | trigram counter > bigram counter > unigram counter > char counter. 
254 | For example, when you are about to calculate the boundary entropy of a unigram, the bigrams containing that 255 | unigram will greatly facilitate your calculation. In this case, you need to the access the the parent of the 256 | `unigram_counter`, and this is where the function comes in. 257 | :param counter: 258 | :return: 259 | """ 260 | if counter == self.char_counter: 261 | return self.unigram_counter 262 | if counter == self.unigram_counter: 263 | return self.bigram_counter 264 | if counter == self.bigram_counter: 265 | return self.trigram_counter 266 | 267 | def __counter_grams(self, counter): 268 | """ 269 | A helper function to get the number of grams in the `counter`. 270 | Only `unigram_counter`, `bigram_counter` and `trigram_counter` are supported. 271 | """ 272 | if counter == self.unigram_counter: 273 | return 1 274 | if counter == self.bigram_counter: 275 | return 2 276 | if counter == self.trigram_counter: 277 | return 3 278 | else: 279 | raise Exception('Not supported. Refer to `help(__counter_grams)`') 280 | 281 | def _get_boundary_stats(self, counter): 282 | """ 283 | get the boundary statistics of each headword. A boundary stats contains the following columns: 284 | 'max_entropy', 'min_entropy', 'left_entropy', 'right_entropy', 'left_wc', 'right_wc' 285 | """ 286 | if not self.is_fitted: 287 | raise Exception('This model has not been trained') 288 | 289 | left_word_counter, right_word_counter = self._get_boundary_word_counts(counter) 290 | 291 | columns = ['max_entropy', 'min_entropy', 'left_entropy', 'right_entropy', 'left_wc', 'right_wc'] 292 | stats = [] 293 | words = [] 294 | 295 | # Calculate the entropy after the left adjacent words and right adjacent words of each x-gram are gotten. 296 | for idx, each_word in enumerate(counter): 297 | verbose_logging('Calculating boundary entropy ... {} / {}', idx, len(counter), self.verbose) 298 | left_entropy = entropy([count[1] for count in left_word_counter[each_word]], base=2) 299 | right_entropy = entropy([count[1] for count in right_word_counter[each_word]], base=2) 300 | words.append(each_word) 301 | stats.append(( 302 | max(left_entropy, right_entropy), 303 | min(left_entropy, right_entropy), 304 | left_entropy, 305 | right_entropy, 306 | left_word_counter[each_word], 307 | right_word_counter[each_word], 308 | )) 309 | # Name the index. This seems to be of no use, however. 310 | words_index = pd.Index(words, name=('word{}'.format(num + 1) for num in range(self.__counter_grams(counter)))) 311 | return pd.DataFrame(stats, index=words_index, columns=columns).sort_values(by='max_entropy', ascending=False) 312 | 313 | def _get_boundary_word_counts(self, counter): 314 | """ 315 | Get all the left and right adjacent words of each x-gram in the given `counter` and sort them by the frequency 316 | in descending order. 317 | """ 318 | 319 | # A nested dict essentially. 320 | # By default, access to an undefined key in a `dict` will throw an exception. This is where the `defaultdict` 321 | # comes in, which invoke the constructor and fill the accessed key with the object returned. 322 | 323 | # Note that parameters of `defaultdict` must be callable, and this is the reason why the `defaultdict(int)` is 324 | # modified to be `lambda: defaultdict(int)`. `defaultdict(int)` is a uncallable `defaultdict`, while 325 | # `lambda: defaultdict(int)` is a function returning a newly constructed `defaultdict(int)`. 
326 | left_adjacent_word_counter = defaultdict(lambda: defaultdict(int)) 327 | right_adjacent_word_counter = defaultdict(lambda: defaultdict(int)) 328 | 329 | # For the behavior and the motivation, refer to `__parent()`. 330 | parent_counter = self.__parent(counter) 331 | 332 | for idx, each_x_gram in enumerate(parent_counter): 333 | verbose_logging('counting neighboring words ... {} / {}', idx, len(parent_counter), self.verbose) 334 | 335 | # The words in an x_gram ranging from the 0-position to the penultimate position are considered left adjacent words. 336 | # The words in an x_gram ranging from the 1-position to the last position are considered right adjacent words. 337 | head_left, head_right = each_x_gram[:-1], each_x_gram[1:] 338 | 339 | # If the given `counter` is a `unigram counter`, then there's no need to wrap it with a list. 340 | if len(head_left) == 1: 341 | head_left = head_left[0] 342 | head_right = head_right[0] 343 | 344 | # Like in C++, operators are overloadable and behave properly on many built-in classes (here `+=` on the inner `defaultdict(int)` entries). 345 | left_adjacent_word_counter[head_right][each_x_gram[0]] += parent_counter[each_x_gram] 346 | right_adjacent_word_counter[head_left][each_x_gram[-1]] += parent_counter[each_x_gram] 347 | 348 | def _sort_and_padding(word, adjacent_word_counter): 349 | """ 350 | The `word_counter` can be a `left_word_counter` or a `right_word_counter`. 351 | The `word` is not guaranteed to appear in the given `word_counter`, because not every word has both left 352 | and right neighbor words. 353 | If the `word` exists in the `word_counter`, then its entry is sorted by the frequency of its neighbor words. 354 | Otherwise, its entry is assigned an empty list, i.e. []. 355 | """ 356 | if word in adjacent_word_counter: 357 | adjacent_word_counter[word] = sorted(adjacent_word_counter[word].items(), key=lambda x: x[1], 358 | reverse=True) 359 | else: 360 | adjacent_word_counter[word] = [] 361 | 362 | # Fill empty entries in `left_adjacent_word_counter` and `right_adjacent_word_counter` with an empty list. 363 | for each_word in counter: 364 | _sort_and_padding(each_word, left_adjacent_word_counter) 365 | _sort_and_padding(each_word, right_adjacent_word_counter) 366 | 367 | return left_adjacent_word_counter, right_adjacent_word_counter 368 | 369 | def _cal_aggregation_coef(self, x_gram): 370 | """ 371 | Calculate the aggregation coefficient of a collocation. The aggregation coefficient is a variant of PMI (pointwise 372 | mutual information). 373 | The constituents of a collocation are the characters of a unigram, or the two unigrams of a bigram. 374 | Only unigrams and bigrams are supported for now. 375 | Aggregation coef = P(w1, w2) / (P(w1) * P(w2)) 376 | = (C(w1, w2) / nr_of_bigrams) / ((C(w1) / nr_of_bigrams) * (C(w2) / nr_of_bigrams)) 377 | 378 | In case of overflow and underflow, we divide each C(w) by `nr_of_bigrams` to make sure the results fall into 379 | an acceptable interval. In fact, it doesn't matter whether you divide the occurrences by `nr_of_bigrams`, 380 | `nr_of_unigrams` or the like; any number that scales the coefficient into a safe interval works. 381 | """ 382 | # If the "x" in `x_gram` == 1, i.e., `x_gram` is a string: 383 | # for a unigram (1-gram), only strings containing at least one Chinese character are considered. 384 | # If the unigram fails to meet the above-mentioned condition, it will be assigned an aggregation coefficient 385 | # of `inf`. 386 | # E.g. 387 | # pass: '高通', '王八蛋' 388 | # fail: '2333', 'iphone6plus' (aggregation coefficient = inf).
389 | if isinstance(x_gram, str): 390 | if chinese_pattern.search(x_gram): 391 | numerator = self.unigram_counter[x_gram] / self.nr_of_bigrams 392 | denominator_vector = np.array( 393 | [self.char_counter[each_char] for each_char in x_gram]) / self.nr_of_bigrams 394 | # Fails to meet the "at least one Chinese character" condition. 395 | else: 396 | numerator, denominator_vector = np.inf, 1 397 | # If the "x" in `x_gram` == 2: 398 | # note that new words may also appear in trigrams, but we ignore them for now because they are sparse in 399 | # trigrams and the computation is time-consuming. Besides, potential new words in trigrams will become bigrams in 400 | # the next iteration. (After each iteration, the bigrams considered to be new words are added to the 401 | # dictionary and become unigrams in the next iteration.) 402 | else: 403 | numerator = self.bigram_counter[x_gram] / self.nr_of_bigrams 404 | denominator_vector = np.array( 405 | [self.unigram_counter[each_unigram] for each_unigram in x_gram]) / self.nr_of_bigrams 406 | return numerator / np.prod(np.array(denominator_vector)) 407 | 408 | def _aggregation_coef(self, counter): 409 | """ 410 | Calculate the aggregation coefficient of each word or gram in the given `counter`. 411 | :param counter: Any counter belonging to this class. 412 | :return: 413 | """ 414 | aggre_coef = list() 415 | for idx, each_x_gram in enumerate(counter): 416 | verbose_logging('Calculating aggregation coefficients ... {} / {}', idx, len(counter), self.verbose) 417 | aggre_coef.append((each_x_gram, self._cal_aggregation_coef(each_x_gram))) 418 | return OrderedDict(sorted(aggre_coef, key=lambda x: x[1], reverse=True)) 419 | 420 | # ==================================================================================================== 421 | # The code below is not used for now. 422 | # 423 | # def test(self, method='chi_square'): 424 | # pass 425 | # 426 | # def _t_test(self, bigram): 427 | # """ 428 | # t-test. 429 | # :param bigram: 430 | # :return: t statistics 431 | # """ 432 | # w1, w2 = bigram[0], bigram[1] 433 | # mu = self.freq_of_unigram[w1] * self.freq_of_unigram[w2] 434 | # sample_mean = self.freq_of_bigram[bigram] 435 | # stats_t = (sample_mean - mu) / np.sqrt(sample_mean / self.n_of_bigram) 436 | # return stats_t 437 | # 438 | # def _chi_square_test(self, bigram): 439 | # w1, w2 = bigram[0], bigram[1] 440 | # O11 = self.bigram_counter[bigram] 441 | --------------------------------------------------------------------------------