├── README.md
└── kenlm_model.py


/README.md:
--------------------------------------------------------------------------------
  1 | # py-kenlm-model
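  2 | python | Using the statistical language model toolkit kenlm efficiently: new word discovery, word segmentation, spelling correction, and more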
  3 | 
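  4 | A while ago I saw Su Jianlin (苏神, "Su" below) mention kenlm in [Rewriting the earlier new word discovery algorithm: faster and better new word discovery](https://spaces.ac.cn/archives/6920). I had played with kenlm before without paying much attention, but it turned out to be really handy for some large-scale text problems I ran into recently. A few days ago I also hit several pitfalls that almost made me give up; once they were solved, I figured I owed it to the two days I had lost to understand kenlm properly.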
  5 | 
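  6 | **Advantages of kenlm (from [Training statistical language models with the kenlm toolkit](https://blog.csdn.net/HHTNAN/article/details/84231733)):**
  7 | The language model is trained with the traditional "counting + smoothing" approach, using the kenlm toolkit. It is fast and memory-efficient and, most importantly, it can use multiple cores under an open-source license.
  8 | kenlm is a language model toolkit written in C++; it is fast, has a small memory footprint, and also provides a Python interface.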
  9 | 
 10 | Additional libraries required:
 11 | ```
 12 | kenlm
 13 | pypinyin
 14 | pycorrector
 15 | ```
 16 | 
 17 | My code is on GitHub. It is only roughly organized, and contributions are welcome:
 18 | [mattzheng/py-kenlm-model](https://github.com/mattzheng/py-kenlm-model)
 19 | 
 20 | For new word discovery, I forked Su's repository and made minor tweaks:
 21 | 
 22 | [mattzheng/word-discovery](https://github.com/mattzheng/word-discovery)
 23 | 
 24 | Blog post:
 25 | 
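 26 | [python | Using the statistical language model toolkit kenlm efficiently: new word discovery, word segmentation, spelling correction, and more](https://mattzheng.blog.csdn.net/article/details/101512616)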
 27 | 
 28 | 
 29 | 
 30 | ----------
 31 | 
 32 | 
 33 | 
 34 | # 1 Installing kenlm
 35 | 
 36 | Build from source: download [kpu/kenlm](https://github.com/kpu/kenlm), then compile it:
 37 | ```bash
 38 | mkdir -p build
 39 | cd build
 40 | cmake ..
 41 | make -j 4
 42 | ```
 43 | After the build, most of the useful executables end up in `build/bin`; they are used later on:
 44 | ![image](https://img-blog.csdnimg.cn/20190927094924188.png)
 45 | Installing the Python package:
 46 | 
 47 | ```bash
 48 | pip install https://github.com/kpu/kenlm/archive/master.zip
 49 | ```
 50 | Basic usage:
 51 | 
 52 | ```python
 53 | import kenlm
 54 | model = kenlm.Model('lm/test.arpa')
 55 | print(model.score('this is a sentence .', bos = True, eos = True))
 56 | ```
 57 | 
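 58 | Here comes a pitfall: I had installed kenlm inside a Docker container, and after I accidentally restarted the container, kenlm stopped working.
 59 | At the time I did not know how to rebuild it properly, so I simply reran `cmake ..` + `make -j 4`, but the resulting binaries failed at runtime, complaining about missing dependencies: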
 60 | 
 61 | ```python
 62 | 
 63 | libboost_program_options.so.1.54.0: cannot open shared object file: No such file or directory
 64 | ```
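 65 | I even went and installed the missing library by hand on Ubuntu, only to get yet another dependency error.
 66 | 
 67 | (N more futile attempts omitted here...)
 68 | 
 69 | If you see: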
 70 | 
 71 | ```python
 72 | -- Could NOT find BZip2 (missing:  BZIP2_LIBRARIES BZIP2_INCLUDE_DIR) 
 73 | -- Could NOT find LibLZMA (missing:  LIBLZMA_INCLUDE_DIR LIBLZMA_LIBRARY LIBLZMA_HAS_AUTO_DECODER LIBLZMA_HAS_EASY_ENCODER LIBLZMA_HAS_LZMA_PRESET)
 74 | 
 75 | ```
 76 | install the missing packages:
 77 | 
 78 | ```bash
 79 | sudo apt install libbz2-dev
 80 | sudo apt install liblzma-dev
 81 | ```
 82 | 
 83 | Later experiments showed that deleting the `build` folder and rerunning `cmake ..` + `make -j 4` from scratch is enough.
 84 | 
 85 | 
 86 | ----------
 87 | 
 88 | # 2 Using kenlm statistical language models
 89 | 
 90 | ## 2.1 Training with `lmplz`
 91 | ### 2.1.1 Two ways to train
 92 | Training is done with `build/bin/lmplz`, generally in one of two ways:
 93 | 
 94 | ![image](https://img-blog.csdnimg.cn/20190927143540315.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9tYXR0emhlbmcuYmxvZy5jc2RuLm5ldA==,size_16,color_FFFFFF,t_70)
 95 | 
 96 | (1) Feeding the corpus through a pipe
 97 | 
 98 | The corpus is printed to stdout and piped into `lmplz`, as described in Su's earlier post [Chinese Word Segmentation Series, part 5: Unsupervised segmentation based on language models](https://spaces.ac.cn/archives/3956#%E5%AE%9E%E8%B7%B5%EF%BC%9A%E8%AE%AD%E7%BB%83):
 99 | 
100 | ```bash
101 | python p.py|./kenlm/bin/lmplz -o 4 > weixin.arpa
102 | ```
103 | where p.py is:
104 | 
105 | ```python
106 | import pymongo
107 | db = pymongo.MongoClient().weixin.text_articles
108 | 
109 | for text in db.find(no_cursor_timeout=True).limit(500000):
110 |     print(' '.join(text['text']))  # space-separate the characters of each article
111 | ```
112 | 
113 | (2) Generating the corpus file in advance
114 | 
115 | Run `lmplz` directly from the command line on a saved corpus file:
116 | ```bash
117 | bin/lmplz -o 3 --verbose_header --text ../text-18-03/text_18-03-AU.txt --arpa MyModel/log.arpa
118 | ```
119 | Rough meaning of the options:
120 | 
121 | ```
122 | -o n: use n-grams up to order n
123 | --verbose_header: add statistics to the header of the generated file
124 | --text text_file: the txt file that holds the corpus
125 | --arpa: the output arpa file
126 | -S [ --memory ] arg (=80%)  Sorting memory: how much memory to reserve
127 | --skip_symbols : Treat <s>, </s>, and <unk> as whitespace instead of throwing an  exception
128 | ```
129 | 
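130 | The prepared corpus does not need start or end symbols. Three special tokens deserve an introduction:
131 | `<s>`, `</s>` and `<unk>`.
132 | `<s>` and `</s>` are used as a pair: when computing probabilities the model processes every sentence by adding this pair of markers at its start and end,
133 | so the positional information of the sentence start and end is also taken into account.
134 | For example, `“我 喜欢 吃 苹果” --> "<s> 我 喜欢 吃 苹果 </s>"`.
135 | `<unk>` stands for an unknown word; out-of-vocabulary (OOV) words can be scored with its value.
136 | 
137 | For reference,
138 | without start/end markers: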
139 | ```
140 | W h o o   后   拱 辰 享 水   水 妍 护 肤 套 装 整 套 质 地 都 比 较 清 爽 
141 |  滋 润 
142 |  侧 重 保 湿 
143 |  适 合 各 种 肤 质 
144 |  调 节 肌 肤 水 平 衡 
145 |  它 还 具 有 修 复 功 效 
146 |  提 亮 肤 色 我 是 油 性 肤 质 用 起 来 也 一 点 也 不 觉 得 油 腻 
147 |  味 道 淡 淡 的 还 很 好 闻 
148 |  也 很 好 吸 收 
149 |  质 地 清 爽 
150 | ```
151 | With start/end markers:
152 | 
153 | ```python
154 | <s> 3 乙 方 应 依 据 有 关 法 律 规 定 </s>
155 | <s> 对 甲 方 为 订 立 和 履 行 本 合 同 向 乙 方 提 供 的 有 关 非 公 开 信 息 保 密 </s>
156 | <s> 但 下 列 情 形 除 外 </s>
157 | <s> 1 贷 款 人 有 权 依 据 有 关 法 律 法 规 或 其 他 规 范 性 文 件 的 规 定 或 金 融 监 管 机 构 的 要 求 </s>
158 | ```
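The corpus format shown above (one clause per line, characters separated by spaces) can be sketched in a few lines of plain Python. This is a simplified illustration, not the repository's `text_generator`: the helper name `to_char_lines` and its `add_boundaries` flag are made up for this example, and unlike the repository code it drops literal spaces instead of keeping them. Note that, per the option list above, a corpus that already contains `<s>` and `</s>` needs `--skip_symbols` when passed to `lmplz`.

```python
import re

def to_char_lines(text, add_boundaries=False):
    """Split text on punctuation, then space-separate the characters of each piece."""
    # Anything that is not a CJK character, digit, ASCII letter or space ends a line.
    pieces = re.split(r'[^\u4e00-\u9fa50-9a-zA-Z ]+', text)
    lines = []
    for piece in pieces:
        chars = [c for c in piece if c != ' ']  # simplification: spaces are dropped
        if not chars:
            continue
        line = ' '.join(chars)
        if add_boundaries:
            line = '<s> ' + line + ' </s>'
        lines.append(line)
    return lines

print(to_char_lines('质地清爽，滋润'))
print(to_char_lines('质地清爽，滋润', add_boundaries=True))
```

Each returned string is one corpus line; writing them to a file joined by newlines gives something `lmplz --text` can read.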
159 | 
160 | The training procedure itself is explained in this post: [An illustrated guide to how N-gram language models work, using kenlm as an example](https://blog.csdn.net/asrgreek/article/details/81979194)
161 | 
162 | ![image](https://img-blog.csdnimg.cn/20190927143223581.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9tYXR0emhlbmcuYmxvZy5jc2RuLm5ldA==,size_16,color_FFFFFF,t_70)
163 | 
164 | 
165 | 
166 | ### 2.1.2 The generated arpa file explained
167 | Source: [Training and using kenlm language models](https://www.bbsmax.com/A/WpdKmENJVQ/)
168 | The generated arpa file looks like this:
169 | 
170 | ```python
171 | 
172 | 
173 |     \1-grams:
174 |     -6.5514092	<unk>	0
175 |     0	<s>	-2.9842114
176 |     -1.8586434	</s>	0
177 |     -2.88382	!	-2.38764
178 |     -2.94351	world	-0.514311
179 |     -2.94351	hello	-0.514311
180 |     -6.09691	guys	-0.15553
181 |      
182 |     \2-grams:
183 |     -3.91009	world !	-0.351469
184 |     -3.91257	hello world	-0.24
185 |     -3.87582	hello guys	-0.0312
186 |      
187 |     \3-grams:
188 |     -0.00108858	hello world !
189 |     -0.000271867	, hi hello !
190 |      
191 |     \end\
192 | 
193 | 
194 | ```
195 | 
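196 | Describing this file requires one new concept, the back-off weight `back_pro` (see [language model](http://blog.csdn.net/visionfans/article/details/50131397)).
197 | The three fields are `Pro, word, back_pro`: the log probability, the n-gram itself, and the back-off weight.
198 | Note: all values in an arpa file are base-10 logarithms.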
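To make the three fields concrete, here is a minimal sketch that parses a small ARPA fragment and applies the standard back-off rule for bigrams: if the bigram is listed, use its log probability; otherwise add the back-off weight of the first word to the unigram log probability of the second. The function names `parse_arpa` and `bigram_logp` are invented for this example, and the parser assumes tab-separated fields as in the listing above; it is not how kenlm itself reads the file.

```python
def parse_arpa(text):
    """Parse ARPA text into {ngram tuple: (log10 prob, log10 back-off)}."""
    table = {}
    in_section = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith('\\'):
            # Only "\N-grams:" headers open a section; "\data\" and "\end\" do not.
            in_section = line.endswith('-grams:')
            continue
        if not in_section:
            continue
        fields = line.split('\t')
        logp = float(fields[0])
        words = tuple(fields[1].split())
        # The highest order has no back-off column; treat it as 0.
        backoff = float(fields[2]) if len(fields) > 2 else 0.0
        table[words] = (logp, backoff)
    return table

def bigram_logp(table, w1, w2):
    """log10 p(w2 | w1), backing off to the unigram when the bigram is unseen."""
    if (w1, w2) in table:
        return table[(w1, w2)][0]
    return table[(w1,)][1] + table[(w2,)][0]

ARPA = '\n'.join([
    '\\1-grams:',
    '-2.94351\tworld\t-0.514311',
    '-2.94351\thello\t-0.514311',
    '',
    '\\2-grams:',
    '-3.91257\thello world\t-0.24',
    '',
    '\\end\\',
])

table = parse_arpa(ARPA)
print(bigram_logp(table, 'hello', 'world'))  # listed bigram
print(bigram_logp(table, 'world', 'hello'))  # unseen bigram, uses back-off
print(10 ** bigram_logp(table, 'hello', 'world'))  # back to a plain probability
```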
199 | 
200 | 
201 | 
202 | 
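203 | ### 2.1.3 Training pitfalls explained
204 | 
205 | Here is the key point: `-S` matters a lot. It defaults to `80%`, so if 20% or more of the machine's memory is already in use, training can run out of memory even on a 10-sentence corpus, which is absurd: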
206 | 
207 | ```python
208 | #34304 what():  /mnt/mNLP/kg/kenlm/util/scoped.cc:20 in void* util::{anonymous}::InspectAddr(void*, std::size_t, const char*) threw MallocException because `!addr && requested'.
209 | #Cannot allocate memory for 84881776616 bytes in malloc
210 | ```
211 | So set the memory limit explicitly! There are also quite a few other options that can cause surprises:
212 | 
213 | Option   | Description
214 | -------- | :-----
215 | minimum_block arg (=8K)  | Minimum block size to allow
216 | sort_block arg (=64M)  | Size of IO operations for sort  (determines arity)
217 | block_count arg (=2)  | Block count (per order)
218 | interpolate_unigrams [=arg(=1)] (=1) | Interpolate the unigrams (default) as  opposed to giving lots of mass to <unk>  like SRI.  If you want SRI's behavior with a large <unk> and the old lmplz  default, use --interpolate_unigrams 0.
219 | discount_fallback [=arg(=0.5 1 1.5)] | The closed-form estimate for Kneser-Ney  discounts does not work without  singletons or doubletons. 
220 | ... (and quite a few more) | ...
221 | 
222 | You may also hit this error:
223 | 
224 | ```python
225 | Unigram tokens 153 types 116
226 |     === 2/5 Calculating and sorting adjusted counts ===
227 |     Chain sizes: 1:1392 2:10964970496 3:20559319040 4:32894910464
228 |     /mnt/mNLP/kg/kenlm/lm/builder/adjust_counts.cc:52 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `s.n[j] == 0'.
229 |     Could not calculate Kneser-Ney discounts for 1-grams with adjusted count 4 because we didn't observe any 1-grams with adjusted count 3; Is this small or artificial data?
230 |     Try deduplicating the input.  To override this error for e.g. a class-based model, rerun with --discount_fallback
231 | ```
232 | The exit status is 34304. The main cause is that the corpus is too small, so add more text when training.
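Both pitfalls in this section come down to command-line options, so it can help to build the `lmplz` command in one place with the memory limit always set. The sketch below only assembles the argument list; the helper name `build_lmplz_command` and the default executable path are assumptions for illustration, while `-o`, `-S`, `--text`, `--arpa` and `--discount_fallback` are the options quoted in the text and error message above.

```python
import shlex

def build_lmplz_command(corpus, arpa, order=4, memory='50%',
                        discount_fallback=False,
                        lmplz='./kenlm/build/bin/lmplz'):
    """Return the lmplz argument list with an explicit memory limit."""
    args = [lmplz, '-o', str(order), '-S', memory, '--text', corpus, '--arpa', arpa]
    if discount_fallback:
        # Only meant for tiny or artificial corpora, as the error message suggests.
        args.append('--discount_fallback')
    return args

cmd = build_lmplz_command('output/test2.corpus', 'output/test2.arpa',
                          discount_fallback=True)
print(' '.join(shlex.quote(a) for a in cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems and raises an exception on failure, which is easier to notice than a numeric status from `os.system`.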
233 | 
234 | ## 2.2 Compressing the model into binary with `build_binary`
235 | The generated arpa file can be fairly large; converting it to binary can reduce the file size:
236 | 
237 | ```bash
238 | bin/build_binary -s lm.arpa lm.bin
239 | ```
240 | This converts the arpa file into a binary file, which compresses it and speeds up loading in Python later.
241 | 
242 | ![image](https://img-blog.csdnimg.cn/20190927143614915.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9tYXR0emhlbmcuYmxvZy5jc2RuLm5ldA==,size_16,color_FFFFFF,t_70)
243 | Although the size did not change much in this case, the binary file loads much faster in Python.
244 | 
245 | The command may fail with exit status 256, for this reason:
246 | 
247 | ```python
248 | No such file or directory while opening output/test2.arpa
249 | ```
250 | 
251 | 
252 | 
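253 | ## 2.3 Counting n-grams with kenlm's `count_ngrams`
254 | This is what Su uses in [Rewriting the earlier new word discovery algorithm: faster and better new word discovery](https://spaces.ac.cn/archives/6920).
255 | The executable is at `build/bin/count_ngrams`: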
256 | ```python
257 |     # Counts n-grams from standard input.
258 |     # corpus count:
259 |     #   -h [ --help ]                     Show this help message
260 |     #   -o [ --order ] arg                Order
261 |     #   -T [ --temp_prefix ] arg (=/tmp/) Temporary file prefix
262 |     #   -S [ --memory ] arg (=80%)        RAM
263 |     #   --read_vocab_table arg            Vocabulary hash table to read.  This should
264 |     #                                     be a probing hash table with size at the 
265 |     #                                     beginning.
266 |     #   --write_vocab_list arg            Vocabulary list to write as null-delimited 
267 |     #                                     strings.
268 | ```
269 | It also has that pesky `-S`, so watch out.
270 | Example command:
271 | ```bash
272 | ./count_ngrams -S 50% -o 4 --write_vocab_list output/test2.chars <output/test2.corpus >output/test2.ngrams
273 | ```
274 | Here the options `-S` and `-o` mean the same as before.
275 | The input is the pre-generated text `output/test2.corpus`, and two files are produced: `output/test2.chars` and `output/test2.ngrams`, the vocabulary file and the n-gram count file respectively.
276 | 
277 | Any non-zero return value means something went wrong. I have run into these myself:
278 | 
279 | Exit status | Cause
280 | -------- | :-----
281 | 32256| Cannot run: permission denied, probably because count_ngrams lacks execute permission
282 | 32512| Dependency error: /count_ngrams: error while loading shared libraries: libboost_program_options.so.1.58.0: cannot open shared object file: No such file or directory
283 | 34304  | Memory error: Cannot allocate memory for 84881776616 bytes in malloc
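The numbers in this table look odd because `os.system` on Linux returns the raw wait status, which stores the exit code in the high byte. Shifting right by 8 bits recovers the usual shell exit codes: 126 (found but not executable), 127 (command not found or library failed to load) and 134 (128 + SIGABRT, an aborted process). A minimal sketch, with the helper name `decode_status` and the `HINTS` table made up for this example:

```python
def decode_status(status):
    """Split an os.system() return value (POSIX wait status) into exit code and signal."""
    exit_code = status >> 8    # high byte: exit code reported by the shell
    signal_no = status & 0x7F  # low 7 bits: signal that killed the shell itself, if any
    return exit_code, signal_no

# Rough interpretation of the exit codes seen in this article.
HINTS = {
    1: 'generic failure, e.g. input file not found',
    126: 'found but not executable (check permissions)',
    127: 'command not found, or a shared library failed to load',
    134: 'aborted (128 + SIGABRT), e.g. an uncaught C++ exception',
}

for status in (256, 32256, 32512, 34304):
    code, _ = decode_status(status)
    print(status, '->', code, HINTS.get(code, 'unknown'))
```

On Python 3.9 and later, `os.waitstatus_to_exitcode(status)` does the same conversion on Unix.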
284 | 
285 | 
286 | ----------
287 | 
288 | 
289 | # 3 Basic use of a kenlm model
290 | 
291 | Reference: [kenlm/python/example.py](https://github.com/kpu/kenlm/blob/master/python/example.py)
292 | 
293 | ## 3.1 The model.score function
294 | The Python package is ready to use (see section 1 for installation). A quick test:
295 | 
296 | ```python
297 | import kenlm
298 | model = kenlm.Model('lm/test.arpa')
299 | print(model.score('this is a sentence .', bos = True, eos = True))
300 | ```
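301 | Notes:
302 | The language model assigns every sentence a probability between 0 and 1; taking the logarithm of that probability gives a score in (-∞, 0], and the closer the score is to 0 the better.
303 | `score` returns a log probability, i.e. log10(p('微 信')); the input string may be GBK or UTF-8 encoded.
304 | `bos=False, eos=False` means the begin- and end-of-sentence markers are not added automatically.
305 | Probabilities are generally log-transformed, because otherwise the product of many probabilities becomes too small (it underflows) and inf values show up in the program.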
306 | 
307 | > @param sentence is a string (do not use boundary symbols)
308 | @param bos should kenlm add a bos state
309 | @param eos should kenlm add an eos state
310 | Source: https://github.com/kpu/kenlm/blob/master/python/kenlm.pyx
311 | 
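The remark about log transforms can be checked without kenlm at all. The sketch below multiplies 100 small probabilities directly (the product underflows to 0.0) and compares that with summing their base-10 logs, then turns a log10 score into a perplexity. The helper `perplexity` is written here for illustration; `n_tokens` is assumed to be the number of scored tokens, i.e. the number of words plus one when `</s>` is scored as well. kenlm's Python wrapper also exposes a `perplexity` method on the model, which is the one to use in practice.

```python
import math

def perplexity(log10_score, n_tokens):
    """Perplexity from a base-10 log score spread over n_tokens scored tokens."""
    return 10.0 ** (-log10_score / n_tokens)

probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p  # shrinks below the smallest representable float

log_sum = sum(math.log10(p) for p in probs)

print(product)   # the plain product has underflowed
print(log_sum)   # the log-space sum is still perfectly usable
print(perplexity(log_sum, len(probs)))
```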
312 | This function can be used to test how fluent a phrase or sentence is:
313 | 
314 | ```python
315 | text = '再 心 每 天也 不 会 担 个 大 油 饼 到 了 下  午 顶 着 一 了 '
316 | model.score(text, bos=True, eos=True)
317 | ```
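318 | Note that the tokens must be separated by spaces.
319 | 
320 | ## 3.2 The model.full_scores function
321 | `score` is a condensed version of `full_scores`; for each token, `full_scores` yields `(prob, ngram length, oov)`,
322 | that is: the log probability, the length of the matched n-gram, and whether the token is OOV.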
323 | 
324 | ```
325 | # Show scores and n-gram matches
326 | sentence = '盘点不怕被税的海淘网站❗️海淘向来便宜又保真，比旗舰店、专柜和代购好太多！'
327 | text = ' '.join(sentence)  # space-separate the characters, as kenlm_model.parse_text does
328 | words = ['<s>'] + text.split() + ['</s>']
329 | for i, (prob, length, oov) in enumerate(model.full_scores(text)):
330 |     print('{0} {1}: {2}'.format(prob, length, ' '.join(words[i+2-length:i+2])))
331 |     if oov:
332 |         print('\t"{0}" is an OOV'.format(words[i+1]))
333 | 
334 | # Find out-of-vocabulary words
335 | for w in words:
336 |     if not w in model:
337 |         print('"{0}" is an OOV'.format(w))
338 | ```
339 | 
340 | ## 3.3 kenlm.State(): state transition probabilities
341 | 
342 | ```python
343 | '''
344 | Accumulating scores across states
345 | score defaults to bos = True and eos = True.  
346 | Here we'll check without the endof sentence marker.  
347 | '''
348 | #Stateful query
349 | state = kenlm.State()
350 | state2 = kenlm.State()
351 | #Use <s> as context.  If you don't want <s>, use model.NullContextWrite(state).
352 | model.BeginSentenceWrite(state)
353 | ```
354 | Then:
355 | 
356 | ```python
357 | accum = 0.0
358 | accum += model.BaseScore(state, "海", state2)
359 | print(accum)
360 | accum += model.BaseScore(state2, "淘", state)
361 | print(accum)
362 | accum += model.BaseScore(state, "</s>", state2)
363 | print(accum)
364 | 
365 | >>>-3.0864107608795166
366 | >>>-3.6341209411621094
367 | >>>-4.645392656326294
368 | 
369 | model.score("海 淘", eos = False)
370 | >>> -3.381103515625
371 | ```
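372 | This experiment shows that the score accumulated up to state2 is roughly the same as the value returned by `score`. There is much more to dig into in this module, such as NSP tasks.
373 | 
374 | 
375 | ## 3.4 Sentence fluency checking
376 | For fluency, `score` is all you need: just score the whole sentence, which must be space-separated.
377 | There is also a project that wraps this in an API, for reference: [DRUNK2013/lm-ken](https://github.com/DRUNK2013/lm-ken)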
378 | 
379 | 
380 | 
381 | ----------
382 | 
383 | 
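384 | # 4 Advanced kenlm usage - word segmentation
385 | Based on: [Chinese Word Segmentation Series, part 5: Unsupervised segmentation based on language models](https://spaces.ac.cn/archives/3956#%E5%AE%9E%E8%B7%B5%EF%BC%9A%E8%AE%AD%E7%BB%83)
386 | Su's original code (written for Python 2):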
387 | 
388 | ```python
389 | import kenlm
390 | model = kenlm.Model('weixin.klm')
391 | 
392 | from math import log10
393 | 
394 | # The transition probabilities here were summarized by hand; overall, the goal is to make long words less likely.
395 | trans = {'bb':1, 'bc':0.15, 'cb':1, 'cd':0.01, 'db':1, 'de':0.01, 'eb':1, 'ee':0.001}
396 | trans = {i:log10(j) for i,j in trans.iteritems()}
397 | 
398 | def viterbi(nodes):
399 |     paths = nodes[0]
400 |     for l in range(1, len(nodes)):
401 |         paths_ = paths
402 |         paths = {}
403 |         for i in nodes[l]:
404 |             nows = {}
405 |             for j in paths_:
406 |                 if j[-1]+i in trans:
407 |                     nows[j+i]= paths_[j]+nodes[l][i]+trans[j[-1]+i]
408 |             k = nows.values().index(max(nows.values()))
409 |             paths[nows.keys()[k]] = nows.values()[k]
410 |     return paths.keys()[paths.values().index(max(paths.values()))]
411 | 
412 | def cp(s):
413 |     return (model.score(' '.join(s), bos=False, eos=False) - model.score(' '.join(s[:-1]), bos=False, eos=False)) or -100.0
414 | 
415 | def mycut(s):
416 |     nodes = [{'b':cp(s[i]), 'c':cp(s[i-1:i+1]), 'd':cp(s[i-2:i+1]), 'e':cp(s[i-3:i+1])} for i in range(len(s))]
417 |     tags = viterbi(nodes)
418 |     words = [s[0]]
419 |     for i in range(1, len(s)):
420 |         if tags[i] == 'b':
421 |             words.append(s[i])
422 |         else:
423 |             words[-1] += s[i]
424 |     return words
425 | ```
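426 | This turns segmentation into a tagging problem. If the character-level language model goes up to 4-grams, it amounts to the following character tagging:
427 | 
428 | ```python
429 | b: a single-character word, or the first character of a multi-character word
430 | c: the second character of a multi-character word
431 | d: the third character of a multi-character word
432 | e: the remaining characters of a multi-character word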
433 | ```
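To see what the b/c/d/e tags mean in practice, here is a small stand-alone sketch that checks a tag sequence against the allowed transitions and converts it back into words. It mirrors the last loop of `mycut` but takes the tags as given instead of running Viterbi over language model scores; the helper names `is_valid` and `tags_to_words` and the example tag string are made up for illustration.

```python
# Allowed tag transitions, the keys of the `trans` table used above.
TRANSITIONS = {'bb', 'bc', 'cb', 'cd', 'db', 'de', 'eb', 'ee'}

def is_valid(tags):
    """True if the sequence starts a word and every adjacent tag pair is allowed."""
    return tags[:1] == 'b' and all(a + b in TRANSITIONS for a, b in zip(tags, tags[1:]))

def tags_to_words(sentence, tags):
    """Start a new word at every 'b' tag and append all other characters to the current word."""
    words = []
    for ch, tag in zip(sentence, tags):
        if tag == 'b' or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words

tags = 'bcbcdbc'
print(is_valid(tags))
print(tags_to_words('这瓶洗面奶不错', tags))
```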
434 | I left it essentially unchanged, apart from minor tweaks to make it run on Python 3. My module is used like this:
435 | ```
436 | # Initialization
437 | km = kenlm_model()
438 | km.model = km.load_model('output/test2.klm')
439 | ```
440 | Lookup and segmentation:
441 | ```
442 | sentence = '这瓶洗棉奶用着狠不错'
443 | km.mycut(sentence)
444 | ```
445 | Of course, the segmentation module is just for fun.
446 | 
447 | 
448 | ----------
449 | 
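450 | # 5 Advanced kenlm usage - new word discovery
451 | 
452 | This follows Su's [Rewriting the earlier new word discovery algorithm: faster and better new word discovery](https://spaces.ac.cn/archives/6920). Most of it matches Su's code, tweaked for Python 3 and extended with the option of calling a word segmenter. It needs to be trained first:
453 | 
454 | ## 5.1 Training corpus
455 | **Step 1: Initialize the model**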
456 | 
457 | ```python
458 | km = kenlm_model(save_path = 'output',project = 'test2',\
459 |                  memory = '50%',min_count = 2,order = 4,\
460 |                  skip_symbols = '"<unk>"',kenlm_model_path = './kenlm/build/bin/')
461 | ```
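462 | Where:
463 | - save_path: where the related files are stored, since many temporary files are generated in one go
464 | - project: the project name, which makes it easier to manage projects
465 | - memory: how much memory the kenlm tools may use when called
466 | - min_count = 2: the minimum frequency when filtering n-grams
467 | - order = 4: the n in n-grams
468 | - skip_symbols = `'"<unk>"'`: Treat `<s>`, `</s>`, and `<unk>` as whitespace instead of throwing an exception
469 | - kenlm_model_path = './kenlm/build/bin/': where the compiled kenlm executables live
470 | 
471 | **Step 2: Prepare the training data**
472 | For training I used five sentences (**in practice prepare more, otherwise too little text triggers an error**):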
473 | 
474 | ```python
475 | text_list = ['Whoo 后 拱辰享水 水妍护肤套装整套质地都比较清爽，滋润，侧重保湿，适合各种肤质',
476 |  '盘点不怕被税的海淘网站❗️海淘向来便宜又保真，比旗舰店、专柜和代购好太多！还能体验海淘乐趣~外网需要双币信用卡，往往需要转运，北上地区容易被税。',
477 |  '学生用什么洗面奶好？学生党必备的这六款性价比最高的洗面奶是什么？',
478 |  '国货大宝。启初。……使用分享修复：玉泽或至本（第三代玉泽身体乳没有了麦冬根和神经酰胺）。芦荟胶（含酒精，不锁水，偶尔敷一下，皮肤会越用越干）。Swisse蜂蜜面膜（清洁鼻子，效果肉眼可见，不能常用）。',
479 |  '资生堂悦薇乳液，会回购。夏天用略油腻，冬天用刚好。真的有紧致感，28岁，眼部有笑纹，其他地方还可以。这是第二个空瓶。冬天会回购。没有美白效果。(资生堂悦薇)']
480 | 
481 | km.write_corpus(km.text_generator(text_list,jieba_cut = False), km.corpus_file) # dump the corpus to a text file
482 | >>> success writed
483 | ```
484 | 
485 | The text is parsed into:
486 | ```
487 | W h o o   后   拱 辰 享 水   水 妍 护 肤 套 装 整 套 质 地 都 比 较 清 爽 
488 |  滋 润 
489 |  侧 重 保 湿 
490 |  适 合 各 种 肤 质
491 | ```
492 | and saved to the file `km.corpus_file`.
493 | 
494 | **Step 3: Count the model's n-grams**
495 | ```python
496 | # Count the model's n-grams
497 | km.count_ngrams() # count n-grams with kenlm
498 | 
499 | >>>success,code is : 0 , 
500 |  code is : ./kenlm/build/bin/count_ngrams -S 50% -o 4 --write_vocab_list output/test2.chars <output/test2.corpus >output/test2.ngrams 
501 | 
502 | ```
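503 | A non-zero status code means an error. Error messages are hard to see when the call is made from Python, so I also return the exact command that was executed; run it in a terminal yourself to locate the problem. This step generates `.chars` and `.ngrams` from the `.corpus` file.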
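The `.ngrams` file is binary, and `read_ngrams` in `kenlm_model.py` treats it as fixed-size records of `order * 4 + 8` bytes: `order` 4-byte word ids followed by an 8-byte count, with ids 0 to 2 skipped as special symbols. The sketch below reads records in that layout from an in-memory byte string. It is an illustration under assumptions: the byte order (little-endian) and unsigned types are my guess at a reasonable reading rather than something the kenlm documentation states here, and the toy data is packed by hand instead of coming from a real run.

```python
import struct

def iter_records(data, order):
    """Yield (word_ids, count) from raw bytes laid out as order ids + one count."""
    fmt = '<%dIQ' % order        # assumed: little-endian, 32-bit ids, 64-bit count
    size = struct.calcsize(fmt)  # order * 4 + 8 bytes, no padding
    for offset in range(0, len(data) - size + 1, size):
        *word_ids, count = struct.unpack_from(fmt, data, offset)
        yield word_ids, count

# Toy stand-in for the .chars vocabulary and the .ngrams bytes.
vocab = ['<unk>', '<s>', '</s>', '洗', '面', '奶']
data = struct.pack('<4IQ', 1, 3, 4, 5, 7) + struct.pack('<4IQ', 3, 4, 5, 2, 2)

records = [(''.join(vocab[i] for i in ids if i > 2), n)
           for ids, n in iter_records(data, 4)]
print(records)
```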
504 | 
505 | 
506 | ## 5.2 Loading the model and using it
507 | `read_ngrams` and `filter_ngrams` both come from Su's code:
508 | ```python
509 | ngrams,total = km.read_ngrams()
510 | ngrams_2 = km.filter_ngrams(ngrams, total, min_pmi=[0, 1, 3, 5])
511 | ngrams_2
512 | ```
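513 | `read_ngrams` reads the files produced by the previous step; `ngrams` is a list with one counter per order (1-gram up to `order`-gram), each mapping a word to its frequency.
514 | `filter_ngrams` filters the n-grams. The list passed as `min_pmi` (for example `[0, 2, 4, 6]`, or `[0, 1, 3, 5]` in the call above) holds the mutual information thresholds: the first 0 is meaningless padding, and the remaining values are the thresholds for 2-grams, 3-grams and 4-grams respectively. Monotonically increasing values generally work best.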
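The mutual information check can be illustrated on toy counts. For a candidate word, every way of splitting it into a left and a right part is scored as log(total * count(word) / (count(left) * count(right))), and the weakest split decides whether the word is kept. This follows the formula used in `filter_ngrams`, but the function name `min_pmi` and the counts below are invented for the example, and unlike the repository code it expects every part to be present in `counts`.

```python
import math

def min_pmi(word, counts, total):
    """Lowest pointwise mutual information over all two-way splits of word."""
    scores = []
    for k in range(1, len(word)):
        left, right = word[:k], word[k:]
        ratio = total * counts[word] / (counts[left] * counts[right])
        scores.append(math.log(ratio))
    return min(scores)

# Invented counts for illustration only.
counts = {'洗面奶': 20, '洗': 40, '面奶': 20, '洗面': 20, '奶': 25}
total = 1000

score = min_pmi('洗面奶', counts, total)
print(score)
print(score >= 3)  # compare against the 3-gram threshold from min_pmi=[0, 1, 3, 5]
```

Here the split 洗 + 面奶 gives a ratio of 25 and the split 洗面 + 奶 gives 40, so the score is log(25), which passes a threshold of 3.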
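515 | What happens after obtaining these n-grams differs a little from Su's approach. My logic is:
516 | ```
517 | Whether jieba would split the candidate apart
518 | And restricted by certain conditions: part-of-speech constraints + a few stop characters
519 | ```
520 | Usage: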
521 | ```
522 | km.word_discovery(ngrams_2)
523 | 
524 | >{'缓痘痘': 2,
525 |  '奶参考': 2,
526 |  '中文界': 5,
527 |  '文界面': 5,
528 |  '界面支': 5,
529 |  '蜂蜜面': 2,
530 |  '20英': 2,
531 |  '面奶参': 2,
532 |  '舒缓痘': 2,
533 |  '0英': 4}
534 | ```
535 | Here I return the words together with their frequencies, which makes it easy to draw a word cloud.
536 | 
537 | ----------
538 | 
539 | # 6 Advanced kenlm usage - spelling correction
540 | Partly based on:
541 | [pycorrector](https://github.com/shibing624/pycorrector)
542 | [Chinese text correction algorithms: notes on fixing wrongly written characters](https://zhuanlan.zhihu.com/p/40806718)
543 | 
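544 | I have recently been working on intelligent writing and have a real need for error correction. I came across an article that uses kenlm for correction, although only for the simple a/an distinction, and built a simple experiment on that basis.
545 | Correction tasks generally split into two parts:
546 | 
547 | - Detecting errors
548 | - Fixing errors
549 | 
550 | For spelling correction the library I recommend is [pycorrector](https://github.com/shibing624/pycorrector), which has many advantages:
551 | 
552 | - Actively maintained
553 | - Lets you load your own custom rules
554 | - Offers deep learning options
555 | 
556 | It does seem to require tensorflow to be installed first, though, so keep that in mind if you want to try it. Common error types in Chinese text correction include: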
557 | 
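558 | - Homophone errors, e.g. 配副眼睛-配副眼镜
559 | - Confusable similar-sounding words, e.g. 流浪织女-牛郎织女
560 | - Reversed word order, e.g. 伍迪艾伦-艾伦伍迪
561 | - Missing words to complete, e.g. 爱有天意-假如爱有天意
562 | - Similar-shaped character errors, e.g. 高梁-高粱
563 | - Full pinyin, e.g. xingfu-幸福
564 | - Pinyin abbreviations, e.g. sz-深圳
565 | - Grammatical errors, e.g. 想象难以-难以想象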
566 | 
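567 | Since this is only an experiment, error detection is left to pycorrector, and I use kenlm to fix the errors.
568 | The idea behind the simple detection step is roughly:
569 | > Error detection first segments the text with the jieba tokenizer. Because the sentence contains wrong characters, the segmentation often contains mis-segmented pieces, so errors are detected at both character and word granularity, and the suspected errors from both are merged into a candidate set of suspected error positions.
570 | 
571 | One advantage of fixing errors with kenlm is that a language model can be custom-trained on a large domain-specific corpus. The correction logic of this simple experiment is:
572 | ```
573 | Of the two characters, at least one matches, so the words look alike
574 | The pinyin first letters of the two words are the same
575 | ```
576 | So it only covers the pinyin-abbreviation kind of fix among the error types listed above.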
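The two rules can be sketched as a candidate filter. This is my own reading of them, not the repository's `find_best_word`: a candidate must have the same length as the wrong word, share at least one character in the same position, and have the same pinyin first letters; among the survivors the most frequent one wins. To keep the example self-contained, the first letters come from a tiny hand-written table (`INITIALS`, hypothetical) instead of `pypinyin`, and the frequencies are invented.

```python
# Hypothetical lookup of pinyin first letters; in practice pypinyin would supply these.
INITIALS = {'棉': 'm', '面': 'm', '奶': 'n', '机': 'j', '肌': 'j', '肤': 'f',
            '净': 'j', '精': 'j', '华': 'h', '化': 'h'}

def first_letters(word):
    return ''.join(INITIALS.get(ch, '?') for ch in word)

def is_candidate(wrong, cand):
    """Same length, at least one shared character in place, same pinyin first letters."""
    if len(wrong) != len(cand) or wrong == cand:
        return False
    shares_char = any(a == b for a, b in zip(wrong, cand))
    return shares_char and first_letters(wrong) == first_letters(cand)

def best_candidate(wrong, freqs):
    """Pick the most frequent candidate, or keep the word when nothing qualifies."""
    cands = [w for w in freqs if is_candidate(wrong, w)]
    return max(cands, key=freqs.get) if cands else wrong

# Invented domain frequencies standing in for the filtered n-grams.
freqs = {'面奶': 2, '肌肤': 3, '精华': 5, '净化': 1}

for wrong in ('棉奶', '机肤', '净华', '痘痘'):
    print(wrong, '->', best_candidate(wrong, freqs))
```

With these toy frequencies 净华 becomes 精华 rather than 净化, which shows how a domain corpus steers the choice.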
577 | 
578 | ## 6.1 The pypinyin module
579 | 
580 | The pinyin part relies on `pypinyin`, which converts Chinese characters to pinyin and supports a wide range of styles:
581 | ```
582 | from pypinyin import lazy_pinyin, Style
583 | # Usage of PyPinyin, the Python pinyin library
584 | # https://blog.csdn.net/devcloud/article/details/95066038
585 | 
586 | tts = ['BOPOMOFO', 'BOPOMOFO_FIRST', 'CYRILLIC', 'CYRILLIC_FIRST', 'FINALS', 'FINALS_TONE',
587 |  'FINALS_TONE2', 'FINALS_TONE3', 'FIRST_LETTER', 'INITIALS', 'NORMAL', 'TONE', 'TONE2', 'TONE3']
588 | for tt in tts:
589 |     print(tt, lazy_pinyin('聪明的小兔子吃', style=getattr(Style, tt)))  # look up the style by name
590 | 
591 | 
592 | ```
593 | 
594 | The output is:
595 | 
596 | ```python
597 | BOPOMOFO ['ㄘㄨㄥ', 'ㄇㄧㄥˊ', 'ㄉㄜ˙', 'ㄒㄧㄠˇ', 'ㄊㄨˋ', 'ㄗ˙', 'ㄔ']
598 | BOPOMOFO_FIRST ['ㄘ', 'ㄇ', 'ㄉ', 'ㄒ', 'ㄊ', 'ㄗ', 'ㄔ']
599 | CYRILLIC ['цун1', 'мин2', 'дэ', 'сяо3', 'ту4', 'цзы', 'чи1']
600 | CYRILLIC_FIRST ['ц', 'м', 'д', 'с', 'т', 'ц', 'ч']
601 | FINALS ['ong', 'ing', 'e', 'iao', 'u', 'i', 'i']
602 | FINALS_TONE ['ōng', 'íng', 'e', 'iǎo', 'ù', 'i', 'ī']
603 | FINALS_TONE2 ['o1ng', 'i2ng', 'e', 'ia3o', 'u4', 'i', 'i1']
604 | FINALS_TONE3 ['ong1', 'ing2', 'e', 'iao3', 'u4', 'i', 'i1']
605 | FIRST_LETTER ['c', 'm', 'd', 'x', 't', 'z', 'c']
606 | INITIALS ['c', 'm', 'd', 'x', 't', 'z', 'ch']
607 | NORMAL ['cong', 'ming', 'de', 'xiao', 'tu', 'zi', 'chi']
608 | TONE ['cōng', 'míng', 'de', 'xiǎo', 'tù', 'zi', 'chī']
609 | TONE2 ['co1ng', 'mi2ng', 'de', 'xia3o', 'tu4', 'zi', 'chi1']
610 | TONE3 ['cong1', 'ming2', 'de', 'xiao3', 'tu4', 'zi', 'chi1']
611 | ```
612 | 
613 | As you can see, each style yields a different pinyin form.
614 | 
615 | ## 6.2 The pycorrector module
616 | 
617 | pycorrector's `detect` returns information about the wrong characters:
618 | ```
619 | import pycorrector
620 | sentence = '这瓶洗棉奶用着狠不错'
621 | idx_errors = pycorrector.detect(sentence)
622 | >>> [['这瓶', 0, 2, 'word'], ['棉奶', 3, 5, 'word']]
623 | ```
624 | `correct` is the function that performs the actual correction:
625 | 
626 | ```python
627 | pycorrector.correct(sentence)
628 | ```
629 | 
630 | 
631 | ## 6.3 Comparing pycorrector and kenlm corrections
632 | 
633 | Let us compare pycorrector's built-in correction with the one from this experiment:
634 | 
635 | ```python
636 | import pycorrector
637 | sentence = '这瓶洗棉奶用着狠不错'
638 | idx_errors = pycorrector.detect(sentence)
639 | 
640 | correct = []
641 | for ide in idx_errors:
642 |     right_word = km.find_best_word(ide[0],ngrams_,freqs = 0)
643 |     if right_word != ide[0]:
644 |         correct.append([right_word] + ide)
645 | 
646 | print('错误：',idx_errors)  # 错误 = errors
647 | print('pycorrector的结果：',pycorrector.correct(sentence))  # pycorrector's result
648 | print('kenlm的结果：',correct)  # kenlm's result
649 | 
650 | > 错误： [['这瓶', 0, 2, 'word'], ['棉奶', 3, 5, 'word']]
651 | > pycorrector的结果： ('这瓶洗面奶用着狠不错', [['棉奶', '面奶', 3, 5]])
652 | > kenlm的结果： [['面奶', '棉奶', 3, 5, 'word']]
653 | ```
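The kenlm result above is a list of `[right_word, wrong_word, start, end, kind]` entries rather than a corrected sentence. A small helper can apply such entries to the original text; the function name `apply_corrections` is made up for this sketch, and it assumes the entry layout shown in the output above.

```python
def apply_corrections(sentence, corrections):
    """Apply [right, wrong, start, end, kind] entries to sentence."""
    chars = list(sentence)
    # Work from right to left so earlier positions stay valid after each replacement.
    for right, wrong, start, end, _kind in sorted(corrections, key=lambda c: c[2], reverse=True):
        if sentence[start:end] == wrong:  # skip entries that do not match the text
            chars[start:end] = list(right)
    return ''.join(chars)

print(apply_corrections('这瓶洗棉奶用着狠不错', [['面奶', '棉奶', 3, 5, 'word']]))
print(apply_corrections('绿茶净华可以舒缓痘痘机肤',
                        [['精华', '净华', 2, 4, 'word'], ['肌肤', '机肤', 10, 12, 'word']]))
```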
654 | 
655 | Other similar cases:
656 | 
657 | ```python
658 | sentence =  '少先队员因该给老人让坐'
659 | 
660 | > 错误： [['因该', 4, 6, 'word'], ['坐', 10, 11, 'char']]
661 | > pycorrector的结果： ('少先队员应该给老人让座', [['因该', '应该', 4, 6], ['坐', '座', 10, 11]])
662 | > kenlm的结果： [['应该', '因该', 4, 6, 'word']]
663 | ```
664 | 
665 | This exposes a weakness of my crude rules: they can only judge errors that span two or more characters.
666 | 
667 | Another one:
668 | 
669 | ```python
670 | sentence = '绿茶净华可以舒缓痘痘机肤'
671 | 
672 | > 错误： [['净华', 2, 4, 'word'], ['机肤', 10, 12, 'word']]
673 | > pycorrector的结果： ('绿茶净化可以舒缓痘痘肌肤', [['净华', '净化', 2, 4], ['机肤', '肌肤', 10, 12]])
674 | > kenlm的结果： [['精华', '净华', 2, 4, 'word'], ['肌肤', '机肤', 10, 12, 'word']]
675 | ```
676 | Because the model was trained on a corpus from this domain, it does somewhat better than pycorrector here.
677 | 
678 | 
679 | ----------
680 | 
681 | 
682 | # References
683 | 
684 | [1 Training language models with kenLM](https://blog.csdn.net/Nicholas_Wong/article/details/80013547)
685 | 
686 | [2 Using a kenlm model to detect a/an errors](https://zhuanlan.zhihu.com/p/39722203)
687 | 
688 | [3 Training and using kenlm language models](https://www.bbsmax.com/A/WpdKmENJVQ/)
689 | 
690 | [4 DRUNK2013/lm-ken](https://github.com/DRUNK2013/lm-ken)
691 | 
692 | [5 Rewriting the earlier new word discovery algorithm: faster and better new word discovery](https://spaces.ac.cn/archives/6920)
693 | 
694 | [6 Chinese Word Segmentation Series, part 5: Unsupervised segmentation based on language models](https://spaces.ac.cn/archives/3956#%E5%AE%9E%E8%B7%B5%EF%BC%9A%E8%AE%AD%E7%BB%83)
695 | 
696 | [7 Natural language processing | (13) Building and applying kenLM statistical language models](https://blog.csdn.net/sdu_hao/article/details/87101741)
697 | 
698 | 
699 | 
700 | 


--------------------------------------------------------------------------------
/kenlm_model.py:
--------------------------------------------------------------------------------
  1 | import jieba
  2 | from collections import Counter
  3 | from tqdm import tqdm
  4 | import re,glob,os
  5 | from math import log10
  6 | import struct
  7 | import math
  8 | import kenlm
  9 | from collections import Counter
 10 | import jieba.posseg as pseg
 11 | 
 12 | class kenlm_model():
 13 |     def __init__(self,save_path = 'output',project = 'test2',\
 14 |                  memory = '50%',min_count = 5,order = 4,\
 15 |                  skip_symbols = '"<unk>"',kenlm_model_path = './kenlm/build/bin/'):
 16 |         self.memory = memory    # memory reserved for the kenlm tools
 17 |         self.min_count = min_count # minimum frequency for an n-gram to be kept
 18 |         self.order = order    # the n-gram order
 19 |         self.kenlm_model_path = kenlm_model_path 
 20 |         # path to the kenlm executables, including count_ngrams, lmplz, etc.
 21 |         self.corpus_file = save_path + '/%s.corpus'%project # corpus file name
 22 |         self.vocab_file = save_path + '/%s.chars'%project # character-set file name
 23 |         self.ngram_file = save_path + '/%s.ngrams'%project # n-gram file name
 24 |         self.output_file = save_path + '/%s.vocab'%project # file name of the final exported vocabulary
 25 |         self.arpa_file = save_path + '/%s.arpa'%project # language model file name (arpa)
 26 |         self.klm_file = save_path + '/%s.klm'%project# binary language model file name (klm; .bin also works)
 27 |         self.skip_symbols = skip_symbols
 28 |         # used by lm_train: Treat <s>, </s>, and <unk> as whitespace instead of throwing an exception
 29 |         # The transition probabilities here were summarized by hand; overall, the goal is to make long words less likely.
 30 |         trans = {'bb':1, 'bc':0.15, 'cb':1, 'cd':0.01, 'db':1, 'de':0.01, 'eb':1, 'ee':0.001}
 31 |         self.trans = {i:log10(j) for i,j in trans.items()}
 32 |         self.model = None
 33 | 
 34 |         
 35 |     def load_model(self,model_path ):
 36 |         return kenlm.Model(model_path)
 37 |     
 38 |     
 39 |     # Corpus generator that also does light preprocessing
 40 |     @staticmethod
 41 |     def text_generator(texts,jieba_cut = False ):
 42 |         '''
 43 |         Input:
 44 |             texts, a list of strings
 45 |         Output:
 46 |             ['你\n', '是\n', '谁\n']
 47 |         Where:
 48 |             jieba_cut: whether to tokenize with jieba (otherwise split into characters)
 49 |         
 50 |         '''
 51 |         for text in texts:
 52 |             text = re.sub(u'[^\u4e00-\u9fa50-9a-zA-Z ]+', '\n', text)
 53 |             if jieba_cut:
 54 |                 yield ' '.join(list(jieba.cut(text))) + '\n'
 55 |             else:
 56 |                 yield ' '.join(text) + '\n'
 57 |     
 58 |     @staticmethod
 59 |     def write_corpus(texts, filename):
 60 |         """Write the corpus to a file, with words (or characters) separated by spaces
 61 |         """
 62 |         with open(filename, 'w', encoding='utf-8') as f:
 63 |             for s in texts:
 64 |                 #s = ' '.join(s) + '\n'
 65 |                 f.write(s)
 66 |         print('success writed')
 67 |     
 68 |     
 69 |     def count_ngrams(self):
 70 |         """
 71 |         Call kenlm's count_ngrams via os.system to count n-gram frequencies
 72 |         
 73 |         # Counts n-grams from standard input.
 74 |         # corpus count:
 75 |         #   -h [ --help ]                     Show this help message
 76 |         #   -o [ --order ] arg                Order
 77 |         #   -T [ --temp_prefix ] arg (=/tmp/) Temporary file prefix
 78 |         #   -S [ --memory ] arg (=80%)        RAM
 79 |         #   --read_vocab_table arg            Vocabulary hash table to read.  This should
 80 |         #                                     be a probing hash table with size at the 
 81 |         #                                     beginning.
 82 |         #   --write_vocab_list arg            Vocabulary list to write as null-delimited 
 83 |         #                                     strings.
 84 |         """
 85 |         #corpus_file,vocab_file,ngram_file,memory = '50%',order = 4
 86 |         executive_code = self.kenlm_model_path + 'count_ngrams -S %s -o %s --write_vocab_list %s <%s >%s'%(self.memory,self.order, self.vocab_file, self.corpus_file, self.ngram_file)
 87 |         status = os.system(executive_code)
 88 |         if status == 0:
 89 |             return 'success,code is : %s , \n code is : %s '%(status,executive_code)
 90 |         else:
 91 |             return 'fail,code is : %s ,\n code is : %s '%(status,executive_code)
 92 |     
 93 |     
 94 |     def lm_train(self):
 95 |         '''
 96 |         # Training data format 1: save as all.txt.parse, then it can be trained on directly
 97 |         # Source: https://github.com/DRUNK2013/lm-ken
 98 |         
 99 |         Training:
100 |             input  : self.corpus_file, the corpus file
101 |             output : self.arpa_file, the language model file
102 |         
103 |         Errors:
104 |         34304 : the corpus is too small, add more samples
105 |         
106 |         '''
107 |         #corpus_file,arpa_file,memory = '50%',order = 4,skip_symbols = '"<unk>"'
108 |         executive_code = self.kenlm_model_path + 'lmplz -S {} -o {} --skip_symbols {} < {} > {} '.format(self.memory,self.order,self.skip_symbols,self.corpus_file,self.arpa_file)
109 |         status = os.system(
110 |                     executive_code
111 |                      )
112 |         if status == 0:
113 |             return 'success,code is : %s , \n code is : %s '%(status,executive_code)
114 |         else:
115 |             return 'fail,code is : %s ,\n code is : %s '%(status,executive_code)
116 |     
117 |     def convert_format(self):
118 |         '''
119 |         # Compress the model
120 |         # From Su: https://spaces.ac.cn/archives/3956#%E5%AE%9E%E8%B7%B5%EF%BC%9A%E8%AE%AD%E7%BB%83
121 |         
122 |         ```
123 |         ./kenlm/bin/build_binary weixin.arpa weixin.klm
124 |         ```
125 |         
126 |         arpa is the common language model format; klm is kenlm's own binary format and takes less space.
127 |         
128 |         Errors:
129 |         256 : No such file or directory while opening output/test2.arpa
130 |     
131 |         '''
132 |         #arpa_file,klm_file,memory = '50%'
133 |         executive_code = self.kenlm_model_path + 'build_binary -S {} -s {} {}'.format(self.memory,self.arpa_file,self.klm_file)
134 |         status = os.system(
135 |                     executive_code
136 |                      )
137 |         if status == 0:
138 |             return 'success,code is : %s , \n code is : %s '%(status,executive_code)
139 |         else:
140 |             return 'fail,code is : %s ,\n code is : %s '%(status,executive_code)
141 |     
142 |             
143 |     '''
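144 |         Word segmentation module
145 |         Mainly adapted from Su: Chinese Word Segmentation Series, part 5: Unsupervised segmentation based on language models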
146 |     '''
147 |     def parse_text(self,text):
148 |         return ' '.join(list(text))
149 |     
150 | 
151 |     def viterbi(self,nodes):
152 |         '''  # Segmentation
153 |         # The transition probabilities here were summarized by hand; overall, the goal is to make long words less likely.
154 |         #trans = {'bb':1, 'bc':0.15, 'cb':1, 'cd':0.01, 'db':1, 'de':0.01, 'eb':1, 'ee':0.001}
155 |         #trans = {i:log10(j) for i,j in trans.items()}
156 |         
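157 |         Su's kenlm segmentation tags:
158 |             b: a single-character word, or the first character of a multi-character word
159 |             c: the second character of a multi-character word
160 |             d: the third character of a multi-character word
161 |             e: the remaining characters of a multi-character word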
162 |         '''
163 |         # Python 3 version
164 |         paths = nodes[0]
165 |         for l in range(1, len(nodes)):
166 |             paths_ = paths
167 |             paths = {}
168 |             for i in nodes[l]:
169 |                 nows = {}
170 |                 for j in paths_:
171 |                     if j[-1]+i in self.trans:
172 |                         nows[j+i]= paths_[j]+nodes[l][i]+self.trans[j[-1]+i]
173 |                 #k = nows.values().index(max(nows.values()))
174 |                 k = max(nows, key=nows.get)
175 |                 #paths[nows.keys()[k]] = nows.values()[k]
176 |                 paths[k] = nows[k]
177 |         #return paths.keys()[paths.values().index(max(paths.values()))]
178 |         return max(paths, key=paths.get)
179 |     
180 |     def cp(self,s):
181 |         if self.model is None:
182 |             raise KeyError('please load model(.klm / .arpa).')
183 |         return (self.model.score(' '.join(s), bos=False, eos=False) - self.model.score(' '.join(s[:-1]), bos=False, eos=False)) or -100.0
184 |     
185 |     def mycut(self,s):
186 |         nodes = [{'b':self.cp(s[i]), 'c':self.cp(s[i-1:i+1]), 'd':self.cp(s[i-2:i+1]),\
187 |                   'e':self.cp(s[i-3:i+1])} for i in range(len(s))]
188 |         tags = self.viterbi(nodes)
189 |         words = [s[0]]
190 |         for i in range(1, len(s)):
191 |             if tags[i] == 'b':
192 |                 words.append(s[i])
193 |             else:
194 |                 words[-1] += s[i]
195 |         return words
196 |     
197 |     
198 |     '''
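199 |         kenlm n-gram counting module + new word discovery
200 |         Mainly adapted from Su: Rewriting the earlier new word discovery algorithm: faster and better new word discovery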
201 |     '''
202 |     
203 |     def unpack(self,t, s):
204 |         return struct.unpack(t, s)[0]
205 |     
206 |     def read_ngrams(self):
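207 |         """Read the n-gram counts; the parsing approach follows https://github.com/kpu/kenlm/issues/201
208 |         """
209 |         # Load the vocabulary (NUL-separated tokens, indexed by token id)
210 |         f = open(self.vocab_file, encoding='utf-8')
211 |         chars = f.read()
212 |         f.close()
213 |         chars = chars.split('\x00')
214 |         chars = [i for i in chars] # no .decode('utf-8') needed in Python 3
215 |         # Load the fixed-width binary n-gram count records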
216 |         ngrams = [Counter({}) for _ in range(self.order)]
217 |         total = 0
218 |         size_per_item = self.order * 4 + 8  # per record: `order` 4-byte token ids + one 8-byte count
219 |         f = open(self.ngram_file, 'rb')
220 |         filedata = f.read()
221 |         filesize = f.tell()
222 |         f.close()
223 |         for i in range(0, filesize, size_per_item):
224 |             s = filedata[i: i+size_per_item]
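225 |             n = self.unpack('q', s[-8:])  # 8-byte count; native 'l' is only 4 bytes on some platforms (e.g. Windows)
226 |             if n >= self.min_count:
227 |                 total += n
228 |                 c = [self.unpack('i', s[j*4: (j+1)*4]) for j in range(self.order)]
229 |                 c = ''.join([chars[j] for j in c if j > 2])  # skip ids 0-2 (presumably kenlm's special tokens <unk>, <s>, </s>)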
230 |                 for j in range(self.order):# len(c) -> self.order
231 |                     ngrams[j][c[:j+1]] = ngrams[j].get(c[:j+1], 0) + n
232 |         return ngrams,total
233 |     
234 |     
235 |     def filter_ngrams(self,ngrams, total, min_pmi=1):
236 |         """Filter the n-grams by mutual information, keeping only the 'solid' (cohesive) ones.
237 |         """
238 |         order = len(ngrams)
239 |         if hasattr(min_pmi, '__iter__'):
240 |             min_pmi = list(min_pmi)
241 |         else:
242 |             min_pmi = [min_pmi] * order
243 |         #output_ngrams = set()
244 |         output_ngrams = Counter()
245 |         total = float(total)
246 |         for i in range(order-1, 0, -1):
247 |             for w, v in ngrams[i].items():
248 |                 pmi = min([
249 |                     total * v / (ngrams[j].get(w[:j+1], total) * ngrams[i-j-1].get(w[j+1:], total))
250 |                     for j in range(i)
251 |                 ])
252 |                 if math.log(pmi) >= min_pmi[i]:
253 |                     #output_ngrams.add(w)
254 |                     output_ngrams[w] = v
255 |         return output_ngrams
256 |     
257 |     '''
258 |         Spelling-correction module
259 |             Mainly works together with pycorrector
260 |     '''
261 |     def is_Chinese(self,word):
262 |         for ch in word:
263 |             if '\u4e00' <= ch <= '\u9fff':
264 |                 return True
265 |         return False
266 |     
267 |     def word_match(self,text_a,text_b): 
268 |         '''
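269 |         Filtering rules:
270 |             # same number of characters
271 |             # non-empty
272 |             # identical pinyin initials at every position
273 |             # both contain Chinese and share at least one identical character
274 |         Returns:
275 |             whether the two strings are similar, bool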
276 |         '''
277 |     
278 |         pinyin_n,match_w = 0,[]
279 |         text_a_pinyin = lazy_pinyin(text_a, style=Style.FIRST_LETTER) 
280 |         text_b_pinyin = lazy_pinyin(text_b, style=Style.FIRST_LETTER) 
281 |         #print(text_a_pinyin,text_b_pinyin)
282 |         if len(text_a) > 0 and (len(text_b)  == len(text_a) ) and self.is_Chinese(text_a) and self.is_Chinese(text_b):
283 |             for n,w1 in enumerate(text_a):
284 |                 if text_b[n] == w1:
285 |                     match_w.append(w1)
286 |                 if text_a_pinyin[n] == text_b_pinyin[n]:
287 |                     pinyin_n += 1
288 |             return len(match_w) > 0 and pinyin_n == len(text_a)
289 |         else:
290 |             return False
291 |         
292 |     def compare(self,text_a,text_b): 
293 |         '''
294 |         Compare two texts with kenlm:
295 |             score(text_a) - score(text_b) > 0 means text_a is the better one
296 |         '''
297 |         return self.model.score(' '.join(text_a), bos=False, eos=False) - self.model.score(' '.join(text_b), bos=False, eos=False)
298 |     
299 |     def find_best_word(self,word,ngrams,freqs = 10):
300 |         '''
301 |         Use kenlm to find a word that fits better than word
302 |         
303 |         Args:
304 |             word,str
305 |             ngrams,dict, a {word:freq} dictionary
306 |             
307 |         Returns:
308 |             the best replacement for word (word itself if there is no candidate)
309 |         '''
310 |         candidate = {bg:freq for bg,freq in ngrams.items() if self.word_match(word,bg) and (freq > freqs) }
311 |         #if len(candidate) == 0:
312 |         #    raise KeyError('zero candidate,large freqs')
313 |         candidate_score = {k:self.compare(k,word) for k,v in candidate.items()}
314 |         if len(candidate_score) > 0:
315 |             return max(candidate_score, key=candidate_score.get)
316 |         else:
317 |             return word
318 | 
319 |     def word_discovery(self,ngrams_dict,\
320 |                        good_pos = ['n','v','ag','a','zg','d'],\
321 |                        bad_words = ['我','你','他','也','的','是','它','再','了','让'] ):
322 |         '''
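323 |         New-word filtering
324 |         Rules:
325 |             - jieba does not keep it as a single token
326 |             - its POS tags contain a good tag, and its first character is not in bad_words
327 |     
328 |         jieba POS tag table: https://blog.csdn.net/orangefly0214/article/details/81391539
329 |     
330 |         Bad POS tags (listed for reference, not checked by the code):
331 |             uj,ur: particles
332 |             l: pronouns
333 |     
334 |         Good POS tags:
335 |             n,v,ag,a,zg,d (adverb)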
336 |         '''
337 |         new_words_2 = {}
338 |         
339 |         for nw,freq in tqdm(ngrams_dict.items()):
340 |             jieba_nw = list(jieba.cut(nw))
341 |             words = list(pseg.cut(nw))
342 |             pos = [list(wor)[1] for wor in words]
343 |             if len(jieba_nw) != 1 and any(gp in ''.join(pos) for gp in good_pos) and not any(bw in nw[0] for bw in bad_words):
344 |                 new_words_2[nw] = freq
345 |                 #print(list(words))
346 |         return new_words_2
347 | 
348 | 
349 | if __name__ == "__main__":
350 | 
351 |     '''
352 |         Model training and file generation
353 |         
354 |         Note: save_path is the directory where the generated files are stored
355 |     '''
356 |     # Set up the model wrapper
357 |     km = kenlm_model(save_path = 'output',project = 'test2',\
358 |                      memory = '50%',min_count = 2,order = 4,\
359 |                      skip_symbols = '"<unk>"',kenlm_model_path = './kenlm/build/bin/')
360 |     
361 |     # Build the training corpus file for the language model
362 |     text_list = ['Whoo 后 拱辰享水 水妍护肤套装整套质地都比较清爽，滋润，侧重保湿，适合各种肤质，调节肌肤水平衡，它还具有修复功效，提亮肤色我是油性肤质用起来也一点也不觉得油腻，味道淡淡的还很好闻，也很好吸收，质地清爽，上脸没有油腻感，就算是炎炎夏日用起来也相当舒服它的颜值很高的，粉粉的超级少女，满足了老夫的少女心保湿指数⭐️⭐️⭐️⭐️⭐️推荐指数⭐️⭐️⭐️⭐️⭐️如果你是油性肤质，选择水妍系列，再也不会担心每天到了下午顶着一个大油饼子脸了，油性、混合性妹子完全不用担心踩雷性价比超高，这一套满足了日常护肤的基本要求(Whoo洁面膏)',
363 |      '盘点不怕被税的海淘网站❗️海淘向来便宜又保真，比旗舰店、专柜和代购好太多！还能体验海淘乐趣~外网需要双币信用卡，往往需要转运，北上地区容易被税。安利大家几家包税/有税补的海淘商家[赞R]\xa01⃣️beautylish?全中文页面支付宝/微信/银联100%免关税75美元全球包邮相对其他英淘，小众产品比较齐全，RCMA的散粉是无限回购好物，买过他们家福袋感觉蛮超值的。据说去年双十二爆出海关扣了一批货，客服装死。2⃣️Chemistforless?全中文界面支付宝/微信现金税补现金包税不限时免邮活动邮费超重部分按6.5澳/kg，比同类型网站的8澳便宜很多，同产品比较价格相对便宜。购买过三次，最快的一次的当天就出单号了，一周到国内，推荐它家孕妇DHA和蜂蜜面膜，我囤的比较多\xa03⃣️lookfantastic?全中文界面支付宝/微信/银联20英镑税补50英镑包邮网站的每月礼盒是我的最爱，送人必备，便宜又有逼格~只要你看好分单，金额控制在600以下不容易被税，所有网站都要分箱。很重要❗️chemistforless会帮你自动分箱，其他的网站需要自己分单下。\xa04⃣️perfume’s club?全中文界面支付宝/银联/微信20欧税补60欧包邮西班牙美妆网站~从西班牙/香港出货，买的几次都是西班牙直邮~网站品牌齐全，折扣较多，不着急的宝贝还是可以下单的。不过他家被税的几率有点大\xa05⃣️feelunique?全中文界面支付宝/银联/微信20英镑税补60英镑包邮搜索关键词不是很好，要点品牌看，品牌多样，下单时候有小样选择~物流应该走的阳光清关线路，西班牙发出，境外物流查不到，进了国内才能看到物流，FU补货速度一言难尽6⃣️wruru?全中文界面支付宝/微信现金包税1000免邮首家俄罗斯网站，物流走国际申通，比较慢。但俄罗斯的美妆向来很优惠~做活动的时候会送小样，品牌齐全，娇韵诗价格非常nice，暗戳戳觉得这家黑五娇韵诗一定会折上折，可以趁机囤一波双萃~大家口已一起上车～\xa0以上网站提及税补，小宝贝们被税的时候一定要保管好被税凭证，不然税补很困难~被税时不要紧张，金额不是很多老老实实交税吧，包裹退回的运费需自己承担，一来一去等于啥都没买还倒贴运费一般网站的微博和小红书也有活动~想买但没有优惠可以悄咪咪去微博小红书摸一摸优惠券哦[赞R](娇韵诗双萃,娇韵诗面膜)',
364 |      '学生用什么洗面奶好？学生党必备的这六款性价比最高的洗面奶是什么？你用过吗？DreamtimesM2梦幻洁面乳参考价格￥67Dreamtimes是一款专注年轻人护肤的品牌。他们家的东西都适合年轻的肌肤，所以这个品牌也是广大学生党的挚爱。这款洁面乳算是一款爆款产品了，里面含有洁肤因子，有效控油洁肤，是一款复合型洁面乳，对痘痘肌肤和敏感肌都是非常适合的，是一款性价比很高的洁面。妮维雅慕斯洗面奶参考价格￥35妮维雅家的新产品，一款氨基酸慕斯洁面。绵软的泡沫和奶油的质地让让欲罢不能。有效深入肌肤底层清洁肌肤，同时氨基酸配方，低刺激，温和洁面。如果你是敏感和易过敏肌肤，那这款就是你的最佳选择了。悦诗风吟绿茶洗面奶参考价格￥55这款也是很适合学生党的一款平价大碗洗面奶，里面蕴含绿茶精华，有效舒缓痘痘肌肤和改善油性肌肤等问题，绿茶精华可以舒缓痘痘肌肤，缓解红肿、过敏等症状。泡沫丰富细腻，改善出油状况的同时可以缓解痘痘肌肤。(妮维雅洗面奶,悦诗风吟洗面奶氨基酸)',
365 |      '国货大宝。启初。……使用分享修复：玉泽或至本（第三代玉泽身体乳没有了麦冬根和神经酰胺）。芦荟胶（含酒精，不锁水，偶尔敷一下，皮肤会越用越干）。Swisse蜂蜜面膜（清洁鼻子，效果肉眼可见，不能常用）。大宝sod别用在脸上，会变黑，脸疼，当护手霜，以后不买了。迷奇套装（含很多香精，可能只适合油性皮肤），香精可能会导致敏感。启初面霜（平价珂润，但珂润有神经酰胺），但好像没什么效果，油痘皮最好不要用，以后不要买了。(玉泽身体乳,珂润身体乳,珂润护手霜,玉泽面霜)',
366 |      '资生堂悦薇乳液，会回购。夏天用略油腻，冬天用刚好。真的有紧致感，28岁，眼部有笑纹，其他地方还可以。这是第二个空瓶。冬天会回购。没有美白效果。(资生堂悦薇)']
367 |     
368 |     km.write_corpus(km.text_generator(text_list,jieba_cut = False), km.corpus_file) # dump the corpus to a text file
369 |     
370 |     # Train the language model
371 |     status = km.lm_train()
372 |     print(status)
373 |     
374 |     # Convert the language model's .arpa file
375 |     km.convert_format()
376 |     
377 |     
378 |     '''
379 |         New word discovery
380 |     '''
381 |     # Generate the n-gram counts
382 |     km.count_ngrams()
383 |     
384 |     # Read the n-grams and filter them
385 |     ngrams,total = km.read_ngrams()
386 |     ngrams_2 = km.filter_ngrams(ngrams, total, min_pmi=[0, 1, 3, 5])
387 |     
388 |     # New word discovery
389 |     km.word_discovery(ngrams_2)
390 |     
391 |     
392 |     '''
393 |         Spelling correction
394 |     '''
395 |     # Load the model
396 |     km.model = km.load_model(km.klm_file)
397 |     
398 |     # Read the n-grams
399 |     ngrams,total = km.read_ngrams()
400 |     ngrams_2 = km.filter_ngrams(ngrams, total, min_pmi=[0, 1, 3, 5])
401 |     
402 |     # Correct the sentence
403 |     import pycorrector
404 |     from pypinyin import lazy_pinyin, Style
405 |     sentence = '这瓶洗棉奶用着狠不错'
406 |     idx_errors = pycorrector.detect(sentence)
407 |     
408 |     correct = []
409 |     for ide in idx_errors:
410 |         right_word = km.find_best_word(ide[0],ngrams_2,freqs = 0)
411 |         if right_word != ide[0]:
412 |             correct.append([right_word] + list(ide))
413 |     
414 |     print('Errors:',idx_errors)
415 |     print('pycorrector result:',pycorrector.correct(sentence))
416 |     print('kenlm result:',correct)
417 | 


--------------------------------------------------------------------------------
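
A note on `filter_ngrams` in kenlm_model.py: for each n-gram it tries every split point, computes `total * count(w) / (count(left) * count(right))`, takes the smallest ratio, and keeps the n-gram only if the log of that ratio reaches the threshold for its order. The sketch below is a standalone copy of that logic run on toy data. The counts, `toy_ngrams` and `toy_total` are made up for illustration and do not come from a real kenlm run.

```python
import math
from collections import Counter


def filter_ngrams(ngrams, total, min_pmi=1):
    """Standalone copy of the method's logic, for illustration only."""
    order = len(ngrams)
    if hasattr(min_pmi, '__iter__'):
        thresholds = list(min_pmi)
    else:
        thresholds = [min_pmi] * order
    kept = Counter()
    total = float(total)
    # Walk from the longest n-grams down to bigrams; unigrams are never tested.
    for i in range(order - 1, 0, -1):
        for w, v in ngrams[i].items():
            # The weakest split point decides whether the n-gram is cohesive.
            ratio = min(
                total * v / (ngrams[j].get(w[:j + 1], total)
                             * ngrams[i - j - 1].get(w[j + 1:], total))
                for j in range(i)
            )
            if math.log(ratio) >= thresholds[i]:
                kept[w] = v
    return kept


# Hypothetical counts: index 0 holds unigrams, index 1 holds bigrams.
toy_ngrams = [
    Counter({'面': 20, '膜': 20, '的': 100, '好': 50}),
    Counter({'面膜': 18, '的好': 5}),
]
toy_total = 1000

result = filter_ngrams(toy_ngrams, toy_total, min_pmi=[0, 1])
print(result)  # Counter({'面膜': 18})
```

Here `面膜` scores log(1000 * 18 / (20 * 20)) = log(45), about 3.8, and is kept. `的好` scores log(1000 * 5 / (100 * 50)) = log(1) = 0, below the bigram threshold of 1, so it is dropped. Raising the per-order thresholds, as the `min_pmi=[0, 1, 3, 5]` call in the main block does, demands more cohesion from longer n-grams.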