├── README.md
├── data
└── stopWord.json
├── douban_NLP_envs.ipynb
└── image
├── example.png
└── 数据样式.png
/README.md:
--------------------------------------------------------------------------------
1 | # Douban Movie Rating Prediction
2 |
3 | This project builds a rating-prediction system from a dataset of user reviews and ratings collected from the Douban movie platform.
4 | The pipeline is:
5 |
6 | - Chinese word segmentation
7 | - Text preprocessing (removing stop words, low-frequency words, etc.)
8 | - Text feature extraction (three feature designs: tf-idf, word2vec, and BERT embeddings)
9 | - Model building and training (two simple machine-learning classifiers: logistic regression and naive Bayes)
10 |
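The steps above can be sketched end to end as follows. This is a minimal sketch with hypothetical toy data, not the project's actual code: the notebook segments the DMSC reviews with jieba, while here the toy comments are already segmented and whitespace-joined.

```python
# Minimal sketch of the pipeline: tf-idf features + logistic regression.
# The toy comments are hypothetical stand-ins for segmented DMSC reviews.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

comments = [
    "剧情 紧凑 特效 震撼 好看",   # positive
    "演员 演技 出色 值得 一看",   # positive
    "剧情 拖沓 无聊 难看",        # negative
    "特效 五毛 浪费 时间 难看",   # negative
]
labels = [1, 1, 0, 0]  # 1 = positive (stars 3-5), 0 = negative (stars 1-2)

# token_pattern=r"(?u)\S+" keeps single-character tokens; the default
# pattern requires two word characters, which would drop many Chinese words.
vec = TfidfVectorizer(token_pattern=r"(?u)\S+")
X = vec.fit_transform(comments)

clf = LogisticRegression().fit(X, labels)
pred = clf.predict(vec.transform(["特效 好看"]))
print(pred[0])
```

The same vectorizer must be reused (via `transform`, not `fit_transform`) on any held-out data so train and test share one vocabulary.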
11 | ## 1. Prepare the data
12 | First download the [Douban movie review dataset DMSC](https://pan.baidu.com/s/1pIBLEeiv5ychGmj2keNLSQ), extraction code `l9xj`.
13 |
14 | The experiments also compare against word vectors pretrained on Zhihu, which can be downloaded [here](https://pan.baidu.com/s/1kaa8EpMrN5r1Gc-ALOVI1w), extraction code `yjic`.
15 |
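Once the pretrained vectors are in place, a comment is embedded by averaging the vectors of its in-vocabulary words. A minimal sketch, using a small hypothetical in-memory vocabulary with 4-dimensional vectors in place of the full 300-dimensional `sgns.zhihu.word` file:

```python
import numpy as np

# Hypothetical toy vectors standing in for the pretrained Zhihu
# embeddings, which the notebook loads with gensim's KeyedVectors.
word_vectors = {
    "剧情": np.array([0.1, 0.2, 0.3, 0.4]),
    "好看": np.array([0.5, 0.1, 0.0, 0.2]),
}

def sentence_vector(words, vectors, dim=4):
    """Average the vectors of in-vocabulary words; zeros if none match."""
    vec = np.zeros(dim)
    n = 0
    for w in words:
        if w in vectors:
            vec += vectors[w]
            n += 1
    return vec / n if n else vec  # guard: no in-vocabulary words

print(sentence_vector(["剧情", "好看", "未登录词"], word_vectors))
```

The out-of-vocabulary guard matters: without it, a comment with no known words produces a division by zero and NaN features.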
16 | After downloading, place both files in the project's data folder.
17 |
18 | Once loaded, the data looks like this:
19 | 
20 |
21 | ## 2. Run `douban_NLP_envs.ipynb` cell by cell
22 | An example run:
23 | 
--------------------------------------------------------------------------------
/data/stopWord.json:
--------------------------------------------------------------------------------
1 | $
2 | 0
3 | 1
4 | 2
5 | 3
6 | 4
7 | 5
8 | 6
9 | 7
10 | 8
11 | 9
12 | ?
13 | _
14 | “
15 | ”
16 | 、
17 | 。
18 | 《
19 | 》
20 | 一
21 | 一些
22 | 一何
23 | 一切
24 | 一则
25 | 一方面
26 | 一旦
27 | 一来
28 | 一样
29 | 一般
30 | 一转眼
31 | 万一
32 | 上
33 | 上下
34 | 下
35 | 不
36 | 不仅
37 | 不但
38 | 不光
39 | 不单
40 | 不只
41 | 不外乎
42 | 不如
43 | 不妨
44 | 不尽
45 | 不尽然
46 | 不得
47 | 不怕
48 | 不惟
49 | 不成
50 | 不拘
51 | 不料
52 | 不是
53 | 不比
54 | 不然
55 | 不特
56 | 不独
57 | 不管
58 | 不至于
59 | 不若
60 | 不论
61 | 不过
62 | 不问
63 | 与
64 | 与其
65 | 与其说
66 | 与否
67 | 与此同时
68 | 且
69 | 且不说
70 | 且说
71 | 两者
72 | 个
73 | 个别
74 | 临
75 | 为
76 | 为了
77 | 为什么
78 | 为何
79 | 为止
80 | 为此
81 | 为着
82 | 乃
83 | 乃至
84 | 乃至于
85 | 么
86 | 之
87 | 之一
88 | 之所以
89 | 之类
90 | 乌乎
91 | 乎
92 | 乘
93 | 也
94 | 也好
95 | 也罢
96 | 了
97 | 二来
98 | 于
99 | 于是
100 | 于是乎
101 | 云云
102 | 云尔
103 | 些
104 | 亦
105 | 人
106 | 人们
107 | 人家
108 | 什么
109 | 什么样
110 | 今
111 | 介于
112 | 仍
113 | 仍旧
114 | 从
115 | 从此
116 | 从而
117 | 他
118 | 他人
119 | 他们
120 | 以
121 | 以上
122 | 以为
123 | 以便
124 | 以免
125 | 以及
126 | 以故
127 | 以期
128 | 以来
129 | 以至
130 | 以至于
131 | 以致
132 | 们
133 | 任
134 | 任何
135 | 任凭
136 | 似的
137 | 但
138 | 但凡
139 | 但是
140 | 何
141 | 何以
142 | 何况
143 | 何处
144 | 何时
145 | 余外
146 | 作为
147 | 你
148 | 你们
149 | 使
150 | 使得
151 | 例如
152 | 依
153 | 依据
154 | 依照
155 | 便于
156 | 俺
157 | 俺们
158 | 倘
159 | 倘使
160 | 倘或
161 | 倘然
162 | 倘若
163 | 借
164 | 假使
165 | 假如
166 | 假若
167 | 傥然
168 | 像
169 | 儿
170 | 先不先
171 | 光是
172 | 全体
173 | 全部
174 | 兮
175 | 关于
176 | 其
177 | 其一
178 | 其中
179 | 其二
180 | 其他
181 | 其余
182 | 其它
183 | 其次
184 | 具体地说
185 | 具体说来
186 | 兼之
187 | 内
188 | 再
189 | 再其次
190 | 再则
191 | 再有
192 | 再者
193 | 再者说
194 | 再说
195 | 冒
196 | 冲
197 | 况且
198 | 几
199 | 几时
200 | 凡
201 | 凡是
202 | 凭
203 | 凭借
204 | 出于
205 | 出来
206 | 分别
207 | 则
208 | 则甚
209 | 别
210 | 别人
211 | 别处
212 | 别是
213 | 别的
214 | 别管
215 | 别说
216 | 到
217 | 前后
218 | 前此
219 | 前者
220 | 加之
221 | 加以
222 | 即
223 | 即令
224 | 即使
225 | 即便
226 | 即如
227 | 即或
228 | 即若
229 | 却
230 | 去
231 | 又
232 | 又及
233 | 及
234 | 及其
235 | 及至
236 | 反之
237 | 反而
238 | 反过来
239 | 反过来说
240 | 受到
241 | 另
242 | 另一方面
243 | 另外
244 | 另悉
245 | 只
246 | 只当
247 | 只怕
248 | 只是
249 | 只有
250 | 只消
251 | 只要
252 | 只限
253 | 叫
254 | 叮咚
255 | 可
256 | 可以
257 | 可是
258 | 可见
259 | 各
260 | 各个
261 | 各位
262 | 各种
263 | 各自
264 | 同
265 | 同时
266 | 后
267 | 后者
268 | 向
269 | 向使
270 | 向着
271 | 吓
272 | 吗
273 | 否则
274 | 吧
275 | 吧哒
276 | 吱
277 | 呀
278 | 呃
279 | 呕
280 | 呗
281 | 呜
282 | 呜呼
283 | 呢
284 | 呵
285 | 呵呵
286 | 呸
287 | 呼哧
288 | 咋
289 | 和
290 | 咚
291 | 咦
292 | 咧
293 | 咱
294 | 咱们
295 | 咳
296 | 哇
297 | 哈
298 | 哈哈
299 | 哉
300 | 哎
301 | 哎呀
302 | 哎哟
303 | 哗
304 | 哟
305 | 哦
306 | 哩
307 | 哪
308 | 哪个
309 | 哪些
310 | 哪儿
311 | 哪天
312 | 哪年
313 | 哪怕
314 | 哪样
315 | 哪边
316 | 哪里
317 | 哼
318 | 哼唷
319 | 唉
320 | 唯有
321 | 啊
322 | 啐
323 | 啥
324 | 啦
325 | 啪达
326 | 啷当
327 | 喂
328 | 喏
329 | 喔唷
330 | 喽
331 | 嗡
332 | 嗡嗡
333 | 嗬
334 | 嗯
335 | 嗳
336 | 嘎
337 | 嘎登
338 | 嘘
339 | 嘛
340 | 嘻
341 | 嘿
342 | 嘿嘿
343 | 因
344 | 因为
345 | 因了
346 | 因此
347 | 因着
348 | 因而
349 | 固然
350 | 在
351 | 在下
352 | 在于
353 | 地
354 | 基于
355 | 处在
356 | 多
357 | 多么
358 | 多少
359 | 大
360 | 大家
361 | 她
362 | 她们
363 | 好
364 | 如
365 | 如上
366 | 如上所述
367 | 如下
368 | 如何
369 | 如其
370 | 如同
371 | 如是
372 | 如果
373 | 如此
374 | 如若
375 | 始而
376 | 孰料
377 | 孰知
378 | 宁
379 | 宁可
380 | 宁愿
381 | 宁肯
382 | 它
383 | 它们
384 | 对
385 | 对于
386 | 对待
387 | 对方
388 | 对比
389 | 将
390 | 小
391 | 尔
392 | 尔后
393 | 尔尔
394 | 尚且
395 | 就
396 | 就是
397 | 就是了
398 | 就是说
399 | 就算
400 | 就要
401 | 尽
402 | 尽管
403 | 尽管如此
404 | 岂但
405 | 己
406 | 已
407 | 已矣
408 | 巴
409 | 巴巴
410 | 并
411 | 并且
412 | 并非
413 | 庶乎
414 | 庶几
415 | 开外
416 | 开始
417 | 归
418 | 归齐
419 | 当
420 | 当地
421 | 当然
422 | 当着
423 | 彼
424 | 彼时
425 | 彼此
426 | 往
427 | 待
428 | 很
429 | 得
430 | 得了
431 | 怎
432 | 怎么
433 | 怎么办
434 | 怎么样
435 | 怎奈
436 | 怎样
437 | 总之
438 | 总的来看
439 | 总的来说
440 | 总的说来
441 | 总而言之
442 | 恰恰相反
443 | 您
444 | 惟其
445 | 慢说
446 | 我
447 | 我们
448 | 或
449 | 或则
450 | 或是
451 | 或曰
452 | 或者
453 | 截至
454 | 所
455 | 所以
456 | 所在
457 | 所幸
458 | 所有
459 | 才
460 | 才能
461 | 打
462 | 打从
463 | 把
464 | 抑或
465 | 拿
466 | 按
467 | 按照
468 | 换句话说
469 | 换言之
470 | 据
471 | 据此
472 | 接着
473 | 故
474 | 故此
475 | 故而
476 | 旁人
477 | 无
478 | 无宁
479 | 无论
480 | 既
481 | 既往
482 | 既是
483 | 既然
484 | 时候
485 | 是
486 | 是以
487 | 是的
488 | 曾
489 | 替
490 | 替代
491 | 最
492 | 有
493 | 有些
494 | 有关
495 | 有及
496 | 有时
497 | 有的
498 | 望
499 | 朝
500 | 朝着
501 | 本
502 | 本人
503 | 本地
504 | 本着
505 | 本身
506 | 来
507 | 来着
508 | 来自
509 | 来说
510 | 极了
511 | 果然
512 | 果真
513 | 某
514 | 某个
515 | 某些
516 | 某某
517 | 根据
518 | 欤
519 | 正值
520 | 正如
521 | 正巧
522 | 正是
523 | 此
524 | 此地
525 | 此处
526 | 此外
527 | 此时
528 | 此次
529 | 此间
530 | 毋宁
531 | 每
532 | 每当
533 | 比
534 | 比及
535 | 比如
536 | 比方
537 | 没奈何
538 | 沿
539 | 沿着
540 | 漫说
541 | 焉
542 | 然则
543 | 然后
544 | 然而
545 | 照
546 | 照着
547 | 犹且
548 | 犹自
549 | 甚且
550 | 甚么
551 | 甚或
552 | 甚而
553 | 甚至
554 | 甚至于
555 | 用
556 | 用来
557 | 由
558 | 由于
559 | 由是
560 | 由此
561 | 由此可见
562 | 的
563 | 的确
564 | 的话
565 | 直到
566 | 相对而言
567 | 省得
568 | 看
569 | 眨眼
570 | 着
571 | 着呢
572 | 矣
573 | 矣乎
574 | 矣哉
575 | 离
576 | 竟而
577 | 第
578 | 等
579 | 等到
580 | 等等
581 | 简言之
582 | 管
583 | 类如
584 | 紧接着
585 | 纵
586 | 纵令
587 | 纵使
588 | 纵然
589 | 经
590 | 经过
591 | 结果
592 | 给
593 | 继之
594 | 继后
595 | 继而
596 | 综上所述
597 | 罢了
598 | 者
599 | 而
600 | 而且
601 | 而况
602 | 而后
603 | 而外
604 | 而已
605 | 而是
606 | 而言
607 | 能
608 | 能否
609 | 腾
610 | 自
611 | 自个儿
612 | 自从
613 | 自各儿
614 | 自后
615 | 自家
616 | 自己
617 | 自打
618 | 自身
619 | 至
620 | 至于
621 | 至今
622 | 至若
623 | 致
624 | 般的
625 | 若
626 | 若夫
627 | 若是
628 | 若果
629 | 若非
630 | 莫不然
631 | 莫如
632 | 莫若
633 | 虽
634 | 虽则
635 | 虽然
636 | 虽说
637 | 被
638 | 要
639 | 要不
640 | 要不是
641 | 要不然
642 | 要么
643 | 要是
644 | 譬喻
645 | 譬如
646 | 让
647 | 许多
648 | 论
649 | 设使
650 | 设或
651 | 设若
652 | 诚如
653 | 诚然
654 | 该
655 | 说来
656 | 诸
657 | 诸位
658 | 诸如
659 | 谁
660 | 谁人
661 | 谁料
662 | 谁知
663 | 贼死
664 | 赖以
665 | 赶
666 | 起
667 | 起见
668 | 趁
669 | 趁着
670 | 越是
671 | 距
672 | 跟
673 | 较
674 | 较之
675 | 边
676 | 过
677 | 还
678 | 还是
679 | 还有
680 | 还要
681 | 这
682 | 这一来
683 | 这个
684 | 这么
685 | 这么些
686 | 这么样
687 | 这么点儿
688 | 这些
689 | 这会儿
690 | 这儿
691 | 这就是说
692 | 这时
693 | 这样
694 | 这次
695 | 这般
696 | 这边
697 | 这里
698 | 进而
699 | 连
700 | 连同
701 | 逐步
702 | 通过
703 | 遵循
704 | 遵照
705 | 那
706 | 那个
707 | 那么
708 | 那么些
709 | 那么样
710 | 那些
711 | 那会儿
712 | 那儿
713 | 那时
714 | 那样
715 | 那般
716 | 那边
717 | 那里
718 | 都
719 | 鄙人
720 | 鉴于
721 | 针对
722 | 阿
723 | 除
724 | 除了
725 | 除外
726 | 除开
727 | 除此之外
728 | 除非
729 | 随
730 | 随后
731 | 随时
732 | 随着
733 | 难道说
734 | 非但
735 | 非徒
736 | 非特
737 | 非独
738 | 靠
739 | 顺
740 | 顺着
741 | 首先
742 | !
743 | ,
744 | :
745 | ;
746 | ?
747 |
--------------------------------------------------------------------------------
/douban_NLP_envs.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "### Predicting Douban ratings\n",
8 | "\n",
9 | "In this project we predict a movie's rating from a review, which is essentially a classification problem: the input is a piece of text and the output is a score. We need to:\n",
10 | "- Preprocess the text: filter stop words, low-frequency words, special symbols, etc.\n",
11 | "- Convert the text into vectors, in three ways: tf-idf, word2vec, and BERT embeddings. \n",
12 | "- Train logistic regression and naive Bayes models, with cross-validation\n",
13 | "- Evaluate model accuracy\n"
14 | ]
15 | },
16 | {
17 | "cell_type": "code",
18 | "execution_count": 1,
19 | "metadata": {},
20 | "outputs": [],
21 | "source": [
22 | "# Basic data-handling packages\n",
23 | "import numpy as np\n",
24 | "import pandas as pd\n",
25 | "\n",
26 | "# Counter, for word-frequency counting\n",
27 | "from collections import Counter\n",
28 | "\n",
29 | "# tf-idf utilities\n",
30 | "from sklearn.feature_extraction.text import TfidfTransformer \n",
31 | "from sklearn.feature_extraction.text import CountVectorizer\n",
32 | "\n",
33 | "# Model evaluation\n",
34 | "from sklearn import metrics\n",
35 | "\n",
36 | "# word2vec support\n",
37 | "from gensim.models import KeyedVectors\n",
38 | "\n",
39 | "# BERT embeddings; see the lab manual for notes on installing mxnet\n",
40 | "from bert_embedding import BertEmbedding\n",
41 | "import mxnet\n",
42 | "\n",
43 | "# tqdm wraps an iterable with a progress bar so long runs can be monitored\n",
44 | "from tqdm import tqdm\n",
45 | "\n",
46 | "# Other utilities\n",
47 | "import requests\n",
48 | "import os\n"
49 | ]
50 | },
51 | {
52 | "cell_type": "markdown",
53 | "metadata": {},
54 | "source": [
55 | "### 1. Read the data and preprocess the text\n",
56 | "The following steps are needed:\n",
57 | "- Remove useless characters such as ! and & (the set can be customized)\n",
58 | "- Chinese word segmentation\n",
59 | "- Remove low-frequency words"
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": 2,
65 | "metadata": {},
66 | "outputs": [
67 | {
68 | "data": {
170 | "text/plain": [
171 | " ID Movie_Name_EN Movie_Name_CN Crawl_Date Number Username \\\n",
172 | "0 0 Avengers Age of Ultron 复仇者联盟2 2017-01-22 1 然潘 \n",
173 | "1 10 Avengers Age of Ultron 复仇者联盟2 2017-01-22 11 影志 \n",
174 | "2 20 Avengers Age of Ultron 复仇者联盟2 2017-01-22 21 随时流感 \n",
175 | "3 30 Avengers Age of Ultron 复仇者联盟2 2017-01-22 31 乌鸦火堂 \n",
176 | "4 40 Avengers Age of Ultron 复仇者联盟2 2017-01-22 41 办公室甜心 \n",
177 | "\n",
178 | " Date Star Comment Like \n",
179 | "0 2015-05-13 3 连奥创都知道整容要去韩国。 2404 \n",
180 | "1 2015-04-30 4 “一个没有黑暗面的人不值得信任。” 第二部剥去冗长的铺垫,开场即高潮、一直到结束,会有人觉... 381 \n",
181 | "2 2015-04-28 2 奥创弱爆了弱爆了弱爆了啊!!!!!! 120 \n",
182 | "3 2015-05-08 4 与第一集不同,承上启下,阴郁严肃,但也不会不好看啊,除非本来就不喜欢漫威电影。场面更加宏大... 30 \n",
183 | "4 2015-05-10 5 看毕,我激动地对友人说,等等奥创要来毁灭台北怎么办厚,她拍了拍我肩膀,没事,反正你买了两份... 16 "
184 | ]
185 | },
186 | "execution_count": 2,
187 | "metadata": {},
188 | "output_type": "execute_result"
189 | }
190 | ],
191 | "source": [
192 | "# Read the data\n",
193 | "data = pd.read_csv('data/DMSC.csv')\n",
194 | "# Inspect the format\n",
195 | "data.head()"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 3,
201 | "metadata": {},
202 | "outputs": [
203 | {
204 | "data": {
258 | "text/plain": [
259 | " Comment Star\n",
260 | "0 连奥创都知道整容要去韩国。 3\n",
261 | "1 “一个没有黑暗面的人不值得信任。” 第二部剥去冗长的铺垫,开场即高潮、一直到结束,会有人觉... 4\n",
262 | "2 奥创弱爆了弱爆了弱爆了啊!!!!!! 2\n",
263 | "3 与第一集不同,承上启下,阴郁严肃,但也不会不好看啊,除非本来就不喜欢漫威电影。场面更加宏大... 4\n",
264 | "4 看毕,我激动地对友人说,等等奥创要来毁灭台北怎么办厚,她拍了拍我肩膀,没事,反正你买了两份... 5"
265 | ]
266 | },
267 | "execution_count": 3,
268 | "metadata": {},
269 | "output_type": "execute_result"
270 | }
271 | ],
272 | "source": [
273 | "# Keep only the two columns we need: Comment and Star\n",
274 | "data = data[['Comment','Star']]\n",
275 | "# Inspect the new format\n",
276 | "data.head()"
277 | ]
278 | },
279 | {
280 | "cell_type": "code",
281 | "execution_count": 4,
282 | "metadata": {},
283 | "outputs": [
284 | {
285 | "data": {
339 | "text/plain": [
340 | " Comment Star\n",
341 | "0 连奥创都知道整容要去韩国。 1\n",
342 | "1 “一个没有黑暗面的人不值得信任。” 第二部剥去冗长的铺垫,开场即高潮、一直到结束,会有人觉... 1\n",
343 | "2 奥创弱爆了弱爆了弱爆了啊!!!!!! 0\n",
344 | "3 与第一集不同,承上启下,阴郁严肃,但也不会不好看啊,除非本来就不喜欢漫威电影。场面更加宏大... 1\n",
345 | "4 看毕,我激动地对友人说,等等奥创要来毁灭台北怎么办厚,她拍了拍我肩膀,没事,反正你买了两份... 1"
346 | ]
347 | },
348 | "execution_count": 4,
349 | "metadata": {},
350 | "output_type": "execute_result"
351 | }
352 | ],
353 | "source": [
354 | "# Star is the raw rating, but in this project we predict polarity instead: ratings 1 and 2 count as negative, ratings 3, 4 and 5 as positive\n",
355 | "data['Star']=(data.Star/3).astype(int)\n",
356 | "data.head()"
357 | ]
358 | },
359 | {
360 | "cell_type": "markdown",
361 | "metadata": {},
362 | "source": [
363 | "#### Task 1: remove useless characters"
364 | ]
365 | },
366 | {
367 | "cell_type": "code",
368 | "execution_count": 5,
369 | "metadata": {},
370 | "outputs": [
371 | {
372 | "name": "stderr",
373 | "output_type": "stream",
374 | "text": [
375 | "douban: 100%|██████████████████████████████████████████████████████████████| 212506/212506 [00:00<00:00, 562982.34it/s]\n"
376 | ]
377 | }
378 | ],
379 | "source": [
380 | "tqdm.pandas(desc='douban') # show progress\n",
381 | "data['Comment']=data['Comment'].progress_apply(lambda x:x.replace(',','').replace('。',''))"
382 | ]
383 | },
384 | {
385 | "cell_type": "code",
386 | "execution_count": 6,
387 | "metadata": {
388 | "scrolled": true
389 | },
390 | "outputs": [
391 | {
392 | "data": {
446 | "text/plain": [
447 | " Comment Star\n",
448 | "0 连奥创都知道整容要去韩国 1\n",
449 | "1 “一个没有黑暗面的人不值得信任” 第二部剥去冗长的铺垫开场即高潮、一直到结束会有人觉得只剩... 1\n",
450 | "2 奥创弱爆了弱爆了弱爆了啊!!!!!! 0\n",
451 | "3 与第一集不同承上启下阴郁严肃但也不会不好看啊除非本来就不喜欢漫威电影场面更加宏大单打与团战... 1\n",
452 | "4 看毕我激动地对友人说等等奥创要来毁灭台北怎么办厚她拍了拍我肩膀没事反正你买了两份旅行保险惹... 1"
453 | ]
454 | },
455 | "execution_count": 6,
456 | "metadata": {},
457 | "output_type": "execute_result"
458 | }
459 | ],
460 | "source": [
461 | "# Inspect the new format\n",
462 | "data.head()"
463 | ]
464 | },
465 | {
466 | "cell_type": "markdown",
467 | "metadata": {},
468 | "source": [
469 | "#### Task 2: segment the text with jieba"
470 | ]
471 | },
472 | {
473 | "cell_type": "code",
474 | "execution_count": 7,
475 | "metadata": {},
476 | "outputs": [
477 | {
478 | "name": "stderr",
479 | "output_type": "stream",
480 | "text": [
481 | "apply: 0%| | 0/212506 [00:00, ?it/s]Building prefix dict from the default dictionary ...\n",
482 | "Dumping model to file cache C:\\Users\\chxue\\AppData\\Local\\Temp\\jieba.cache\n",
483 | "Loading model cost 0.858 seconds.\n",
484 | "Prefix dict has been built successfully.\n",
485 | "apply: 100%|█████████████████████████████████████████████████████████████████| 212506/212506 [00:53<00:00, 3994.68it/s]\n"
486 | ]
487 | }
488 | ],
489 | "source": [
490 | "import jieba\n",
491 | "def comment_cut(content):\n",
492 | " return list(jieba.cut(content.strip()))\n",
493 | "\n",
494 | "# show a progress bar\n",
495 | "tqdm.pandas(desc='apply')\n",
496 | "data['comment_processed'] = data['Comment'].progress_apply(comment_cut)"
497 | ]
498 | },
499 | {
500 | "cell_type": "code",
501 | "execution_count": 10,
502 | "metadata": {
503 | "scrolled": true
504 | },
505 | "outputs": [
506 | {
507 | "data": {
567 | "text/plain": [
568 | " Comment Star \\\n",
569 | "0 连奥创都知道整容要去韩国 1 \n",
570 | "1 “一个没有黑暗面的人不值得信任” 第二部剥去冗长的铺垫开场即高潮、一直到结束会有人觉得只剩... 1 \n",
571 | "2 奥创弱爆了弱爆了弱爆了啊!!!!!! 0 \n",
572 | "3 与第一集不同承上启下阴郁严肃但也不会不好看啊除非本来就不喜欢漫威电影场面更加宏大单打与团战... 1 \n",
573 | "4 看毕我激动地对友人说等等奥创要来毁灭台北怎么办厚她拍了拍我肩膀没事反正你买了两份旅行保险惹... 1 \n",
574 | "\n",
575 | " comment_processed \n",
576 | "0 [连, 奥创, 都, 知道, 整容, 要, 去, 韩国] \n",
577 | "1 [“, 一个, 没有, 黑暗面, 的, 人, 不, 值得, 信任, ”, , 第二部, 剥... \n",
578 | "2 [奥创, 弱, 爆, 了, 弱, 爆, 了, 弱, 爆, 了, 啊, !, !, !, !,... \n",
579 | "3 [与, 第一集, 不同, 承上启下, 阴郁, 严肃, 但, 也, 不会, 不, 好看, 啊,... \n",
580 | "4 [看毕, 我, 激动, 地, 对, 友人, 说, 等等, 奥创, 要, 来, 毁灭, 台北,... "
581 | ]
582 | },
583 | "execution_count": 10,
584 | "metadata": {},
585 | "output_type": "execute_result"
586 | }
587 | ],
588 | "source": [
589 | "data.head()"
590 | ]
591 | },
592 | {
593 | "cell_type": "markdown",
594 | "metadata": {},
595 | "source": [
596 | "#### Task 3: define stop words and remove them"
597 | ]
598 | },
599 | {
600 | "cell_type": "code",
601 | "execution_count": 12,
602 | "metadata": {},
603 | "outputs": [
604 | {
605 | "name": "stderr",
606 | "output_type": "stream",
607 | "text": [
608 | "apply: 100%|█████| 212506/212506 [00:35<00:00, 5936.44it/s]\n"
609 | ]
610 | }
611 | ],
612 | "source": [
613 | "# Load the downloaded stop-word list (one word per line, despite the .json extension) into a Python list\n",
614 | "with open(\"data/stopWord.json\",\"r\",encoding='utf-8') as f:\n",
615 | " stopWords = f.read().split(\"\\n\")\n",
616 | "\n",
617 | "# Remove stop words\n",
618 | "def rm_stop_word(wordList):\n",
619 | " return [word for word in wordList if word not in stopWords]\n",
620 | "\n",
621 | "# .progress_apply() behaves exactly like .apply(), but lets tqdm track the call and print a progress bar.\n",
622 | "data['comment_processed'] = data['comment_processed'].progress_apply(rm_stop_word)"
623 | ]
624 | },
625 | {
626 | "cell_type": "code",
627 | "execution_count": 15,
628 | "metadata": {
629 | "scrolled": true
630 | },
631 | "outputs": [
632 | {
633 | "data": {
693 | "text/plain": [
694 | " Comment Star \\\n",
695 | "0 连奥创都知道整容要去韩国 1 \n",
696 | "1 “一个没有黑暗面的人不值得信任” 第二部剥去冗长的铺垫开场即高潮、一直到结束会有人觉得只剩... 1 \n",
697 | "2 奥创弱爆了弱爆了弱爆了啊!!!!!! 0 \n",
698 | "3 与第一集不同承上启下阴郁严肃但也不会不好看啊除非本来就不喜欢漫威电影场面更加宏大单打与团战... 1 \n",
699 | "4 看毕我激动地对友人说等等奥创要来毁灭台北怎么办厚她拍了拍我肩膀没事反正你买了两份旅行保险惹... 1 \n",
700 | "\n",
701 | " comment_processed \n",
702 | "0 [奥创, 知道, 整容, 韩国] \n",
703 | "1 [一个, 没有, 黑暗面, 值得, 信任, , 第二部, 剥去, 冗长, 铺垫, 开场, ... \n",
704 | "2 [奥创, 弱, 爆, 弱, 爆, 弱, 爆] \n",
705 | "3 [第一集, 不同, 承上启下, 阴郁, 严肃, 不会, 好看, 本来, 喜欢, 漫威, 电影... \n",
706 | "4 [看毕, 激动, 友人, 说, 奥创, 毁灭, 台北, 厚, 拍了拍, 肩膀, 没事, 反正... "
707 | ]
708 | },
709 | "execution_count": 15,
710 | "metadata": {},
711 | "output_type": "execute_result"
712 | }
713 | ],
714 | "source": [
715 | "data.head()"
716 | ]
717 | },
718 | {
719 | "cell_type": "markdown",
720 | "metadata": {},
721 | "source": [
722 | "#### Task 4: drop low-frequency words (fewer than 10 occurrences)"
723 | ]
724 | },
725 | {
726 | "cell_type": "code",
727 | "execution_count": 17,
728 | "metadata": {},
729 | "outputs": [
730 | {
731 | "name": "stdout",
732 | "output_type": "stream",
733 | "text": [
734 | "106235\n"
735 | ]
736 | }
737 | ],
738 | "source": [
739 | "# Build a dictionary over the whole corpus to count every word's frequency\n",
740 | "word_dic={}\n",
741 | "for comment in data['comment_processed']:\n",
742 | " for word in comment:\n",
743 | " if word in word_dic:\n",
744 | " word_dic[word]+=1\n",
745 | " else:\n",
746 | " word_dic[word]=1\n",
747 | "print(len(word_dic))"
748 | ]
749 | },
750 | {
751 | "cell_type": "code",
752 | "execution_count": 19,
753 | "metadata": {},
754 | "outputs": [
755 | {
756 | "name": "stdout",
757 | "output_type": "stream",
758 | "text": [
759 | "[('奥创', 302), ('知道', 6022), ('整容', 79), ('韩国', 1633), ('一个', 13852), ('没有', 18448), ('黑暗面', 31), ('值得', 3056), ('信任', 223), (' ', 98052)]\n"
760 | ]
761 | }
762 | ],
763 | "source": [
764 | "print(list(word_dic.items())[:10])"
765 | ]
766 | },
767 | {
768 | "cell_type": "code",
769 | "execution_count": 20,
770 | "metadata": {},
771 | "outputs": [
772 | {
773 | "name": "stderr",
774 | "output_type": "stream",
775 | "text": [
776 | "apply: 100%|███| 212506/212506 [00:00<00:00, 269018.58it/s]\n"
777 | ]
778 | }
779 | ],
780 | "source": [
781 | "def rm_low_freq_word(wordList):\n",
782 | " return [w for w in wordList if word_dic[w]>=10]\n",
783 | "\n",
784 | "data['comment_processed'] = data['comment_processed'].progress_apply(rm_low_freq_word)"
785 | ]
786 | },
787 | {
788 | "cell_type": "code",
789 | "execution_count": 21,
790 | "metadata": {},
791 | "outputs": [
792 | {
793 | "data": {
853 | "text/plain": [
854 | " Comment Star \\\n",
855 | "0 连奥创都知道整容要去韩国 1 \n",
856 | "1 “一个没有黑暗面的人不值得信任” 第二部剥去冗长的铺垫开场即高潮、一直到结束会有人觉得只剩... 1 \n",
857 | "2 奥创弱爆了弱爆了弱爆了啊!!!!!! 0 \n",
858 | "3 与第一集不同承上启下阴郁严肃但也不会不好看啊除非本来就不喜欢漫威电影场面更加宏大单打与团战... 1 \n",
859 | "4 看毕我激动地对友人说等等奥创要来毁灭台北怎么办厚她拍了拍我肩膀没事反正你买了两份旅行保险惹... 1 \n",
860 | "\n",
861 | " comment_processed \n",
862 | "0 [奥创, 知道, 整容, 韩国] \n",
863 | "1 [一个, 没有, 黑暗面, 值得, 信任, , 第二部, 冗长, 铺垫, 开场, 高潮, ... \n",
864 | "2 [奥创, 弱, 爆, 弱, 爆, 弱, 爆] \n",
865 | "3 [第一集, 不同, 承上启下, 阴郁, 严肃, 不会, 好看, 本来, 喜欢, 漫威, 电影... \n",
866 | "4 [激动, 友人, 说, 奥创, 毁灭, 台北, 厚, 肩膀, 没事, 反正, 买, 两份, ... "
867 | ]
868 | },
869 | "execution_count": 21,
870 | "metadata": {},
871 | "output_type": "execute_result"
872 | }
873 | ],
874 | "source": [
875 | "data.head()"
876 | ]
877 | },
878 | {
879 | "cell_type": "markdown",
880 | "metadata": {},
881 | "source": [
882 | "### 2. Split the text into training and test sets\n",
883 | "Use 20% of the corpus as test data and the rest as training data"
884 | ]
885 | },
886 | {
887 | "cell_type": "code",
888 | "execution_count": 23,
889 | "metadata": {},
890 | "outputs": [],
891 | "source": [
892 | "from sklearn.model_selection import train_test_split\n",
893 | "train, test = train_test_split(data, test_size = 0.2)\n",
894 | "x_train = train['comment_processed']\n",
895 | "x_test = test['comment_processed']\n",
896 | "y_train = train['Star']\n",
897 | "y_test = test['Star']"
898 | ]
899 | },
900 | {
901 | "cell_type": "markdown",
902 | "metadata": {},
903 | "source": [
904 | "### 3. Convert the text into vectors\n",
905 | "\n",
906 | "Here we use three different representations:\n",
907 | "- tf-idf vectors\n",
908 | "- word2vec\n",
909 | "- BERT vectors\n",
910 | "\n",
911 | "After vectorizing, we move on to model training"
912 | ]
913 | },
914 | {
915 | "cell_type": "markdown",
916 | "metadata": {},
917 | "source": [
918 | "#### Task 6: convert the text into tf-idf vectors"
919 | ]
920 | },
921 | {
922 | "cell_type": "code",
923 | "execution_count": 24,
924 | "metadata": {},
925 | "outputs": [],
926 | "source": [
927 | "comments_train_concat = x_train.apply(lambda x:' '.join(x))\n",
928 | "comments_test_concat = x_test.apply(lambda x:' '.join(x))"
929 | ]
930 | },
931 | {
932 | "cell_type": "code",
933 | "execution_count": 27,
934 | "metadata": {},
935 | "outputs": [
936 | {
937 | "data": {
938 | "text/plain": [
939 | "188015 [不停, 穿越, 不停, 转换, , 确实, 冲击, , 跟随, 剧情, , 完整, ...\n",
940 | "193033 [星半]\n",
941 | "163310 [ , 其实, , 没有, 纸醉金迷, ......, 林萧, 有点, 失望, .......]\n",
942 | "101549 [演员, 喜欢, 没, 演技, 撑, 不起, 一部, 电影]\n",
943 | "43776 [太, 难看]\n",
944 | "Name: comment_processed, dtype: object"
945 | ]
946 | },
947 | "execution_count": 27,
948 | "metadata": {},
949 | "output_type": "execute_result"
950 | }
951 | ],
952 | "source": [
953 | "x_train.head()"
954 | ]
955 | },
956 | {
957 | "cell_type": "code",
958 | "execution_count": 26,
959 | "metadata": {
960 | "scrolled": true
961 | },
962 | "outputs": [
963 | {
964 | "data": {
965 | "text/plain": [
966 | "188015 不停 穿越 不停 转换 确实 冲击 跟随 剧情 完整 紧凑 没 说 恍惚 ~ ~\n",
967 | "193033 星半\n",
968 | "163310 其实 没有 纸醉金迷 ...... 林萧 有点 失望 .......\n",
969 | "101549 演员 喜欢 没 演技 撑 不起 一部 电影\n",
970 | "43776 太 难看\n",
971 | "Name: comment_processed, dtype: object"
972 | ]
973 | },
974 | "execution_count": 26,
975 | "metadata": {},
976 | "output_type": "execute_result"
977 | }
978 | ],
979 | "source": [
980 | "comments_train_concat.head()"
981 | ]
982 | },
983 | {
984 | "cell_type": "code",
985 | "execution_count": 28,
986 | "metadata": {},
987 | "outputs": [
988 | {
989 | "name": "stdout",
990 | "output_type": "stream",
991 | "text": [
992 | "(170004, 14880) (42502, 14880)\n"
993 | ]
994 | }
995 | ],
996 | "source": [
997 | "vectorizer = CountVectorizer() # counts stored as a sparse matrix, efficient for storage\n",
998 | "trans = TfidfTransformer() # turns the space-joined word counts into tf-idf sentence vectors\n",
999 | "\n",
1000 | "word_count_train = vectorizer.fit_transform(comments_train_concat) # fit_transform learns the vocabulary and transforms\n",
1001 | "tfidf_train = trans.fit_transform(word_count_train)\n",
1002 | "\n",
1003 | "word_count_test = vectorizer.transform(comments_test_concat) # transform only: reuse the vocabulary fitted on the training set\n",
1004 | "tfidf_test = trans.transform(word_count_test)\n",
1005 | "\n",
1006 | "print(tfidf_train.shape, tfidf_test.shape)"
1007 | ]
1008 | },
1009 | {
1010 | "cell_type": "code",
1011 | "execution_count": 29,
1012 | "metadata": {
1013 | "scrolled": true
1014 | },
1015 | "outputs": [
1016 | {
1017 | "data": {
1018 | "text/plain": [
1019 | "<170004x14880 sparse matrix of type '<class 'numpy.float64'>'\n",
1020 | "\twith 1468830 stored elements in Compressed Sparse Row format>"
1021 | ]
1022 | },
1023 | "execution_count": 29,
1024 | "metadata": {},
1025 | "output_type": "execute_result"
1026 | }
1027 | ],
1028 | "source": [
1029 | "tfidf_train"
1030 | ]
1031 | },
1032 | {
1033 | "cell_type": "markdown",
1034 | "metadata": {},
1035 | "source": [
1036 | "#### Task 7: convert the text into word2vec vectors"
1037 | ]
1038 | },
1039 | {
1040 | "cell_type": "code",
1041 | "execution_count": 30,
1042 | "metadata": {},
1043 | "outputs": [],
1044 | "source": [
1045 | "# Load the pretrained word-vector file with KeyedVectors.load_word2vec_format()\n",
1046 | "model = KeyedVectors.load_word2vec_format('data/sgns.zhihu.word')"
1047 | ]
1048 | },
1049 | {
1050 | "cell_type": "code",
1051 | "execution_count": 33,
1052 | "metadata": {},
1053 | "outputs": [
1054 | {
1055 | "data": {
1056 | "text/plain": [
1057 | "300"
1058 | ]
1059 | },
1060 | "execution_count": 33,
1061 | "metadata": {},
1062 | "output_type": "execute_result"
1063 | }
1064 | ],
1065 | "source": [
1066 | "model['今天'].shape[0]"
1067 | ]
1068 | },
1069 | {
1070 | "cell_type": "code",
1071 | "execution_count": 49,
1072 | "metadata": {},
1073 | "outputs": [
1074 | {
1075 | "data": {
1076 | "text/plain": [
1077 | "80"
1078 | ]
1079 | },
1080 | "execution_count": 49,
1081 | "metadata": {},
1082 | "output_type": "execute_result"
1083 | }
1084 | ],
1085 | "source": [
1086 | "# Training on the full dataset takes a long time, so try the approach on a small sample first\n",
1087 | "x_train = x_train[:80]\n",
1088 | "x_test = x_test[:20]\n",
1089 | "y_train = y_train[:80]\n",
1090 | "y_test = y_test[:20]\n",
1091 | "len(x_train)"
1092 | ]
1093 | },
1094 | {
1095 | "cell_type": "code",
1096 | "execution_count": 50,
1097 | "metadata": {},
1098 | "outputs": [
1099 | {
1100 | "name": "stderr",
1101 | "output_type": "stream",
1102 | "text": [
1103 | "\n",
1104 | "apply: 0%| | 0/80 [00:00, ?it/s]\u001b[A\n",
1105 | "apply: 12%|█▉ | 10/80 [00:00<00:00, 95.50it/s]\u001b[A\n",
1106 | "apply: 41%|█████▊ | 33/80 [00:00<00:00, 114.84it/s]\u001b[A\n",
1107 | "apply: 100%|██████████████| 80/80 [00:00<00:00, 215.14it/s]\u001b[A\n",
1108 | "\n",
1109 | "apply: 100%|██████████████| 20/20 [00:00<00:00, 392.58it/s]\u001b[A"
1110 | ]
1111 | },
1112 | {
1113 | "name": "stdout",
1114 | "output_type": "stream",
1115 | "text": [
1116 | "(80, 300) (20, 300)\n"
1117 | ]
1118 | },
1119 | {
1120 | "name": "stderr",
1121 | "output_type": "stream",
1122 | "text": [
1123 | "\n"
1124 | ]
1125 | }
1126 | ],
1127 | "source": [
1128 | "# Build a sentence vector for each comment by averaging its word vectors\n",
1129 | "sentence_len=model['今天'].shape[0]\n",
1130 | "def get_sentence_vec(comment):\n",
1131 | " sentence_vec=np.zeros(sentence_len)\n",
1132 | " word_num=0\n",
1133 | " for word in comment:\n",
1134 | " if word in model:\n",
1135 | " sentence_vec+=model[word]\n",
1136 | " word_num+=1\n",
1137 | " return sentence_vec/word_num\n",
1138 | "\n",
1139 | "word2vec_train = np.vstack(x_train.progress_apply(get_sentence_vec))\n",
1140 | "word2vec_test = np.vstack(x_test.progress_apply(get_sentence_vec))\n",
1141 | "\n",
1142 | "print (word2vec_train.shape, word2vec_test.shape)"
1143 | ]
1144 | },
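The averaging in `get_sentence_vec` divides by `word_num`, which is 0 for comments whose words are all out of vocabulary; that division is where the NaN rows inspected in the next cells come from. A defensive variant (a toy dict stands in for the KeyedVectors model here) returns a zero vector instead:

```python
import numpy as np

# Toy stand-in for the pretrained KeyedVectors model: word -> vector.
toy_model = {"good": np.array([1.0, 0.0]), "movie": np.array([0.0, 1.0])}
dim = 2

def get_sentence_vec_safe(comment, model=toy_model, sentence_len=dim):
    # Average the vectors of in-vocabulary words.
    vecs = [model[w] for w in comment if w in model]
    if not vecs:
        # No in-vocabulary word: return zeros instead of 0/0 = NaN.
        return np.zeros(sentence_len)
    return np.mean(vecs, axis=0)

print(get_sentence_vec_safe(["good", "movie"]))   # average of the two vectors
print(get_sentence_vec_safe(["unseen", "words"])) # all-zero vector, no NaN
```

With this guard the later NaN clean-up step becomes unnecessary, though zeroing the rows afterwards, as the notebook does, gives the same result.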
1145 | {
1146 | "cell_type": "code",
1147 | "execution_count": 51,
1148 | "metadata": {
1149 | "scrolled": false
1150 | },
1151 | "outputs": [
1152 | {
1153 | "name": "stdout",
1154 | "output_type": "stream",
1155 | "text": [
1156 | "[[-0.085519 0.43515207 -0.17389666 ... 0.20313986 0.073142\n",
1157 | " -0.07703887]\n",
1158 | " [ nan nan nan ... nan nan\n",
1159 | " nan]\n",
1160 | " [-0.00253413 0.31285712 0.03232525 ... 0.06110513 0.09487825\n",
1161 | " -0.0657875 ]\n",
1162 | " ...\n",
1163 | " [-0.090169 0.42740925 -0.217206 ... -0.065442 0.09200576\n",
1164 | " -0.041123 ]\n",
1165 | " [-0.04754096 0.42249026 -0.143586 ... 0.05339443 0.03275817\n",
1166 | " -0.080588 ]\n",
1167 | " [-0.16413033 0.3194075 -0.00936734 ... 0.084864 0.14636583\n",
1168 | " -0.0699775 ]]\n",
1169 | "True\n"
1170 | ]
1171 | }
1172 | ],
1173 | "source": [
1174 | "# Inspect the vectors: some rows contain NaN\n",
1175 | "print(word2vec_train[:10])\n",
1176 | "print(np.isnan(word2vec_train).any())"
1177 | ]
1178 | },
1179 | {
1180 | "cell_type": "code",
1181 | "execution_count": 52,
1182 | "metadata": {},
1183 | "outputs": [],
1184 | "source": [
1185 | "# Replace the NaN entries with 0\n",
1186 | "word2vec_train[np.isnan(word2vec_train)]=0\n",
1187 | "word2vec_test[np.isnan(word2vec_test)] = 0"
1188 | ]
1189 | },
1190 | {
1191 | "cell_type": "code",
1192 | "execution_count": 53,
1193 | "metadata": {},
1194 | "outputs": [
1195 | {
1196 | "name": "stdout",
1197 | "output_type": "stream",
1198 | "text": [
1199 | "False\n",
1200 | "False\n"
1201 | ]
1202 | }
1203 | ],
1204 | "source": [
1205 | "# Confirm no NaN values remain\n",
1206 | "print(np.isnan(word2vec_train).any())\n",
1207 | "print(np.isnan(word2vec_test).any())"
1208 | ]
1209 | },
1210 | {
1211 | "cell_type": "markdown",
1212 | "metadata": {},
1213 | "source": [
1214 | "#### Task 8: Convert the text to BERT vectors"
1215 | ]
1216 | },
1217 | {
1218 | "cell_type": "code",
1219 | "execution_count": 8,
1220 | "metadata": {},
1221 | "outputs": [],
1222 | "source": [
1223 | "embedding_en = BertEmbedding()  # default English model; a multilingual one is loaded in the next cell"
1224 | ]
1225 | },
1226 | {
1227 | "cell_type": "code",
1228 | "execution_count": 9,
1229 | "metadata": {
1230 | "scrolled": true
1231 | },
1232 | "outputs": [
1233 | {
1234 | "name": "stderr",
1235 | "output_type": "stream",
1236 | "text": [
1237 | "D:\\Downloads\\Anaconda\\envs\\NLP_class\\lib\\site-packages\\gluonnlp\\model\\bert.py:693: UserWarning: wiki_cn/wiki_multilingual will be deprecated. Please use wiki_cn_cased/wiki_multilingual_uncased instead.\n",
1238 | " warnings.warn('wiki_cn/wiki_multilingual will be deprecated.'\n"
1239 | ]
1240 | }
1241 | ],
1242 | "source": [
1243 | "ctx = mxnet.cpu()\n",
1244 | "embedding = BertEmbedding(ctx=ctx, dataset_name='wiki_multilingual')"
1245 | ]
1246 | },
1247 | {
1248 | "cell_type": "code",
1249 | "execution_count": 10,
1250 | "metadata": {},
1251 | "outputs": [],
1252 | "source": [
1253 | "result = embedding(['今天'])"
1254 | ]
1255 | },
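As the output below shows, `bert_embedding` returns one `(tokens, token_vectors)` pair per input sentence, with '今天' split into the characters '今' and '天'. To get a single fixed-length sentence vector, the per-token embeddings can be mean-pooled. A sketch with mocked vectors (the 768-dimension shape is assumed from the library's BERT-base models):

```python
import numpy as np

# Mocked output in bert_embedding's format: [(tokens, token_vectors), ...].
tokens = ["今", "天"]
token_vecs = [np.random.rand(768).astype(np.float32) for _ in tokens]
result = [(tokens, token_vecs)]

def pool_sentence(pair):
    # Mean-pool the per-token embeddings into one sentence vector.
    _, vecs = pair
    return np.stack(vecs).mean(axis=0)

sent_vec = pool_sentence(result[0])
print(sent_vec.shape)  # (768,)
```

Mean pooling keeps the sentence representation the same length regardless of how many tokens the comment has, matching what the word2vec averaging did in Task 7.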
1256 | {
1257 | "cell_type": "code",
1258 | "execution_count": 11,
1259 | "metadata": {},
1260 | "outputs": [
1261 | {
1262 | "data": {
1263 | "text/plain": [
1264 | "[(['今', '天'],\n",
1265 | " [array([ 1.26106322e-01, 2.66802341e-01, 8.72310027e-02, -6.03290349e-02,\n",
1266 | " -7.75884330e-01, -1.13786548e-01, 3.59727591e-01, 1.85813323e-01,\n",
1267 | " -1.47403002e+00, -1.19097352e-01, -7.38840699e-02, -7.87430704e-01,\n",
1268 | " -9.21352804e-02, 1.61335051e-01, -8.64775032e-02, 2.56176621e-01,\n",
1269 | " 8.67899284e-02, -1.32034823e-01, -8.10163245e-02, -1.74523041e-01,\n",
1270 | " 3.95636708e-02, 7.05541968e-02, -1.02931105e-01, -2.12510914e-01,\n",
1271 | " 7.06533492e-01, -6.93008751e-02, 8.60075727e-02, -2.62604266e-01,\n",
1272 | " -1.59369946e+00, -1.49633080e-01, -1.63492024e-01, 5.32592773e-01,\n",
1273 | " -4.69610304e-01, -6.60765246e-02, -1.39505550e-01, -3.14024031e-01,\n",
1274 | " 7.26260841e-02, 1.34167528e+00, -4.96392064e-02, -3.96719128e-01,\n",
1275 | " 6.35222569e-02, -1.24420702e-01, -6.83844239e-02, 9.81433690e-03,\n",
1276 | " -3.70132998e-02, -3.15945178e-01, 6.09729528e-01, 1.38048172e-01,\n",
1277 | " -6.59693405e-02, -9.41009894e-02, -7.49259174e-01, 3.76792192e-01,\n",
1278 | " 2.29001790e-01, -4.36345339e-01, -2.04384401e-02, 1.27597496e-01,\n",
1279 | " -2.80143678e-01, 4.38202657e-02, 1.90831870e-01, 2.05787390e-01,\n",
1280 | " 1.03129065e+00, -4.77374680e-02, -2.58414537e-01, 1.24643020e-01,\n",
1281 | " -5.21558076e-02, -1.89845592e-01, 5.06949723e-02, 7.42469877e-02,\n",
1282 | " -1.77084714e-01, 1.19201057e-01, 1.99304760e-01, 2.05253616e-01,\n",
1283 | " 2.32253879e-01, -3.04696709e-01, 2.71507531e-01, 1.94131047e-01,\n",
1284 | " 1.54946834e-01, -2.09391147e-01, -1.40534580e-01, 1.79154813e-01,\n",
1285 | " 1.66331947e-01, 2.83545375e-01, -4.85008568e-01, -1.10102713e-01,\n",
1286 | " -4.37588692e-02, 3.53730649e-01, 7.23953396e-02, -4.11256403e-02,\n",
1287 | " -8.49407092e-02, -3.21422130e-01, 6.40220046e-01, -3.01744163e-01,\n",
1288 | " -4.54312176e-01, 7.76606202e-02, -4.19383854e-01, 4.01837938e-02,\n",
1289 | " -3.23261052e-01, 1.00842738e+00, -2.18932271e-01, 2.30585501e-01,\n",
1290 | " 1.68398395e-02, 1.46595612e-01, 1.70177445e-01, -2.34217346e-01,\n",
1291 | " 2.61459291e-01, 2.28032231e-01, 1.65102705e-01, 1.80528983e-01,\n",
1292 | " -2.26848245e-01, -2.89855778e-01, 4.55773324e-01, -3.38162422e-01,\n",
1293 | " 2.56421328e-01, -3.19461167e-01, -6.63729571e-03, -3.59232277e-01,\n",
1294 | " -2.51614779e-01, -5.85042655e-01, 2.99407244e-02, 1.08890526e-01,\n",
1295 | " 1.09779947e-01, 1.17927873e+00, 7.00526461e-02, -1.68722495e-01,\n",
1296 | " 1.58386186e-01, 1.07950479e-01, -5.55427194e-01, 4.00758743e-01,\n",
1297 | " 8.85060951e-02, -2.95245498e-02, 8.94615203e-02, 1.16875812e-01,\n",
1298 | " 9.45655406e-01, -2.48193353e-01, -1.67217374e-01, -1.24813199e-01,\n",
1299 | " -3.57200444e-01, -8.91825855e-02, 4.10464913e-01, 8.75420943e-02,\n",
1300 | " 2.35141397e-01, -2.41117239e-01, -1.64400697e-01, -3.88971031e-01,\n",
1301 | " -1.03416644e-01, 1.00639954e-01, 1.52441904e-01, -3.37588906e-01,\n",
1302 | " 5.99943027e-02, 8.34143162e-02, 1.95154741e-01, 5.97074449e-01,\n",
1303 | " -9.98994336e-04, 5.94795123e-03, -7.06354082e-02, -8.68372619e-02,\n",
1304 | " -4.43338275e-01, -4.76958960e-01, 3.45992565e-01, -2.50542855e+00,\n",
1305 | " -1.75109133e-01, 1.71542510e-01, 3.99805456e-01, 1.02809697e-01,\n",
1306 | " 8.99637818e-01, 2.62709528e-01, -3.96312475e-01, 1.22865485e-02,\n",
1307 | " -6.69312835e-01, 1.47285536e-01, -1.25258565e-01, 8.64879191e-02,\n",
1308 | " 1.54150099e-01, 9.46274400e-02, -2.97961265e-01, 6.06270730e-01,\n",
1309 | " 1.19781710e-01, -3.04604024e-01, -1.67918161e-01, 1.73134714e-01,\n",
1310 | " -1.42603189e-01, 5.63433588e-01, 2.62960315e-01, 2.93688178e-01,\n",
1311 | " -2.41111428e-01, 2.52504230e-01, -2.10302398e-01, 1.78636029e-01,\n",
1312 | " -4.46130931e-02, -2.98608124e-01, 1.87268883e-01, -4.24292274e-02,\n",
1313 | " 3.30697410e-02, 6.96135089e-02, -8.97845775e-02, -2.84802735e-01,\n",
1314 | " 7.24574700e-02, 1.40984729e-03, -2.09936619e-01, -4.63914573e-02,\n",
1315 | " -4.66850728e-01, 8.46646950e-02, 2.63746083e-01, 2.40564883e-01,\n",
1316 | " -3.36567491e-01, -2.39261061e-01, -6.79055274e-01, 1.69596106e-01,\n",
1317 | " -1.59863949e-01, -2.59549052e-01, 2.85877913e-01, 8.04904997e-02,\n",
1318 | " -2.16998607e-01, 3.64427231e-02, -4.19895351e-03, 2.11900640e-02,\n",
1319 | " 1.31882980e-01, -1.63818710e-02, 3.26660365e-01, -2.90987492e-02,\n",
1320 | " -1.40660316e-01, -1.91358663e-02, -5.07389307e-02, -5.15377149e-02,\n",
1321 | " -7.72325322e-02, 2.51942754e-01, -5.22490501e-01, 8.03613942e-03,\n",
1322 | " -1.88994348e-01, -1.63445592e-01, -1.13846026e-01, -3.61358747e-02,\n",
1323 | " -2.69619554e-01, -9.06395726e-03, 1.09605975e-01, -2.84056127e-01,\n",
1324 | " -2.32409030e-01, 1.17916115e-01, 8.92082453e-02, -6.59729913e-02,\n",
1325 | " 2.91591167e-01, -1.24273121e-01, 2.01254100e-01, -4.53801304e-02,\n",
1326 | " 1.33521545e+00, 5.70048511e-01, 1.32188246e-01, -6.67885691e-03,\n",
1327 | " 3.68599325e-01, 1.69576973e-01, 3.06160986e-01, 3.08821976e-01,\n",
1328 | " -1.45869899e+00, -1.40947878e-01, -1.81350201e-01, -1.27963245e-01,\n",
1329 | " 4.61721599e-01, -8.30608457e-02, -6.22207463e-01, -3.87557037e-02,\n",
1330 | " -1.02501965e+00, -1.61401615e-01, 1.01217628e-01, 1.29409418e-01,\n",
1331 | " -1.57489806e-01, -2.99530357e-01, -8.41979086e-02, -2.59303957e-01,\n",
1332 | " -1.85507134e-01, -2.17634276e-01, -3.18406940e-01, -8.77603441e-02,\n",
1333 | " -1.03986733e-01, 1.31400645e-01, 1.02642681e-02, -3.55789423e-01,\n",
1334 | " 3.74548972e-01, 1.22185118e-01, 4.40281510e-01, -2.77468950e-01,\n",
1335 | " -5.11380196e-01, -4.45399523e-01, 6.50682077e-02, 3.84544164e-01,\n",
1336 | " 2.65027970e-01, -4.27128136e-01, 2.79622197e-01, 1.42016113e-01,\n",
1337 | " 2.84151644e-01, -2.00458057e-03, -2.31731749e+00, 7.58050829e-02,\n",
1338 | " 3.53268713e-01, -2.52083428e-02, -1.25145584e-01, -7.88250417e-02,\n",
1339 | " 1.68024912e-01, 1.11410409e-01, 3.76170203e-02, -3.74360859e-01,\n",
1340 | " -1.76447108e-01, -1.80362284e-01, -8.76877010e-02, -1.52705580e-01,\n",
1341 | " 2.13665605e-01, 6.94131404e-02, -3.55336875e-01, 3.54051948e-01,\n",
1342 | " -6.56578660e-01, 1.79975390e-01, -1.09655797e+00, 2.34456658e-01,\n",
1343 | " 1.08367577e-01, -2.76697040e-01, 1.44788101e-02, 9.24978554e-02,\n",
1344 | " 1.16660610e-01, 3.34203094e-02, -2.57551223e-01, 1.11221418e-01,\n",
1345 | " 1.74321383e-02, 4.64707464e-01, 8.76464993e-02, -3.02211680e-02,\n",
1346 | " 2.02788696e-01, -1.76837921e-01, 5.86101748e-02, -3.23150218e-01,\n",
1347 | " 3.98626864e-01, -7.29966909e-02, 8.50926414e-02, 2.77387381e-01,\n",
1348 | " -3.13408375e-01, -1.45106584e-01, -1.26967400e-01, -3.30692708e-01,\n",
1349 | " -9.21746641e-02, 6.08755574e-02, -3.19018736e-02, -1.76300272e-01,\n",
1350 | " -4.67699915e-02, -4.67610896e-01, -9.82120186e-02, -1.19484149e-01,\n",
1351 | " -8.63983110e-02, 5.03668413e-02, 6.13780171e-02, -1.12738304e-01,\n",
1352 | " -4.61732268e-01, -1.94343507e-01, 4.68280166e-03, -4.51166153e-01,\n",
1353 | " -4.47263658e-01, -1.53689012e-01, 1.44705981e-01, -1.52902856e-01,\n",
1354 | " 1.27458975e-01, 1.39114633e-01, 1.21240847e-01, 2.70443529e-01,\n",
1355 | " -1.07925482e-01, 4.49916907e-02, -2.01556966e-01, 2.12367550e-02,\n",
1356 | " 3.18209156e-02, -1.78075349e+00, 1.22928545e-01, 4.71502066e-01,\n",
1357 | " -8.31669867e-02, 8.07472318e-02, 3.55381757e-01, -2.20013529e-01,\n",
1358 | " 7.17123926e-01, 4.06115681e-01, -4.97565158e-02, -2.63155341e-01,\n",
1359 | " -1.87172025e-01, 2.92106569e-02, -3.06485176e-01, 1.79982558e-03,\n",
1360 | " 4.29848917e-02, -1.07468590e-02, -4.95602451e-02, 4.56169456e-01,\n",
1361 | " -4.31179889e-02, -4.22874391e-01, 2.73265064e-01, -2.88694918e-01,\n",
1362 | " -2.47252788e-02, 2.36432642e-01, 2.48168856e-01, 3.00108343e-01,\n",
1363 | " 7.38754272e-02, -3.16119254e-01, 5.92554808e-01, -7.37164497e-01,\n",
1364 | " 4.27768767e-01, -1.13189481e-01, 3.50129873e-01, 1.28235281e+00,\n",
1365 | " -3.31415772e-01, 9.66605246e-02, -2.40004495e-01, 1.38831961e+00,\n",
1366 | " 5.00005856e-02, 8.48281011e-02, 9.96230766e-02, 3.71624529e-01,\n",
1367 | " 1.40831396e-01, -1.18550634e+00, -1.67928290e+00, 2.30228707e-01,\n",
1368 | " 8.57262537e-02, -2.77547932e+00, -1.13164201e-01, 4.78007831e-03,\n",
1369 | " 7.50111938e-02, 6.44212812e-02, -5.72150230e-01, -1.37697339e-01,\n",
1370 | " 2.76656240e-01, -9.72450435e-01, -3.00585121e-01, 7.80006796e-02,\n",
1371 | " 2.94482291e-01, -5.49518466e-01, 2.56707162e-01, 2.93736830e-02,\n",
1372 | " 1.30949342e+00, -6.27330691e-03, 2.00231671e-02, -3.41446027e-02,\n",
1373 | " 2.68776119e-01, -1.56018570e-01, 3.56860816e-01, 1.41598657e-03,\n",
1374 | " 2.22957969e-01, -2.35503107e-01, 2.44242787e-01, -1.65964998e-02,\n",
1375 | " 2.00391412e-01, -9.69057251e-03, 6.15444779e-03, 1.78755075e-03,\n",
1376 | " -2.29847193e-01, 9.46993232e-02, -1.37527660e-01, 1.84551144e+00,\n",
1377 | " -1.93628371e-02, 3.48703712e-01, -2.39256501e-01, 1.21311188e-01,\n",
1378 | " 2.59126544e-01, 5.47014177e-03, 4.43561733e-01, 1.59504414e-01,\n",
1379 | " -5.83179630e-02, -3.43183517e-01, 2.20565796e-01, 1.72658253e+00,\n",
1380 | " -2.05172300e-02, 2.14464128e-01, 3.41670930e-01, 8.80390525e-01,\n",
1381 | " 4.26010042e-03, 1.95064694e-01, 1.09338000e-01, -9.70373526e-02,\n",
1382 | " -5.03570922e-02, -6.46455809e-02, 6.07973635e-01, 6.83265030e-01,\n",
1383 | " -7.07271993e-01, -2.53319442e-01, 1.21342614e-01, -3.17116380e-01,\n",
1384 | " -2.92985402e-02, 6.84522465e-02, -2.34826624e-01, 1.85807377e-01,\n",
1385 | " -2.53703982e-01, -2.72399187e-03, 4.68543619e-01, 2.34067529e-01,\n",
1386 | " 1.14263453e-01, 3.06182206e-01, -6.61556870e-02, 1.58093631e-01,\n",
1387 | " -1.14024051e-01, -1.41105995e-01, -6.51377290e-02, -2.05903813e-01,\n",
1388 | " 1.53660074e-01, -3.46582532e-02, -9.20230150e-01, -1.63497627e-02,\n",
1389 | " -9.38337147e-02, -2.43473351e-01, 4.93378133e-01, 9.99102056e-01,\n",
1390 | " 7.27729201e-02, 4.63724844e-02, 3.25009078e-01, 2.73074269e-01,\n",
1391 | " -1.73848212e-01, -2.55281329e-01, 1.36712766e+00, -3.93186539e-01,\n",
1392 | " -2.09334925e-01, 8.77970532e-02, 7.92194903e-03, 3.22975516e-01,\n",
1393 | " -2.29011640e-01, 8.16138387e-02, -2.44001374e-01, 1.51399374e-01,\n",
1394 | " -3.71974334e-02, 8.22767541e-02, 7.56590962e-02, -3.26808482e-01,\n",
1395 | " 1.45708874e-01, -2.93443978e-01, 1.15933500e-01, -1.25427306e-01,\n",
1396 | " 3.07911158e-01, -2.14309558e-01, 5.68735003e-02, -5.91122136e-02,\n",
1397 | " -3.39143872e-01, 1.53029454e+00, -3.25454295e-01, 3.35788354e-02,\n",
1398 | " -2.98235625e-01, -4.90363210e-01, -3.14886630e-01, 7.61448294e-02,\n",
1399 | " 4.91844356e-01, 4.01459396e-01, 5.85890561e-02, 1.88470259e-01,\n",
1400 | " -7.65935257e-02, -1.05816878e-01, -5.41607626e-02, 1.01357079e+00,\n",
1401 | " -1.55723348e-01, 3.88455093e-01, -5.03053308e-01, 7.38272369e-01,\n",
1402 | " -2.06884779e-02, -4.38171327e-01, -2.91449606e-01, 1.76572651e-01,\n",
1403 | " 4.28900808e-01, -1.38760414e-02, 1.71694219e-01, 4.26687181e-01,\n",
1404 | " -3.02698821e-01, 2.45089561e-01, -3.93071920e-01, -6.40760660e-01,\n",
1405 | " 3.91694039e-01, -1.15061805e-01, 8.06533694e-02, -1.92064738e+00,\n",
1406 | " -1.07382469e-01, 5.60993031e-02, 4.74147320e-01, 1.52929157e-01,\n",
1407 | " -1.51162986e-02, 4.37012594e-03, -1.17910467e-01, 2.80901849e-01,\n",
1408 | " -1.48666799e-01, 2.85651926e-02, 5.18093467e-01, -1.84381470e-01,\n",
1409 | " -3.30267310e-01, 1.22551285e-02, -1.85444206e-01, -2.56561995e-01,\n",
1410 | " 1.91102132e-01, -6.46088049e-02, 7.74124265e-02, 3.34806025e-01,\n",
1411 | " 1.68648645e-01, 2.86596149e-01, -2.16023773e-01, 6.21371232e-02,\n",
1412 | " 5.54141700e-02, -1.17394432e-01, -3.19062829e-01, 3.74340788e-02,\n",
1413 | " 4.93071347e-01, -1.29419476e-01, -6.29059970e-02, -3.26554105e-02,\n",
1414 | " 5.72735891e-02, -4.83003259e-02, 3.15703988e-01, 4.80694652e-01,\n",
1415 | " -1.12627077e+00, -1.09754004e-01, 3.40339035e-01, 2.15341955e-01,\n",
1416 | " 7.08184958e-01, 1.00181863e-01, 4.47505176e-01, -3.32108617e-01,\n",
1417 | " -2.27021277e-01, 1.36014104e-01, 5.07473722e-02, 2.79215336e-01,\n",
1418 | " 2.51521587e-01, 6.26080036e-01, 1.79235369e-01, -1.08090714e-01,\n",
1419 | " -1.47943527e-01, 1.84664428e-01, -2.22321134e-02, 1.23993143e-01,\n",
1420 | " -3.32192481e-02, 1.96175575e-01, 2.32960790e-01, 2.90090650e-01,\n",
1421 | " 9.78291109e-02, 2.73019075e-01, -3.28467250e-01, -4.99838442e-02,\n",
1422 | " 4.32633579e-01, -1.28734767e-01, -1.34447455e+00, -6.02802262e-02,\n",
1423 | " -2.82504678e-01, 2.45649427e-01, 1.57594755e-01, -7.11214066e-01,\n",
1424 | " 1.06716406e+00, 4.51241821e-01, -4.64118809e-01, 4.04723436e-02,\n",
1425 | " 4.36342001e-01, 1.49244583e+00, -2.46528089e-01, 3.24444413e-01,\n",
1426 | " 7.68179372e-02, -4.70080376e-01, -4.39414084e-02, 3.61224532e-01,\n",
1427 | " 1.85811996e-01, 4.13785130e-03, -1.81902146e+00, 1.03456140e-01,\n",
1428 | " 6.63369775e-01, 4.85309541e-01, -1.09832084e+00, 2.99086496e-02,\n",
1429 | " -2.99130976e-01, -2.31902108e-01, -6.27931580e-02, 4.13477361e-01,\n",
1430 | " 1.16714031e-01, 2.04963908e-01, -2.44405329e-01, -1.07766517e-01,\n",
1431 | " -2.28581727e-01, -3.05053592e-02, 1.29239663e-01, -5.52971661e-01,\n",
1432 | " 5.72551787e-01, -1.66640282e+00, -1.40449420e-01, 1.31966174e-01,\n",
1433 | " 4.61150140e-01, -3.04494649e-01, -5.94246462e-02, 2.94334978e-01,\n",
1434 | " 1.62373948e+00, -4.83049870e-01, -1.15044087e-01, -2.62384027e-01,\n",
1435 | " -1.55561447e-01, 1.65369194e-02, 1.33945167e-01, 2.17812240e-01,\n",
1436 | " -3.16705972e-01, -3.57981831e-01, -4.04067002e-02, -2.55170673e-01,\n",
1437 | " 2.52302498e-01, 1.38146877e-01, -1.92023411e-01, 1.11707054e-01,\n",
1438 | " -1.41912505e-01, -1.69336021e+00, 3.43797922e-01, 3.56099188e-01,\n",
1439 | " 1.16415247e-01, 5.19578904e-02, 1.67370355e+00, -8.39106590e-02,\n",
1440 | " 4.42560881e-01, 8.83387327e-02, 2.33116448e-01, 3.40943784e-01,\n",
1441 | " -1.12218529e-01, 3.71993333e-02, -1.18298024e-01, 2.46531010e-01,\n",
1442 | " 5.63015282e-01, -4.80919152e-01, -2.44298130e-01, 1.62026346e-01,\n",
1443 | " 4.00632054e-01, -1.09600008e-01, -1.12734988e-01, 3.44706297e-01,\n",
1444 | " -7.72474945e-01, 7.56983906e-02, 2.94565797e-01, -4.01974142e-01,\n",
1445 | " 2.67698914e-01, -1.51735395e-02, 1.56277165e-01, -4.56883430e-01,\n",
1446 | " 2.43612036e-01, 2.26460785e-01, 1.11711368e-01, 3.08002979e-01,\n",
1447 | " -1.54413319e+00, -2.78150856e-01, -2.94177756e-02, 1.29684055e+00,\n",
1448 | " 2.09367484e-01, -9.56634432e-02, 1.22253895e+00, -1.66500002e-01,\n",
1449 | " 8.60894769e-02, 1.60716608e-01, -5.97515285e-01, 3.64835262e-01,\n",
1450 | " 2.09492847e-01, -4.08062905e-01, -1.27320933e+00, -2.06542468e+00,\n",
1451 | " -2.11794302e-01, 9.25208554e-02, 2.63187975e-01, 2.07841665e-01,\n",
1452 | " 4.35125977e-02, 3.40145886e-01, 1.17807984e-01, -3.16560447e-01,\n",
1453 | " 5.29482812e-02, -1.98397905e-01, -4.27072868e-02, -5.01838438e-02,\n",
1454 | " -1.03334904e-01, 1.30403638e+00, -1.05680496e-01, 2.11387539e+00,\n",
1455 | " -2.78932601e-01, -1.74029738e-01, -1.48544431e-01, 3.32106620e-01,\n",
1456 | " -1.85129374e-01, 1.88460089e-02, -1.45166323e-01, -3.02773684e-01],\n",
1457 | " dtype=float32),\n",
1458 | " array([ 2.76840925e-01, 8.26565027e-01, 9.07764882e-02, -1.50831327e-01,\n",
1459 | " -6.34551585e-01, 6.10323846e-01, 3.97140115e-01, -1.00776479e-02,\n",
1460 | " -6.45264566e-01, 4.22459871e-01, -8.07657242e-01, -5.92913926e-01,\n",
1461 | " -4.75562960e-01, 2.22507417e-01, 1.83781922e-01, -6.24561012e-01,\n",
1462 | " -3.49579811e-01, 8.73854756e-02, -4.15823460e-01, 6.43006444e-01,\n",
1463 | " 3.47611904e-01, -1.51468739e-01, 7.03401864e-04, -1.27931923e-01,\n",
1464 | " 5.79305947e-01, -7.44373322e-01, -1.93675160e-01, 2.22524792e-01,\n",
1465 | " 5.86300492e-01, -1.79958761e-01, 5.16699076e-01, 4.03794795e-02,\n",
1466 | " -6.57347217e-02, -5.08569002e-01, -5.45851588e-01, -5.60832500e-01,\n",
1467 | " 8.39215592e-02, 7.73327351e-01, -6.76578656e-02, -3.77435863e-01,\n",
1468 | " -1.44090988e-02, 2.21318886e-01, -1.87255934e-01, -9.98707712e-02,\n",
1469 | " 3.48694354e-01, -4.69148725e-01, 6.75562024e-01, 1.86995074e-01,\n",
1470 | " 3.36505175e-01, -1.17399938e-01, -3.43767345e-01, -2.32344866e-01,\n",
1471 | " 5.58337212e-01, -3.32930386e-01, -1.42446637e-01, -1.67436361e-01,\n",
1472 | " 1.74312398e-01, -1.67070940e-01, -2.54285008e-01, 2.35959832e-02,\n",
1473 | " -5.12498319e-01, -1.77699074e-01, 1.28490105e-01, 7.30292737e-01,\n",
1474 | " 2.13838533e-01, 3.37158620e-01, -4.47189242e-01, 8.90575767e-01,\n",
1475 | " -1.23945497e-01, 2.51982994e-02, 2.37684131e-01, -6.03702068e-02,\n",
1476 | " -3.92875761e-01, -8.27128649e-01, 9.60737988e-02, 9.37794000e-02,\n",
1477 | " -3.25616956e-01, 3.71967912e-01, -4.72943455e-01, 2.85212129e-01,\n",
1478 | " 2.88874865e-01, 3.07326049e-01, 1.13291092e-01, 8.42854321e-01,\n",
1479 | " -1.97290570e-01, -5.22886097e-01, -9.34460938e-01, 3.35298449e-01,\n",
1480 | " 1.37902901e-01, 4.28579897e-01, 5.54107130e-04, -1.04619473e-01,\n",
1481 | " -5.04912496e-01, -3.88676077e-01, -1.02410428e-01, 5.50666809e-01,\n",
1482 | " -1.91893429e-01, -1.78403199e-01, 1.84441984e-01, -2.60585874e-01,\n",
1483 | " 1.18960977e+00, 4.59139705e-01, 2.08614141e-01, 1.05013363e-01,\n",
1484 | " 4.42037404e-01, -3.23232532e-01, -7.67949224e-01, 5.72599709e-01,\n",
1485 | " -2.72664465e-02, -8.97544697e-02, 2.73619294e-01, -5.13183713e-01,\n",
1486 | " -1.47768781e-02, -9.78036821e-02, -3.66084695e-01, 1.64795011e-01,\n",
1487 | " -6.91594303e-01, -4.93507385e-01, -3.65763977e-02, -3.13233584e-01,\n",
1488 | " 6.20015800e-01, -2.15986282e-01, -6.75275683e-01, -2.87165463e-01,\n",
1489 | " -2.55224437e-01, -4.71210808e-01, -6.07174397e-01, 5.19710422e-01,\n",
1490 | " -2.75853395e-01, 2.59882450e-01, -2.92171836e-01, -1.72849536e-01,\n",
1491 | " 4.25588965e-01, -5.18059552e-01, -1.29473180e-01, -1.44534975e-01,\n",
1492 | " -3.67147088e-01, -5.47776937e-01, -1.94942519e-01, 3.59415263e-03,\n",
1493 | " -1.38248533e-01, 3.69243026e-02, 1.46158755e-01, -3.76326591e-01,\n",
1494 | " -4.68575686e-01, -1.27162173e-01, 4.90764230e-01, -3.61120701e-03,\n",
1495 | " -4.05050665e-01, 7.16862142e-01, -1.62370205e-01, 1.45982653e-01,\n",
1496 | " 1.32606953e-01, -5.16272783e-01, -2.25130439e-01, -7.62912333e-02,\n",
1497 | " -7.15014040e-01, 4.63619262e-01, -9.10431705e-03, -1.74510732e-01,\n",
1498 | " -1.58363938e-01, 8.39608461e-02, -3.83243084e-01, 2.39158228e-01,\n",
1499 | " -3.95091951e-01, -2.88085133e-01, -4.62105691e-01, 4.23221558e-01,\n",
1500 | " -5.09737909e-01, 3.98316622e-01, 4.21871431e-02, 7.52901435e-02,\n",
1501 | " 3.84747952e-01, 1.47277191e-02, -1.51112840e-01, 3.84985089e-01,\n",
1502 | " -4.32581782e-01, 5.52707255e-01, -3.57496515e-02, 2.62063533e-01,\n",
1503 | " -4.89295244e-01, 5.74415624e-01, 3.64445269e-01, 4.28482264e-01,\n",
1504 | " -9.01821628e-02, 5.04856348e-01, 3.83304358e-01, 1.63643792e-01,\n",
1505 | " -3.36065024e-01, -1.26045555e-01, 5.65065086e-01, -1.95936143e-01,\n",
1506 | " -1.35561243e-01, 1.60098895e-02, -4.71215159e-01, 1.17438227e-01,\n",
1507 | " 6.58761188e-02, 1.39416113e-01, -2.85199314e-01, -7.74468541e-01,\n",
1508 | " -5.18814139e-02, -7.03027993e-02, -3.92293751e-01, 3.98168236e-01,\n",
1509 | " 3.28498483e-01, -6.54256761e-01, -1.81806594e-01, 7.50007212e-01,\n",
1510 | " -1.75041314e-02, -1.75579280e-01, 4.16870832e-01, 8.09306502e-02,\n",
1511 | " -6.43741548e-01, -6.76393449e-01, 9.76293802e-01, -4.92420822e-01,\n",
1512 | " 1.02614427e+00, 5.89634106e-02, 4.79298562e-01, 2.17192143e-01,\n",
1513 | " -2.72959143e-01, 1.14002660e-01, -2.36779660e-01, 1.84743851e-01,\n",
1514 | " -9.25041288e-02, 1.21317118e-01, -8.55241477e-01, -1.06928244e-01,\n",
1515 | " -2.43034482e-01, -2.77067870e-01, 5.39215542e-02, -1.51094660e-01,\n",
1516 | " -1.66546777e-01, -4.45110142e-01, -1.41027048e-02, -9.44142044e-03,\n",
1517 | " -2.35394657e-01, -5.20308077e-01, -2.43284419e-01, -3.06548625e-01,\n",
1518 | " -1.61093920e-01, 2.17013136e-02, 3.90323281e-01, 6.97779775e-01,\n",
1519 | " -3.29184026e-01, 4.51146245e-01, -1.24973223e-01, 5.86056590e-01,\n",
1520 | " 4.34757829e-01, -3.29620123e-01, 1.32997274e-01, -9.33215618e-01,\n",
1521 | " 1.59078345e-01, 1.71741679e-01, 6.28198147e-01, 3.62409949e-01,\n",
1522 | " 4.94185925e-01, 4.07402188e-01, -4.28444445e-01, 6.80085719e-02,\n",
1523 | " 4.11365747e-01, -2.57630974e-01, 3.89243126e-01, -9.62318778e-02,\n",
1524 | " -7.28507221e-01, -5.68340644e-02, 1.62404522e-01, -2.61736155e-01,\n",
1525 | " 3.58443737e-01, -2.94490129e-01, -2.77685463e-01, 2.26352811e-01,\n",
1526 | " -7.01522902e-02, 2.17956036e-01, -5.76574922e-01, -6.76278397e-02,\n",
1527 | " 5.40413976e-01, -2.24588946e-01, 1.90195262e-01, -9.57252830e-02,\n",
1528 | " -5.32542944e-01, -3.60040665e-01, 5.65953076e-01, 6.45993352e-01,\n",
1529 | " -1.33299544e-01, 1.47439493e-02, -5.20115346e-02, -6.19270325e-01,\n",
1530 | " 6.13151610e-01, -2.98626214e-01, 2.87942976e-01, -3.71322721e-01,\n",
1531 | " 5.38482010e-01, -4.82835799e-01, -7.15893060e-02, -2.43481338e-01,\n",
1532 | " -2.68257201e-01, 1.88601077e-01, 1.67941257e-01, 2.40665004e-02,\n",
1533 | " 5.74061513e-01, 3.06341976e-01, -4.11236249e-02, -3.27600002e-01,\n",
1534 | " -2.57163614e-01, 1.90908790e-01, 5.69362342e-02, 2.26930857e-01,\n",
1535 | " -3.03816229e-01, 2.65539050e-01, 5.38473427e-01, 7.64161423e-02,\n",
1536 | " -2.41635010e-01, 2.61448503e-01, -1.16281199e+00, 2.90542573e-01,\n",
1537 | " -3.95415984e-02, -5.81672907e-01, 5.77206872e-02, 2.95678496e-01,\n",
1538 | " -9.22688246e-02, -2.39181519e-01, 2.60175705e-01, 2.48556826e-02,\n",
1539 | " -8.83583203e-02, 3.87453645e-01, 2.58662075e-01, -1.07055640e+00,\n",
1540 | " -9.40756574e-02, 1.69061646e-02, 4.87125993e-01, -5.75047135e-02,\n",
1541 | " -1.32935971e-01, 1.86964378e-01, 3.83485109e-04, 8.61733258e-02,\n",
1542 | " 2.25000799e-01, -1.49061039e-01, -1.23562358e-01, -4.27220911e-02,\n",
1543 | " -1.51235417e-01, 1.06609017e-02, -4.82833087e-01, -2.75613129e-01,\n",
1544 | " -1.55519038e-01, -4.79521513e-01, -4.59499866e-01, 3.90237331e-01,\n",
1545 | " -5.58014572e-01, 2.18323976e-01, -3.70846659e-01, -1.79571033e-01,\n",
1546 | " -6.82328790e-02, -3.65593940e-01, 1.78819627e-01, 4.80528176e-02,\n",
1547 | " -2.44904310e-01, 5.67843914e-01, 5.79010010e-01, -1.56230539e-01,\n",
1548 | " 3.37402791e-01, 7.60124475e-02, 1.53991506e-01, 2.69366235e-01,\n",
1549 | " 8.30932409e-02, -3.16670150e-01, 1.52292758e-01, 1.10690880e+00,\n",
1550 | " 1.70462534e-01, -2.39071757e-01, -5.64445108e-02, -3.39342654e-01,\n",
1551 | " -1.43847376e-01, 5.02909124e-01, -3.12874019e-02, 9.44376141e-02,\n",
1552 | " 8.18557262e-01, 4.71902877e-01, -4.04975265e-02, 9.20465589e-02,\n",
1553 | " 3.22986841e-01, 2.18848631e-01, 3.94165628e-02, -1.20331392e-01,\n",
1554 | " 3.85783911e-02, 4.40358929e-02, -2.30434299e-01, -6.97801635e-02,\n",
1555 | " 1.30828425e-01, -9.52604190e-02, 3.37484509e-01, 1.23802461e-01,\n",
1556 | " 1.32070839e-01, -7.92660192e-03, 1.32292002e-01, -3.01257938e-01,\n",
1557 | " 2.49189928e-01, 1.16309941e-01, -2.50121862e-01, 2.63739735e-01,\n",
1558 | " -1.61772236e-01, 2.19450206e-01, -6.20333612e-01, 7.33659714e-02,\n",
1559 | " 1.07449546e-01, -8.04409161e-02, 8.54642749e-01, 2.25090206e-01,\n",
1560 | " 6.72243953e-01, -1.66301087e-01, 2.10834548e-01, 5.05019605e-01,\n",
1561 | " 1.26398206e-01, -1.43682480e-01, -3.31898779e-01, 1.14536852e-01,\n",
1562 | " -2.30645597e-01, 4.73025814e-02, -1.07875749e-01, -1.72144145e-01,\n",
1563 | " -5.17711230e-02, 5.70954561e-01, -8.09718072e-01, 8.00146312e-02,\n",
1564 | " 5.16867712e-02, -2.72809356e-01, -2.84016840e-02, -1.19916223e-01,\n",
1565 | " 3.41239154e-01, 1.11662716e-01, 1.66919865e-02, 5.52908033e-02,\n",
1566 | " 6.20945692e-01, -3.48970354e-01, -2.35599458e-01, 5.48840940e-01,\n",
1567 | " 9.64760855e-02, -3.98310162e-02, -2.71340668e-01, 5.15167832e-01,\n",
1568 | " 1.41690895e-01, -1.43796531e-02, -4.97483909e-02, -3.21184814e-01,\n",
1569 | " 1.87300742e-01, -6.18926287e-01, 7.87684143e-01, -1.94054589e-01,\n",
1570 | " -7.76575029e-01, 4.46316510e-01, -3.18624437e-01, 5.10790229e-01,\n",
1571 | " 1.06287435e-01, 3.07366312e-01, 3.28234166e-01, 4.54727113e-01,\n",
1572 | " -3.33646864e-01, 1.19129881e-01, 3.97487879e-01, 7.51769006e-01,\n",
1573 | " 3.20967078e-01, -3.73383671e-01, 5.18961966e-01, 6.60418198e-02,\n",
1574 | " -4.74335998e-03, 2.64523298e-01, -2.13951319e-01, -1.19847864e-01,\n",
1575 | " 5.07644713e-01, -2.59853929e-01, 7.63416231e-01, -3.07736218e-01,\n",
1576 | " 2.15829849e-01, 3.37613165e-01, 6.54141456e-02, -7.23732769e-01,\n",
1577 | " 4.50283915e-01, -1.52957305e-01, 1.99199438e-01, 2.92792946e-01,\n",
1578 | " -2.69260675e-01, 2.86223218e-02, 3.98590088e-01, 4.09100741e-01,\n",
1579 | " -1.16927907e-01, 1.85440794e-01, 3.95636320e-01, -8.28807876e-02,\n",
1580 | " -2.30106711e-01, 1.76303089e-01, -6.61098719e-01, -3.41480643e-01,\n",
1581 | " 5.16563296e-01, 3.02324817e-02, 5.97619891e-01, 6.41855538e-01,\n",
1582 | " 1.46834910e-01, -2.11373456e-02, -1.11692272e-01, -1.08284995e-01,\n",
1583 | " 5.70352674e-01, 3.46855015e-01, -6.93482101e-01, -2.88041793e-02,\n",
1584 | " 4.16417003e-01, -1.39435589e-01, -7.61784688e-02, -6.61546171e-01,\n",
1585 | " 2.46568948e-01, 1.22470096e-01, 4.17300224e-01, 4.40982610e-01,\n",
1586 | " -4.16393697e-01, 3.16070586e-01, -5.00988066e-01, -5.51797450e-01,\n",
1587 | " -2.90869355e-01, -1.30978793e-01, 5.02031505e-01, -5.04513204e-01,\n",
1588 | " 6.68973207e-01, -5.41653335e-01, -2.41655484e-02, -5.36852598e-01,\n",
1589 | " -3.76026601e-01, 3.21112536e-02, -6.04872227e-01, 2.71088123e-01,\n",
1590 | " 3.00134480e-01, -1.16666459e-01, -3.37054133e-01, -9.05266851e-02,\n",
1591 | " 3.84772986e-01, -5.64304352e-01, 7.90870488e-01, -5.95813990e-03,\n",
1592 | " -1.49497315e-01, -3.63787830e-01, 1.39228120e-01, -4.25898105e-01,\n",
1593 | " -1.98030308e-01, -2.07325965e-02, -2.93137506e-02, -1.99729592e-01,\n",
1594 | " -2.48851925e-01, 2.83485353e-01, -4.01313961e-01, 1.47738248e-01,\n",
1595 | " 8.44463557e-02, 3.18531126e-01, -2.31741637e-01, 4.15888965e-01,\n",
1596 | " 4.02550042e-01, -4.48794365e-01, 2.86729485e-01, -4.48561072e-01,\n",
1597 | " -7.52085268e-01, -2.37673670e-02, -5.68502486e-01, -3.53369743e-01,\n",
1598 | " 8.89680386e-02, 4.31340665e-01, 2.89432585e-01, 3.15292716e-01,\n",
1599 | " 3.59115094e-01, -5.57349883e-02, -1.58318490e-01, -1.10178508e-01,\n",
1600 | " -1.12563670e-02, -5.91831505e-02, 2.87963837e-01, -2.05741182e-01,\n",
1601 | " -3.57460201e-01, -2.66833548e-02, 4.38757360e-01, -3.23295593e-01,\n",
1602 | " -2.90090814e-02, 1.68381155e-01, 6.80677071e-02, -3.36379141e-01,\n",
1603 | " -2.87641376e-01, -2.28633702e-01, 2.06889778e-01, 5.51556274e-02,\n",
1604 | " 5.15698493e-01, -1.29872516e-01, -3.13996375e-01, -4.22227383e-01,\n",
1605 | " 2.43226588e-01, -4.99812931e-01, -4.01831090e-01, 2.02137232e-01,\n",
1606 | " 1.97118297e-01, -1.81076437e-01, -1.05943531e-01, 2.20869422e-01,\n",
1607 | " 9.84147966e-01, 1.15650997e-01, 3.56579483e-01, 1.66290596e-01,\n",
1608 | " -4.24213171e-01, 2.40331993e-01, 7.08245456e-01, 3.49749982e-01,\n",
1609 | " -8.52970719e-01, 2.20732927e-01, 5.02041399e-01, -4.67942953e-02,\n",
1610 | " 1.19512536e-01, -6.99503481e-01, -7.85716116e-01, 2.36024842e-01,\n",
1611 | " 7.53568053e-01, 4.72140402e-01, 2.73438036e-01, -6.52500391e-02,\n",
1612 | " 4.59974319e-01, -1.68109506e-01, 1.81992680e-01, 3.37684005e-01,\n",
1613 | " 9.74726006e-02, 3.61280516e-02, 9.69141498e-02, -2.83546716e-01,\n",
1614 | " -9.33192000e-02, 1.45069718e-01, 3.41943800e-01, -5.25634468e-01,\n",
1615 | " 5.05060971e-01, 2.83408999e-01, 2.76972383e-01, 2.21857071e-01,\n",
1616 | " -3.25723030e-02, 5.63406572e-02, 6.07237756e-01, -6.95867181e-01,\n",
1617 | " 1.00567728e-01, 4.74537611e-01, 3.37903649e-01, -6.83513358e-02,\n",
1618 | " 4.02592719e-01, 6.97898567e-02, -4.13532823e-01, 5.76120138e-01,\n",
1619 | " 1.16270646e-01, 1.84804901e-01, 6.81304455e-01, 1.40475839e-01,\n",
1620 | " 2.51083404e-01, 1.82496279e-01, -3.27882916e-01, 1.71544284e-01,\n",
1621 | " 3.30828249e-01, 1.81298435e-01, 1.49514273e-01, -7.48686135e-01,\n",
1622 | " -8.86624083e-02, 2.94715576e-02, 2.19254240e-01, 4.46581870e-01,\n",
1623 | " 5.85817061e-02, 3.81266117e-01, 3.03337276e-01, -1.88793212e-01,\n",
1624 | " 3.14181261e-02, -5.68980873e-01, 5.44854403e-02, -7.74939805e-02,\n",
1625 | " -6.14694118e-01, 1.78257316e-01, -1.85588390e-01, -4.92249668e-01,\n",
1626 | " 4.82140332e-01, -8.45473826e-01, 1.90706059e-01, -2.74542332e-01,\n",
1627 | " -4.25427765e-01, -2.88292974e-01, -3.55689317e-01, -5.32729745e-01,\n",
1628 | " -3.04560382e-02, 1.46151841e-01, 1.55448437e-01, -3.66787940e-01,\n",
1629 | " 2.29660094e-01, -3.51501614e-01, -8.73198062e-02, 7.02636614e-02,\n",
1630 | " -3.88192892e-01, -4.43718612e-01, 6.04313254e-01, 5.10663837e-02,\n",
1631 | " 7.68236071e-02, 9.08105671e-02, 9.57990587e-01, 5.30025721e-01,\n",
1632 | " 3.75556648e-01, 1.13459659e+00, -1.48903561e+00, -3.27929080e-01,\n",
1633 | " 9.01361585e-01, 1.15407608e-01, -3.78036529e-01, 3.17056119e-01,\n",
1634 | " -3.79730493e-01, -6.74803555e-03, 1.52334422e-01, -8.61858577e-03,\n",
1635 | " 1.97943777e-01, -3.30416262e-02, 1.63521200e-01, -3.81670803e-01,\n",
1636 | " -1.01735607e-01, -4.94871855e-01, 3.22815329e-01, 9.53062177e-01,\n",
1637 | " -3.83967578e-01, 1.51089519e-01, -2.36456066e-01, -5.47379553e-02,\n",
1638 | " 1.10694516e+00, -8.35253000e-02, -6.29568249e-02, -1.24091260e-01,\n",
1639 | " 9.29023921e-02, -9.71774012e-03, -1.10696584e-01, 1.99686259e-01,\n",
1640 | " -3.43343556e-01, -5.87700456e-02, -2.36133710e-01, -9.13501531e-02,\n",
1641 | " -4.20142919e-01, 2.39062995e-01, 7.18532681e-01, 5.91429651e-01,\n",
1642 | " 7.84924507e-01, -1.17374226e-01, -4.19736087e-01, -9.57143456e-02,\n",
1643 | " -4.41571586e-02, -2.72485793e-01, 4.47263211e-01, 2.00161874e-01,\n",
1644 | " 2.01198429e-01, -1.56416342e-01, -4.52620797e-02, 6.01006746e-01,\n",
1645 | " -3.38032663e-01, 7.62699664e-01, -3.83426249e-01, -1.60245538e-01,\n",
1646 | " -5.73046982e-01, -5.12821972e-01, 8.31243396e-02, -3.98193419e-01,\n",
1647 | " 1.16693616e-01, 3.55675668e-02, 7.60468543e-01, 5.14936745e-02,\n",
1648 | " -4.25191104e-01, -1.98021621e-01, -6.02186657e-02, -2.98234463e-01,\n",
1649 | " -2.55434394e-01, 6.27072603e-02, -5.09121060e-01, -6.11817837e-01],\n",
1650 | " dtype=float32)])]"
1651 | ]
1652 | },
1653 | "execution_count": 11,
1654 | "metadata": {},
1655 | "output_type": "execute_result"
1656 | }
1657 | ],
1658 | "source": [
1659 |     "result  # inspect the structure: result[0][1][0] is the vector for “今天” (“today”)"
1660 | ]
1661 | },
1662 | {
1663 | "cell_type": "code",
1664 | "execution_count": 91,
1665 | "metadata": {},
1666 | "outputs": [],
1667 | "source": [
1668 |     "# Vectorize the comment data with the BERT model\n",
1669 |     "vec_len = result[0][1][0].shape[0]\n",
1670 |     "\n",
1671 |     "# Average the token embeddings BERT returns for a single word\n",
1672 |     "def process_word(w):\n",
1673 |     "    vec_com = np.zeros(vec_len)\n",
1674 |     "    res = embedding([w])\n",
1675 |     "    k = len(res[0][0])  # number of tokens BERT split the word into\n",
1676 |     "    for i in range(k):\n",
1677 |     "        vec_com += res[0][1][i]\n",
1678 |     "    return vec_com / k\n",
1679 |     "\n",
1680 |     "# Sentence vector = average of its word vectors; input is an iterable of words\n",
1681 |     "def comm_vec(c):\n",
1682 |     "    vec_com = np.zeros(vec_len)\n",
1683 |     "    coun = 0\n",
1684 |     "    for w in c:\n",
1685 |     "        if w in model:  # only count words in the word2vec vocabulary\n",
1686 |     "            vec_com += process_word(w)\n",
1687 |     "            coun += 1\n",
1688 |     "    # guard against comments with no in-vocabulary words, which would otherwise produce NaN\n",
1689 |     "    return vec_com / coun if coun else vec_com\n"
1690 | ]
1691 | },
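The averaging scheme in `comm_vec` above can be illustrated standalone. A minimal sketch with a toy vocabulary (the names `vocab` and `sentence_vec` and the 4-dimensional vectors are made up for illustration); note the guard that returns zeros instead of NaN when no word is in the vocabulary:

```python
import numpy as np

# Toy 4-dimensional "word vectors" standing in for the notebook's embeddings
vocab = {
    "good": np.array([1.0, 0.0, 1.0, 0.0]),
    "movie": np.array([0.0, 1.0, 0.0, 1.0]),
}

def sentence_vec(words, vocab, dim=4):
    """Average the vectors of in-vocabulary words; zeros if none match."""
    vec, count = np.zeros(dim), 0
    for w in words:
        if w in vocab:  # skip out-of-vocabulary words
            vec += vocab[w]
            count += 1
    return vec / count if count else vec  # avoid dividing by zero

print(sentence_vec(["good", "movie", "unknown"], vocab))  # [0.5 0.5 0.5 0.5]
print(sentence_vec(["unknown"], vocab))                   # [0. 0. 0. 0.]
```

Without the guard, an all-out-of-vocabulary comment yields `0/0 = NaN`, which is exactly what the later `np.isnan(...)` fix-up cells have to repair.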
1692 | {
1693 | "cell_type": "code",
1694 | "execution_count": 94,
1695 | "metadata": {
1696 | "scrolled": true
1697 | },
1698 | "outputs": [
1699 | {
1700 | "name": "stderr",
1701 | "output_type": "stream",
1702 | "text": [
1703 | "\n",
1781 | "apply: 100%|███████████████| 80/80 [05:19<00:00, 4.00s/it]\u001b[A\n",
1782 | "\n",
1801 | "apply: 100%|███████████████| 20/20 [01:29<00:00, 4.48s/it]\u001b[A\n"
1802 | ]
1803 | }
1804 | ],
1805 | "source": [
1806 | "bert_train = np.vstack(x_train.progress_apply(comm_vec))\n",
1807 | "bert_test = np.vstack(x_test.progress_apply(comm_vec))"
1808 | ]
1809 | },
1810 | {
1811 | "cell_type": "markdown",
1812 | "metadata": {},
1813 | "source": [
1814 |     "### 4. Model training and evaluation\n",
1815 |     "For each of the three vector representations above, train a logistic-regression model:\n",
1816 |     "- build the model\n",
1817 |     "- train it (with cross-validation)\n",
1818 |     "- report the best result"
1819 | ]
1820 | },
1821 | {
1822 | "cell_type": "markdown",
1823 | "metadata": {},
1824 | "source": [
1825 |     "#### 4.1 Training with word2vec"
1826 | ]
1827 | },
1828 | {
1829 | "cell_type": "code",
1830 | "execution_count": 39,
1831 | "metadata": {},
1832 | "outputs": [],
1833 | "source": [
1834 | "from sklearn.linear_model import LogisticRegression"
1835 | ]
1836 | },
1837 | {
1838 | "cell_type": "code",
1839 | "execution_count": 54,
1840 | "metadata": {},
1841 | "outputs": [
1842 | {
1843 | "name": "stdout",
1844 | "output_type": "stream",
1845 | "text": [
1846 |     "Best Score:0.8875\n",
1847 |     "Best Parameters:{'C': 10, 'penalty': 'l1', 'solver': 'saga'}\n"
1848 | ]
1849 | },
1850 | {
1851 | "name": "stderr",
1852 | "output_type": "stream",
1853 | "text": [
1854 | "D:\\Downloads\\Anaconda\\envs\\NLP_class\\lib\\site-packages\\sklearn\\linear_model\\_sag.py:330: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge\n",
1855 | " \"the coef_ did not converge\", ConvergenceWarning)\n"
1856 | ]
1857 | }
1858 | ],
1859 | "source": [
1860 |     "# Use GridSearchCV to find the best parameter combination\n",
1861 |     "from sklearn.model_selection import GridSearchCV  # grid search over parameter combinations\n",
1862 | "parameters = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],'penalty':['l1','l2'], 'solver':['liblinear','lbfgs','sag','saga'] }\n",
1863 | "LR = LogisticRegression()\n",
1864 | "LR_grid_search = GridSearchCV(estimator = LR,\n",
1865 | " param_grid = parameters,\n",
1866 | " cv = 5,\n",
1867 | " n_jobs = -1)\n",
1868 | "LR_grid_search.fit(word2vec_train,y_train)\n",
1869 | "\n",
1870 |     "print('Best Score:{}'.format(LR_grid_search.best_score_))\n",
1871 |     "print('Best Parameters:{}'.format(LR_grid_search.best_params_))"
1872 | ]
1873 | },
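One caveat about the grid above: `lbfgs` and `sag` only support the `l2` penalty, so the `l1` combinations fail during the search (recent scikit-learn versions score them as NaN and warn). A sketch of a grid restricted to compatible solver/penalty pairs, shown on synthetic data since the notebook's features are not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# param_grid may be a list of dicts: each solver only sees penalties it supports
param_grid = [
    {"solver": ["liblinear", "saga"], "penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]},
    {"solver": ["lbfgs", "sag"],      "penalty": ["l2"],       "C": [0.01, 0.1, 1, 10]},
]

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
grid = GridSearchCV(LogisticRegression(max_iter=2000), param_grid, cv=3, n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Splitting the grid this way also shortens the search, since invalid combinations are never attempted.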
1874 | {
1875 | "cell_type": "code",
1876 | "execution_count": 74,
1877 | "metadata": {},
1878 | "outputs": [
1879 | {
1880 | "name": "stdout",
1881 | "output_type": "stream",
1882 | "text": [
1883 | "word2vec LR test accuracy 0.9\n",
1884 | "Word2vec LR test F1_score 0.7222222222222222\n"
1885 | ]
1886 | }
1887 | ],
1888 | "source": [
1889 |     "# Evaluate with a hand-picked (non-optimal) parameter set\n",
1890 | "LR_best=LogisticRegression(penalty='l1',C=100,solver='liblinear')\n",
1891 | "LR_best.fit(word2vec_train,y_train)\n",
1892 | "word2vec_y_pred=LR_best.predict(word2vec_test)\n",
1893 | "print('word2vec LR test accuracy %s' % metrics.accuracy_score(y_test,word2vec_y_pred))\n",
1894 | "print('Word2vec LR test F1_score %s' % metrics.f1_score(y_test, word2vec_y_pred,average=\"macro\"))"
1895 | ]
1896 | },
1897 | {
1898 | "cell_type": "code",
1899 | "execution_count": 59,
1900 | "metadata": {
1901 | "scrolled": false
1902 | },
1903 | "outputs": [
1904 | {
1905 | "name": "stdout",
1906 | "output_type": "stream",
1907 | "text": [
1908 | "word2vec LR test accuracy 0.8\n",
1909 | "Word2vec LR test F1_score 0.4444444444444444\n"
1910 | ]
1911 | },
1912 | {
1913 | "name": "stderr",
1914 | "output_type": "stream",
1915 | "text": [
1916 | "D:\\Downloads\\Anaconda\\envs\\NLP_class\\lib\\site-packages\\sklearn\\linear_model\\_sag.py:330: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge\n",
1917 | " \"the coef_ did not converge\", ConvergenceWarning)\n"
1918 | ]
1919 | }
1920 | ],
1921 | "source": [
1922 |     "# Evaluate with the grid-search optimum; surprisingly, it does worse on the test set than the hand-picked parameters above\n",
1923 | "LR_best=LogisticRegression(C= 10, penalty='l1', solver= 'saga')\n",
1924 | "LR_best.fit(word2vec_train,y_train)\n",
1925 | "word2vec_y_pred=LR_best.predict(word2vec_test)\n",
1926 | "print('word2vec LR test accuracy %s' % metrics.accuracy_score(y_test,word2vec_y_pred))\n",
1927 | "print('Word2vec LR test F1_score %s' % metrics.f1_score(y_test, word2vec_y_pred,average=\"macro\"))"
1928 | ]
1929 | },
1930 | {
1931 | "cell_type": "markdown",
1932 | "metadata": {},
1933 | "source": [
1934 |     "#### 4.2 Training with BERT"
1935 | ]
1936 | },
1937 | {
1938 | "cell_type": "code",
1939 | "execution_count": 101,
1940 | "metadata": {},
1941 | "outputs": [],
1942 | "source": [
1943 |     "bert_train[np.isnan(bert_train)] = 0  # zero out NaNs left by empty comments"
1944 | ]
1945 | },
1946 | {
1947 | "cell_type": "code",
1948 | "execution_count": 103,
1949 | "metadata": {},
1950 | "outputs": [
1951 | {
1952 | "name": "stdout",
1953 | "output_type": "stream",
1954 | "text": [
1955 | "Best Score:0.825\n",
1956 | "Best Parameters:{'C': 0.0001, 'penalty': 'l1', 'solver': 'saga'}\n"
1957 | ]
1958 | }
1959 | ],
1960 | "source": [
1961 | "lr = LogisticRegression()\n",
1962 | "\n",
1963 | "parameters = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],'penalty':['l1','l2'], 'solver':['liblinear','lbfgs','sag','saga'] }\n",
1964 | "\n",
1965 | "lr_grid_search = GridSearchCV(estimator = lr,\n",
1966 | " param_grid = parameters,\n",
1967 | " cv = 5,\n",
1968 | " n_jobs = -1)\n",
1969 | "\n",
1970 | "lr_gs = lr_grid_search.fit(bert_train, y_train)\n",
1971 | "lr_best_score = lr_grid_search.best_score_\n",
1972 |     "lr_best_param = lr_grid_search.best_params_\n",
1973 |     "\n",
1974 |     "print('Best Score:{}'.format(lr_best_score))\n",
1975 |     "print('Best Parameters:{}'.format(lr_best_param))"
1976 | ]
1977 | },
1978 | {
1979 | "cell_type": "code",
1980 | "execution_count": 104,
1981 | "metadata": {},
1982 | "outputs": [],
1983 | "source": [
1984 | "bert_test[np.isnan(bert_test)] = 0"
1985 | ]
1986 | },
1987 | {
1988 | "cell_type": "code",
1989 | "execution_count": 105,
1990 | "metadata": {},
1991 | "outputs": [
1992 | {
1993 | "name": "stdout",
1994 | "output_type": "stream",
1995 | "text": [
1996 | "Bert LR test accuracy 0.85\n",
1997 | "Bert LR test F1_score 0.45945945945945943\n"
1998 | ]
1999 | }
2000 | ],
2001 | "source": [
2002 | "lr_best = LogisticRegression(penalty='l1',C=0.0001,solver='saga')\n",
2003 | "lr_best.fit(bert_train, y_train)\n",
2004 | "bert_y_pred = lr_best.predict(bert_test)\n",
2005 | "\n",
2006 | "print('Bert LR test accuracy %s' % metrics.accuracy_score(y_test, bert_y_pred))\n",
2007 |     "# F1 score of the logistic-regression model on the test set\n",
2008 | "print('Bert LR test F1_score %s' % metrics.f1_score(y_test, bert_y_pred,average=\"macro\"))"
2009 | ]
2010 | },
2011 | {
2012 | "cell_type": "markdown",
2013 | "metadata": {},
2014 | "source": [
2015 |     "#### Saving and loading the model"
2016 | ]
2017 | },
2018 | {
2019 | "cell_type": "code",
2020 | "execution_count": 108,
2021 | "metadata": {},
2022 | "outputs": [],
2023 | "source": [
2024 |     "# Persist the BERT-based model; sklearn models are usually saved with pickle.dump(model, f)\n",
2025 | "import pickle\n",
2026 | "with open('bert_lr_star_predict.model','wb') as f:\n",
2027 | " pickle.dump(lr_best,f)"
2028 | ]
2029 | },
2030 | {
2031 | "cell_type": "code",
2032 | "execution_count": 109,
2033 | "metadata": {},
2034 | "outputs": [
2035 | {
2036 | "name": "stdout",
2037 | "output_type": "stream",
2038 | "text": [
2039 | "[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]\n"
2040 | ]
2041 | }
2042 | ],
2043 | "source": [
2044 |     "# Load the model back:\n",
2045 | "with open('bert_lr_star_predict.model', 'rb') as f:\n",
2046 | " lr_best = pickle.load(f)\n",
2047 |     "    # and use it as usual:\n",
2048 | " print(lr_best.predict(bert_test))"
2049 | ]
2050 | },
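The save/load round-trip above can be checked end-to-end. A self-contained sketch on synthetic data; the in-memory `pickle.dumps`/`loads` pair behaves the same as the file-based `dump`/`load` used in the notebook:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

blob = pickle.dumps(model)     # serialize the fitted model to bytes
restored = pickle.loads(blob)  # deserialize; file-based dump/load work the same

# the restored model predicts identically to the original
assert (restored.predict(X) == model.predict(X)).all()
```

Pickle stores the learned attributes (`coef_`, `intercept_`, classes), so no retraining is needed after loading, but the file should only be unpickled in a trusted environment and with a compatible scikit-learn version.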
2051 | {
2052 | "cell_type": "markdown",
2053 | "metadata": {},
2054 | "source": [
2055 |     "### BERT features combined with a Naive Bayes model"
2056 | ]
2057 | },
2058 | {
2059 | "cell_type": "code",
2060 | "execution_count": 106,
2061 | "metadata": {},
2062 | "outputs": [
2063 | {
2064 | "name": "stdout",
2065 | "output_type": "stream",
2066 | "text": [
2067 | "Bert NaiveBayes test accuracy 0.8\n",
2068 | "Bert NaiveBayes test F1_score 0.6875\n"
2069 | ]
2070 | }
2071 | ],
2072 | "source": [
2073 | "from sklearn.naive_bayes import GaussianNB\n",
2074 |     "# Train a Gaussian Naive Bayes model on the BERT features\n",
2075 |     "gnb_bert = GaussianNB().fit(bert_train,y_train)\n",
2076 |     "# Naive Bayes predictions on the test set\n",
2077 |     "test_pred_gnb_bert = gnb_bert.predict(bert_test)\n",
2078 |     "# test-set accuracy of the Naive Bayes model\n",
2079 |     "print('Bert NaiveBayes test accuracy %s' % metrics.accuracy_score(y_test, test_pred_gnb_bert))\n",
2080 |     "# test-set F1 score of the Naive Bayes model\n",
2081 | "print('Bert NaiveBayes test F1_score %s' % metrics.f1_score(y_test, test_pred_gnb_bert,average=\"macro\"))"
2082 | ]
2083 | },
2084 | {
2085 | "cell_type": "markdown",
2086 | "metadata": {},
2087 | "source": [
2088 |     "### 5. Scoring new comments"
2089 | ]
2090 | },
2091 | {
2092 | "cell_type": "code",
2093 | "execution_count": 78,
2094 | "metadata": {},
2095 | "outputs": [
2096 | {
2097 | "name": "stdout",
2098 | "output_type": "stream",
2099 | "text": [
2100 | "复仇者联盟 : 好电影\n",
2101 | "差评差评 : 烂电影\n"
2102 | ]
2103 | }
2104 | ],
2105 | "source": [
2106 |     "# Comment texts to score\n",
2107 | "queries=[\"复仇者联盟\",\"差评差评\"]\n",
2108 | "queries_list=[]\n",
2109 | "for query in queries:\n",
2110 | " words=comment_cut(query)\n",
2111 | " pure_query=rm_stop_word(words)\n",
2112 | " query_vec=get_sentence_vec(pure_query)\n",
2113 | " queries_list.append(query_vec)\n",
2114 | " \n",
2115 | "query_predict=LR_best.predict(np.array(queries_list))\n",
2116 | "labels=[]\n",
2117 | "for i in query_predict:\n",
2118 | " if i==1:\n",
2119 | " labels.append(\"好电影\")\n",
2120 | " elif i==0:\n",
2121 | " labels.append(\"烂电影\")\n",
2122 | "for movie in range(len(queries)):\n",
2123 | " print(queries[movie],\":\",labels[movie])"
2124 | ]
2125 | },
2126 | {
2127 | "cell_type": "code",
2128 | "execution_count": null,
2129 | "metadata": {},
2130 | "outputs": [],
2131 | "source": []
2132 | }
2133 | ],
2134 | "metadata": {
2135 | "kernelspec": {
2136 | "display_name": "Python 3 (ipykernel)",
2137 | "language": "python",
2138 | "name": "python3"
2139 | },
2140 | "language_info": {
2141 | "codemirror_mode": {
2142 | "name": "ipython",
2143 | "version": 3
2144 | },
2145 | "file_extension": ".py",
2146 | "mimetype": "text/x-python",
2147 | "name": "python",
2148 | "nbconvert_exporter": "python",
2149 | "pygments_lexer": "ipython3",
2150 | "version": "3.9.7"
2151 | }
2152 | },
2153 | "nbformat": 4,
2154 | "nbformat_minor": 4
2155 | }
2156 |
--------------------------------------------------------------------------------
/image/example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/XuekaiChen/text_classification/ba7138fa55a2bb0049261f3d6409634256ec16ae/image/example.png
--------------------------------------------------------------------------------
/image/数据样式.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/XuekaiChen/text_classification/ba7138fa55a2bb0049261f3d6409634256ec16ae/image/数据样式.png
--------------------------------------------------------------------------------