├── README.md
├── data
│   ├── deal.py
│   ├── neg.csv
│   ├── neutral.csv
│   └── pos.csv
├── lstm
│   ├── .ipynb_checkpoints
│   │   └── test-checkpoint.ipynb
│   ├── lstm_test.py
│   ├── lstm_train.py
│   └── test.ipynb
├── model
│   ├── Word2vec_model.pkl
│   ├── lstm.h5
│   └── lstm.yml
└── requirements.txt

/README.md:
--------------------------------------------------------------------------------
1 | ## LSTM-Based Three-Class Text Sentiment Analysis
2 | 
3 | ### Background
4 | 
5 | Text sentiment analysis is a common NLP task with high practical value. This article uses an LSTM model to train a classifier that can recognize three sentiments in text: positive, neutral, and negative.
6 | 
7 | The goal here is to get familiar with using an LSTM for sentiment analysis quickly, so the model described is only a baseline, and its strengths and weaknesses are analyzed at the end. For real text sentiment analysis a lot more can be built on top of this model; the author may come back to optimize it when time allows.
8 | 
9 | ### Theory
10 | 
11 | #### Where RNNs are used
12 | 
13 | Compared with traditional neural networks, an RNN lets us operate on sequences of vectors: sequences in the input, in the output, or in both. In the figure below, each rectangle is a vector and each arrow is a function (for example a matrix multiplication). Input vectors are red, output vectors are blue, and the green rectangles hold the RNN state (described in detail below). From left to right: (1) the vanilla model without an RNN, mapping a fixed-size input to a fixed-size output (e.g., image classification); (2) sequence output (e.g., image captioning: one image in, a sequence of words out); (3) sequence input (e.g., sentiment analysis: a piece of text in, classified as positive or negative); (4) sequence input and sequence output (e.g., machine translation: an RNN reads an English sentence and outputs it in French); (5) synchronized sequence input and output (e.g., video classification, labeling every frame). Note that in none of these cases is the sequence length constrained in advance, because the recurrent transformation (green) is fixed and can be applied as many times as we like.
14 | 
15 | ![](http://7xritj.com1.z0.glb.clouddn.com/16-5-25/61310852.jpg)
16 | 
17 | 
18 | #### The word2vec algorithm
19 | 
20 | The most important step in modeling is feature extraction, and natural language processing is no exception. The core question in NLP is: how do we represent a sentence effectively with numbers? Once that is solved, classifying sentences is no longer a problem. An obvious first idea is to give every word a unique number 1, 2, 3, 4, ... and treat a sentence as a collection of those numbers. For example, if 1, 2, 3, 4 stand for "我" (I), "你" (you), "爱" (love), "恨" (hate), then "我爱你" (I love you) becomes [1, 3, 2] and "我恨你" (I hate you) becomes [1, 4, 2]. This looks workable but is actually deeply flawed: a stable model will treat 3 and 4 as close to each other, so [1, 3, 2] and [1, 4, 2] should receive similar classifications, yet under our numbering 3 and 4 stand for words with opposite meanings, so the classifications cannot possibly be the same. This kind of encoding therefore cannot give good results.
21 | 
22 | You might think: why not simply group words with similar meanings together and give them nearby numbers? Indeed, if there were a way to assign similar words nearby numbers, model accuracy would improve considerably. But here is the problem: giving each word a single unique number, and making similar words' numbers close, implicitly assumes that semantics are one-dimensional — that meaning lies along a single axis. That is not the case; meaning is multi-dimensional.
23 | 
24 | For example, "家园" (home) may remind some people of the near-synonym "家庭" (family), and "家庭" in turn of "亲人" (relatives) — all words with related meanings. Others may go from "家园" to "地球" (Earth), and from "地球" to "火星" (Mars). In other words, both "亲人" and "火星" can be seen as second-order neighbours of "家园", yet "亲人" and "火星" themselves have no obvious connection. Likewise, semantically "大学" (university) and "舒适" (comfort) can also be regarded as second-order neighbours of "家园". Clearly, with a single scalar number it is very hard to place all of these words in suitable positions.
25 | 
26 | ![](http://kexue.fm/usr/uploads/2015/08/1893427039.png)
27 | 
28 | **Word2Vec: going high-dimensional**
29 | 
30 | From the discussion above, the meanings of many words fan out in several directions rather than along a single one, so one number per word is not ideal. What about several numbers — in other words, mapping each word to a multi-dimensional vector? Exactly; that is the right idea.
31 | 
32 | Why do multi-dimensional vectors work? First, they solve the many-directions problem: even a two-dimensional vector can already rotate through a full 360 degrees, let alone higher dimensions (a few hundred in practice). Second, there is a practical benefit: vectors let us represent words with numbers that vary over a much smaller range. What does that mean? In Chinese alone there are several hundred thousand words; with unique scalar ids the values would range from 1 up to several hundred thousand, and with such a large range it is hard to keep a model stable. With a higher-dimensional vector — say 20 dimensions — 0s and 1s alone can already represent 2^20 = 1,048,576 (about a million) words. A smaller range of values keeps the model stable.
33 | 
34 | So much for the preamble — we still have not reached the main point. The idea is there; the question is how to place these words at the right spots in a high-dimensional space, and, crucially, how to do it without any background knowledge of the language. (In other words, to work on English-language tasks I should not need to learn English first, only to collect a large amount of English text — how convenient!) We will not expand on the theory here; instead we introduce a well-known open-source tool from Google built on exactly this idea: Word2Vec.
35 | 
36 | In short, **Word2Vec does exactly what we described above — it represents words with high-dimensional vectors (word vectors, i.e. word embeddings)** and places words with similar meanings at nearby positions, using real-valued (not just integer) vectors. All we need is a large corpus in the target language; training on it yields the word vectors. Some of the benefits were mentioned above — in fact word vectors were created precisely to solve the problems described there. A further benefit is that word vectors make clustering easy: either Euclidean distance or cosine similarity can find pairs of words with similar meanings. This essentially solves the "many words, one meaning" (synonymy) problem. (Unfortunately there still seems to be no good way of handling the one-word-many-meanings, i.e. polysemy, problem.)
37 | 
38 | For the mathematics behind Word2Vec, readers can refer to this series of articles. As for implementations, Google provides the official C source code, which you can compile yourself, and the **Python Gensim library** also ships Word2Vec as a sub-module (which, in fact, appears to be even more capable than the official version).
39 | 
40 | #### Sentence vectors
41 | 
42 | The next problem to solve: we have segmented the text and converted the words into high-dimensional vectors, so a sentence now corresponds to a set of word vectors, i.e. a matrix — much as an image, once digitized, corresponds to a matrix of pixels. But a model's input usually only accepts one-dimensional features, so what do we do? A simple idea is to flatten the matrix: concatenate the word vectors one after another into one long vector. That can work, but it pushes the input dimension up to thousands or even tens of thousands, which is hard to handle in practice. (And if tens of thousands of dimensions are no problem for today's computers, then a 1000×1000 image would already mean a million dimensions!)
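Before moving on to how whole sentences are encoded, here is a minimal sketch of the word-vector step itself: training a Word2Vec model with Gensim and querying it for similar words. The toy corpus and all parameter values below are made up purely for illustration, and the parameter names (`size`, `min_count`) follow the old Gensim 0.13.x API pinned in requirements.txt; newer Gensim versions rename `size` to `vector_size` and move the lookups onto `model.wv`.

```python
# -*- coding: utf-8 -*-
# Minimal, self-contained Word2Vec sketch (toy data, illustrative only).
from gensim.models.word2vec import Word2Vec

# Each "sentence" is a list of already-segmented words; in this project
# the real corpus is jieba-segmented review text.
sentences = [
    [u'我', u'爱', u'你'],
    [u'我', u'恨', u'你'],
    [u'我', u'爱', u'家园'],
    [u'我', u'爱', u'家庭'],
]

# size / min_count follow the gensim 0.13.x API used in this repository.
model = Word2Vec(sentences, size=10, window=2, min_count=1)

print(model[u'爱'])                       # the 10-dimensional vector for "爱"
print(model.most_similar(u'爱', topn=2))  # nearest words by cosine similarity
```

On a corpus this small the vectors are of course meaningless; the point is only the shape of the API: a list of tokenized sentences goes in, and a vector per word plus a similarity lookup come out.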
43 | 
44 | In natural language processing the usual tools are recursive neural networks or recurrent neural networks (both abbreviated RNNs). **They play the same role as convolutional networks do for images: they encode a matrix-shaped input into a lower-dimensional one-dimensional vector while keeping most of the useful information.**
45 | 
46 | 
47 | ![](http://kexue.fm/usr/uploads/2015/08/2067741257.png)
48 | 
49 | 
50 | ### Show me the code
51 | 
52 | The project code mainly follows reference 2 and adapts it to three-class text sentiment analysis.
53 | 
54 | 
55 | #### Data preprocessing and word-vector model training
56 | 
57 | Reference 2 documents the processing pipeline in detail, including:
58 | 
59 | 1. Assembling the data of each class into input matrices
60 | 2. Word segmentation with jieba
61 | 3. Training the Word2Vec word-vector model
62 | 
63 | Those steps are not repeated here; see the blog post in reference 2 for the details.
64 | 
65 | Besides the positive and negative classes, three-class classification adds a neutral class. Sentences with a semantic turn can be extracted from the original dataset, using "然而" ("however") and "但" ("but") as keywords, which yields three datasets with different sentiment.
66 | 
67 | #### The three-class LSTM model
68 | 
69 | A few points to note in the code: first, the labels have to be converted to one-hot form with keras.utils.to_categorical; second, the parameters differ from the binary-classification setup — the output activation is softmax and the loss function becomes categorical_crossentropy. The code is as follows:
70 | 
71 | ```python
72 | def get_data(index_dict,word_vectors,combined,y):
73 | 
74 |     n_symbols = len(index_dict) + 1  # number of word indices; words with frequency < 10 all share index 0, hence the +1
75 |     embedding_weights = np.zeros((n_symbols, vocab_dim))  # index 0 keeps an all-zero word vector
76 |     for word, index in index_dict.items():  # starting from index 1, copy each word's vector into the matrix
77 |         embedding_weights[index, :] = word_vectors[word]
78 |     x_train, x_test, y_train, y_test = train_test_split(combined, y, test_size=0.2)
79 |     y_train = keras.utils.to_categorical(y_train,num_classes=3)
80 |     y_test = keras.utils.to_categorical(y_test,num_classes=3)
81 |     # print x_train.shape,y_train.shape
82 |     return n_symbols,embedding_weights,x_train,y_train,x_test,y_test
83 | 
84 | 
85 | ## Define the network structure
86 | def train_lstm(n_symbols,embedding_weights,x_train,y_train,x_test,y_test):
87 |     print 'Defining a Simple Keras Model...'
88 |     model = Sequential()  # or Graph or whatever
89 |     model.add(Embedding(output_dim=vocab_dim,
90 |                         input_dim=n_symbols,
91 |                         mask_zero=True,
92 |                         weights=[embedding_weights],
93 |                         input_length=input_length))  # Adding Input Length
94 |     model.add(LSTM(output_dim=50, activation='tanh'))
95 |     model.add(Dropout(0.5))
96 |     model.add(Dense(3, activation='softmax'))  # Dense => fully connected layer, output dimension = 3
97 |     model.add(Activation('softmax'))  # note: the Dense layer above already applies softmax, so this extra layer is redundant; kept to match the saved model in model/lstm.yml
98 | 
99 |     print 'Compiling the Model...'
100 |     model.compile(loss='categorical_crossentropy',
101 |                   optimizer='adam',metrics=['accuracy'])
102 | 
103 |     print "Train..."  # batch_size=32
104 |     model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epoch,verbose=1)
105 | 
106 |     print "Evaluate..."
107 |     score = model.evaluate(x_test, y_test,
108 |                            batch_size=batch_size)
109 | 
110 |     yaml_string = model.to_yaml()
111 |     with open('../model/lstm.yml', 'w') as outfile:
112 |         outfile.write( yaml.dump(yaml_string, default_flow_style=True) )
113 |     model.save_weights('../model/lstm.h5')
114 |     print 'Test score:', score
115 | ```
116 | 
117 | #### Testing
118 | 
119 | The code is as follows:
120 | 
121 | ```python
122 | def lstm_predict(string):
123 |     print 'loading model......'
124 |     with open('../model/lstm.yml', 'r') as f:
125 |         yaml_string = yaml.load(f)
126 |     model = model_from_yaml(yaml_string)
127 | 
128 |     print 'loading weights......'
129 |     model.load_weights('../model/lstm.h5')
130 |     model.compile(loss='categorical_crossentropy',
131 |                   optimizer='adam',metrics=['accuracy'])
132 |     data=input_transform(string)
133 |     data.reshape(1,-1)
134 |     #print data
135 |     result=model.predict_classes(data)
136 |     # print result # [[1]]
137 |     if result[0]==1:
138 |         print string,' positive'
139 |     elif result[0]==0:
140 |         print string,' neutral'
141 |     else:
142 |         print string,' negative'
143 | ```
144 | 
145 | Testing shows that sentences with a semantic turn such as "不是太好" ("not that great") and "不错不错" ("pretty good"), which already appeared in the binary-classification experiments, are now all predicted correctly, so the practical improvement is clear. There is a drawback, though: the neutral class is predicted with fairly low probability. The author traces this back to the quantity and quality of the data: the neutral dataset is less than half the size of either of the other two, and the quality of the neutral samples extracted with the simple "然而"/"但" keyword rule is not very high either, which is where the bias comes from. In short, the quality of the training data matters enormously, and obtaining training samples of both high quality and high quantity becomes the new challenge.
146 | 
147 | 
148 | - References
149 | 
150 | [文本情感分类(二):深度学习模型](http://spaces.ac.cn/archives/3414/)
151 | 
152 | [Shopping Reviews sentiment analysis](https://buptldy.github.io/2016/07/20/2016-07-20-sentiment%20analysis/)
153 | 
154 | 
--------------------------------------------------------------------------------
/data/deal.py:
--------------------------------------------------------------------------------
1 | #! /bin/env python
2 | # -*- coding: utf-8 -*-
3 | 
4 | with open("n_pos.csv", "w") as n:
5 |     with open("pos.csv", "r") as p:
6 |         for line in p.readlines():
7 |             if line == "\"\n":
8 |                 continue
9 |             n.write(line)
10 | 
11 | line = "\""
12 | print len(line)
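The repository does not include the script that produced data/neutral.csv. The README only says that neutral samples were pulled out of the original corpus by looking for the transition words "然而" and "但", so the following is just a guessed-at sketch of such a rule; the file names and the rule itself are assumptions for illustration, not the authors' actual preprocessing code.

```python
# -*- coding: utf-8 -*-
# Hypothetical sketch of the keyword rule described in the README:
# keep review sentences that contain a transition word, on the theory
# that "X is good, but Y is bad" reads as roughly neutral overall.
# Input/output file names are made up for illustration.
import codecs

TRANSITION_WORDS = [u'但', u'但是', u'然而']

def looks_neutral(line):
    return any(word in line for word in TRANSITION_WORDS)

with codecs.open('raw_reviews.csv', 'r', 'utf-8') as src, \
     codecs.open('neutral_candidates.csv', 'w', 'utf-8') as dst:
    for line in src:
        if looks_neutral(line):
            dst.write(line)
```

As the README's conclusion notes, a rule this crude also explains why the neutral set ends up smaller and noisier than the other two.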
--------------------------------------------------------------------------------
/lstm/lstm_test.py:
--------------------------------------------------------------------------------
1 | #! /bin/env python
2 | # -*- coding: utf-8 -*-
3 | """
4 | Prediction script
5 | """
6 | import jieba
7 | import numpy as np
8 | from gensim.models.word2vec import Word2Vec
9 | from gensim.corpora.dictionary import Dictionary
10 | from keras.preprocessing import sequence
11 | 
12 | import yaml
13 | from keras.models import model_from_yaml
14 | np.random.seed(1337)  # For Reproducibility
15 | import sys
16 | sys.setrecursionlimit(1000000)
17 | 
18 | # define parameters
19 | maxlen = 100
20 | 
21 | def create_dictionaries(model=None,
22 |                         combined=None):
23 |     ''' This function does a number of jobs:
24 |         1- Creates a word to index mapping
25 |         2- Creates a word to vector mapping
26 |         3- Transforms the Training and Testing Dictionaries
27 | 
28 |     '''
29 |     if (combined is not None) and (model is not None):
30 |         gensim_dict = Dictionary()
31 |         gensim_dict.doc2bow(model.vocab.keys(),
32 |                             allow_update=True)
33 |         # words below the frequency threshold map to 0, hence the k+1
34 |         w2indx = {v: k+1 for k, v in gensim_dict.items()}  # index of every word above the frequency threshold, (k->v)=>(v->k)
35 |         w2vec = {word: model[word] for word in w2indx.keys()}  # word vector of every word above the frequency threshold, (word->model(word))
36 | 
37 |         def parse_dataset(combined):  # closure --> used only locally
38 |             ''' Words become integers
39 |             '''
40 |             data=[]
41 |             for sentence in combined:
42 |                 new_txt = []
43 |                 for word in sentence:
44 |                     try:
45 |                         new_txt.append(w2indx[word])
46 |                     except:
47 |                         new_txt.append(0)  # low-frequency words -> 0
48 |                 data.append(new_txt)
49 |             return data  # word=>index
50 |         combined=parse_dataset(combined)
51 |         combined= sequence.pad_sequences(combined, maxlen=maxlen)  # index sequence for each sentence; words below the frequency threshold get index 0
52 |         return w2indx, w2vec,combined
53 |     else:
54 |         print 'No data provided...'
55 | 
56 | 
57 | def input_transform(string):
58 |     words=jieba.lcut(string)
59 |     words=np.array(words).reshape(1,-1)
60 |     model=Word2Vec.load('../model/Word2vec_model.pkl')
61 |     _,_,combined=create_dictionaries(model,words)
62 |     return combined
63 | 
64 | 
65 | def lstm_predict(string):
66 |     print 'loading model......'
67 |     with open('../model/lstm.yml', 'r') as f:
68 |         yaml_string = yaml.load(f)
69 |     model = model_from_yaml(yaml_string)
70 | 
71 |     print 'loading weights......'
72 |     model.load_weights('../model/lstm.h5')
73 |     model.compile(loss='categorical_crossentropy',
74 |                   optimizer='adam',metrics=['accuracy'])
75 |     data=input_transform(string)
76 |     data.reshape(1,-1)
77 |     #print data
78 |     result=model.predict_classes(data)
79 |     # print result # [[1]]
80 |     if result[0]==1:
81 |         print string,' positive'
82 |     elif result[0]==0:
83 |         print string,' neutral'
84 |     else:
85 |         print string,' negative'
86 | 
87 | 
88 | if __name__=='__main__':
89 |     # string='酒店的环境非常好,价格也便宜,值得推荐'
90 |     # string='手机质量太差了,傻逼店家,赚黑心钱,以后再也不会买了'
91 |     # string = "这是我看过文字写得很糟糕的书,因为买了,还是耐着性子看完了,但是总体来说不好,文字、内容、结构都不好"
92 |     # string = "虽说是职场指导书,但是写的有点干涩,我读一半就看不下去了!"
93 |     # string = "书的质量还好,但是内容实在没意思。本以为会侧重心理方面的分析,但实际上是婚外恋内容。"
94 |     # string = "不是太好"
95 |     # string = "不错不错"
96 |     string = "真的一般,没什么可以学习的"
97 | 
98 |     lstm_predict(string)
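One subtle point shared by the prediction script above and the training script below: the raw labels are 1 (positive), 0 (neutral) and -1 (negative), yet `keras.utils.to_categorical(..., num_classes=3)` is called on them directly in `get_data`. Because the label is used as a column index and -1 wraps around to the last position, the negative class ends up in column 2 — which is exactly why `lstm_predict` treats class 1 as positive, class 0 as neutral and everything else (class 2) as negative. The sketch below reproduces that mapping with plain NumPy; it only illustrates the indexing behaviour and is not part of the original project.

```python
import numpy as np

def to_one_hot(labels, num_classes=3):
    """Mimic what to_categorical does with the labels used in this repo:
    each label is used as a column index, and -1 wraps around to the
    last column (index 2)."""
    labels = np.asarray(labels, dtype=int)
    one_hot = np.zeros((labels.shape[0], num_classes))
    one_hot[np.arange(labels.shape[0]), labels] = 1
    return one_hot

if __name__ == '__main__':
    y = np.array([1, 0, -1])   # positive, neutral, negative
    print(to_one_hot(y))
    # [[0. 1. 0.]]  -> class index 1 = positive
    # [[1. 0. 0.]]  -> class index 0 = neutral
    # [[0. 0. 1.]]  -> class index 2 = negative (-1 indexes the last column)
```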
--------------------------------------------------------------------------------
/lstm/lstm_train.py:
--------------------------------------------------------------------------------
1 | #! /bin/env python
2 | # -*- coding: utf-8 -*-
3 | """
4 | Train the network and save the model; the LSTM is implemented with the Keras library
5 | """
6 | import pandas as pd
7 | import numpy as np
8 | import jieba
9 | import multiprocessing
10 | 
11 | from gensim.models.word2vec import Word2Vec
12 | from gensim.corpora.dictionary import Dictionary
13 | from keras.preprocessing import sequence
14 | 
15 | from sklearn.cross_validation import train_test_split
16 | from keras.models import Sequential
17 | from keras.layers.embeddings import Embedding
18 | from keras.layers.recurrent import LSTM
19 | from keras.layers.core import Dense, Dropout,Activation
20 | from keras.models import model_from_yaml
21 | np.random.seed(1337)  # For Reproducibility
22 | import sys
23 | sys.setrecursionlimit(1000000)
24 | import yaml
25 | import keras  # needed for keras.utils.to_categorical in get_data()
26 | # set parameters:
27 | cpu_count = multiprocessing.cpu_count() # 4
28 | vocab_dim = 100
29 | n_iterations = 1  # ideally more..
30 | n_exposures = 10  # keep only words that appear at least 10 times
31 | window_size = 7
32 | n_epoch = 4
33 | input_length = 100
34 | maxlen = 100
35 | 
36 | batch_size = 32
37 | 
38 | 
39 | def loadfile():
40 |     neg=pd.read_csv('../data/neg.csv',header=None,index_col=None)
41 |     pos=pd.read_csv('../data/pos.csv',header=None,index_col=None,error_bad_lines=False)
42 |     neu=pd.read_csv('../data/neutral.csv', header=None, index_col=None)
43 | 
44 |     combined = np.concatenate((pos[0], neu[0], neg[0]))
45 |     y = np.concatenate((np.ones(len(pos), dtype=int), np.zeros(len(neu), dtype=int),
46 |                         -1*np.ones(len(neg),dtype=int)))
47 | 
48 |     return combined,y
49 | 
50 | 
51 | # Segment each sentence into words and strip newlines
52 | def tokenizer(text):
53 |     ''' Simple Parser converting each document to lower-case, then
54 |         removing the breaks for new lines and finally splitting on the
55 |         whitespace
56 |     '''
57 |     text = [jieba.lcut(document.replace('\n', '')) for document in text]
58 |     return text
59 | 
60 | 
61 | def create_dictionaries(model=None,
62 |                         combined=None):
63 |     ''' This function does a number of jobs:
64 |         1- Creates a word to index mapping
65 |         2- Creates a word to vector mapping
66 |         3- Transforms the Training and Testing Dictionaries
67 | 
68 |     '''
69 |     if (combined is not None) and (model is not None):
70 |         gensim_dict = Dictionary()
71 |         gensim_dict.doc2bow(model.vocab.keys(),
72 |                             allow_update=True)
73 |         # words below the frequency threshold map to 0, hence the k+1
74 |         w2indx = {v: k+1 for k, v in gensim_dict.items()}  # index of every word above the frequency threshold, (k->v)=>(v->k)
75 |         w2vec = {word: model[word] for word in w2indx.keys()}  # word vector of every word above the frequency threshold, (word->model(word))
76 | 
77 |         def parse_dataset(combined):  # closure --> used only locally
78 |             ''' Words become integers
79 |             '''
80 |             data=[]
81 |             for sentence in combined:
82 |                 new_txt = []
83 |                 for word in sentence:
84 |                     try:
85 |                         new_txt.append(w2indx[word])
86 |                     except:
87 |                         new_txt.append(0)  # low-frequency words -> 0
88 |                 data.append(new_txt)
89 |             return data  # word=>index
90 |         combined=parse_dataset(combined)
91 |         combined= sequence.pad_sequences(combined, maxlen=maxlen)  # index sequence for each sentence; words below the frequency threshold get index 0
92 |         return w2indx, w2vec,combined
93 |     else:
94 |         print 'No data provided...'
95 | 
96 | 
97 | # Build the word dictionary; return each word's index, its word vector, and the index sequence of every sentence
98 | def word2vec_train(combined):
99 | 
100 |     model = Word2Vec(size=vocab_dim,
101 |                      min_count=n_exposures,
102 |                      window=window_size,
103 |                      workers=cpu_count,
104 |                      iter=n_iterations)
105 |     model.build_vocab(combined)  # input: list
106 |     model.train(combined)
107 |     model.save('../model/Word2vec_model.pkl')  # save next to the other model files; lstm_test.py loads the vectors from this path
108 |     index_dict, word_vectors,combined = create_dictionaries(model=model,combined=combined)
109 |     return index_dict, word_vectors,combined
110 | 
111 | 
112 | def get_data(index_dict,word_vectors,combined,y):
113 | 
114 |     n_symbols = len(index_dict) + 1  # number of word indices; words with frequency < 10 all share index 0, hence the +1
115 |     embedding_weights = np.zeros((n_symbols, vocab_dim))  # index 0 keeps an all-zero word vector
116 |     for word, index in index_dict.items():  # starting from index 1, copy each word's vector into the matrix
117 |         embedding_weights[index, :] = word_vectors[word]
118 |     x_train, x_test, y_train, y_test = train_test_split(combined, y, test_size=0.2)
119 |     y_train = keras.utils.to_categorical(y_train,num_classes=3)
120 |     y_test = keras.utils.to_categorical(y_test,num_classes=3)
121 |     # print x_train.shape,y_train.shape
122 |     return n_symbols,embedding_weights,x_train,y_train,x_test,y_test
123 | 
124 | 
125 | ## Define the network structure
126 | def train_lstm(n_symbols,embedding_weights,x_train,y_train,x_test,y_test):
127 |     print 'Defining a Simple Keras Model...'
128 |     model = Sequential()  # or Graph or whatever
129 |     model.add(Embedding(output_dim=vocab_dim,
130 |                         input_dim=n_symbols,
131 |                         mask_zero=True,
132 |                         weights=[embedding_weights],
133 |                         input_length=input_length))  # Adding Input Length
134 |     model.add(LSTM(output_dim=50, activation='tanh'))
135 |     model.add(Dropout(0.5))
136 |     model.add(Dense(3, activation='softmax'))  # Dense => fully connected layer, output dimension = 3
137 |     model.add(Activation('softmax'))  # note: Dense already applies softmax, so this layer is redundant; kept to match the saved model in model/lstm.yml
138 | 
139 |     print 'Compiling the Model...'
140 |     model.compile(loss='categorical_crossentropy',
141 |                   optimizer='adam',metrics=['accuracy'])
142 | 
143 |     print "Train..."  # batch_size=32
144 |     model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epoch,verbose=1)
145 | 
146 |     print "Evaluate..."
147 |     score = model.evaluate(x_test, y_test,
148 |                            batch_size=batch_size)
149 | 
150 |     yaml_string = model.to_yaml()
151 |     with open('../model/lstm.yml', 'w') as outfile:
152 |         outfile.write( yaml.dump(yaml_string, default_flow_style=True) )
153 |     model.save_weights('../model/lstm.h5')
154 |     print 'Test score:', score
155 | 
156 | 
157 | # Train the model and save it
158 | print 'Loading Data...'
159 | combined,y=loadfile()
160 | print len(combined),len(y)
161 | print 'Tokenising...'
162 | combined = tokenizer(combined)
163 | print 'Training a Word2vec model...'
164 | index_dict, word_vectors,combined=word2vec_train(combined)
165 | 
166 | print 'Setting up Arrays for Keras Embedding Layer...'
167 | n_symbols,embedding_weights,x_train,y_train,x_test,y_test=get_data(index_dict, word_vectors,combined,y) 168 | print "x_train.shape and y_train.shape:" 169 | print x_train.shape,y_train.shape 170 | train_lstm(n_symbols,embedding_weights,x_train,y_train,x_test,y_test) -------------------------------------------------------------------------------- /lstm/test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 9, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "Skipping line 2607: expected 1 fields, saw 9\n", 13 | "Skipping line 3143: expected 1 fields, saw 2\n", 14 | "Skipping line 3173: expected 1 fields, saw 8\n", 15 | "\n" 16 | ] 17 | } 18 | ], 19 | "source": [ 20 | "import pandas as pd\n", 21 | "\n", 22 | "neg=pd.read_csv('../data/neg.csv',header=None,index_col=None)\n", 23 | "pos=pd.read_csv('../data/pos.csv',header=None,index_col=None,error_bad_lines=False)\n", 24 | "neu=pd.read_csv('../data/neutral.csv', header=None, index_col=None)" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 11, 30 | "metadata": {}, 31 | "outputs": [ 32 | { 33 | "data": { 34 | "text/plain": [ 35 | "0 做为一本声名在外的流行书,说的还是广州的外企,按道理应该和我的生存环境差不多啊。但是一看之下...\n", 36 | "1 作者完全是以一个过来的自认为是成功者的角度去写这个问题,感觉很不客观。虽然不是很喜欢,但是,...\n", 37 | "2 作者提倡内调,不信任化妆品,这点赞同。但是所列举的方法太麻烦,配料也不好找。不是太实用。\n", 38 | "3 作者的文笔还行,但通篇感觉太琐碎,有点文人的无病呻吟。自由主义者。作者的品性不敢苟同,无民族...\n", 39 | "4 作者倒是个很小资的人,但有点自恋的感觉,书并没有什么大帮助\n", 40 | "5 作为个人经验在网上谈谈可以,但拿来出书就有点过了,书中还有些明显的谬误。不过文笔还不错,建议...\n", 41 | "6 昨天刚兴奋地写了评论,今天便遇一闹心事,因把此套书推荐给很多朋友,朋友就拖我在网上购,结果前...\n", 42 | "7 纵观整部书(上下两册) 从文字,到结构,人物,情节 没有一个地方是可取的虽然有过从业经验 ...\n", 43 | "8 字很大,内容不够充实当初看大家评论说得很好才买的但实际上却没那么好,感觉深度也不够如果你还在...\n", 44 | "9 中国社会科学出版社出的版本可能有删节,但未查到相关说明。\n", 45 | "10 纸张的质量也不好,文字部分更是倾斜的,盗版的很不负责任,虽然不评价胡兰成本人,但是文字还是美...\n", 46 | "11 职场如战场在这部小说里被阐述的淋漓尽致,拉拉工作勤奋如老黄牛,但性格却更似倔牛;王伟虽正直但...\n", 47 | "12 只因李安的电影《色,戒》,才买了张爱玲的小说来读。读后的感觉是失望——那么短浅,迷惑——李导...\n", 48 | "13 之前看到大家都说非常好 于是 很心动 也买了本 回来后看看 非常一般它讲很多就是要我们承认自...\n", 49 | "14 整本书没几个英文,好像小孩子看的,但是小孩子也许看不懂的那种.失望\n", 50 | "15 整本书给我的感觉就是一农民暴富了后害怕别人也富,挤占了他的地位。但又不想把害怕的思想暴露底那...\n", 51 | "16 这书写得很乱, 不系统, 不正规。而且废话, 大话连篇, 好不容易说到正题了, 比如该如何用...\n", 52 | "17 这是一本小说集,好多章节故事,我都已经看过了。书印刷还是不错的。但是还是买整套的好,有点后悔。\n", 53 | "18 这是我这一年内看到最差的一本书,我用不到2个小时看完(我实在不愿意在这本书上浪费太多时间),...\n", 54 | "19 这是我看过文字写得很糟糕的书,因为买了,还是耐着性子看完了,但是总体来说不好,文字、内容、结...\n", 55 | "20 这几天旅行中也在看《杜拉拉升职记》,的确是值得推荐的职场读物。虽然是虚拟的,但很多skill...\n", 56 | "21 这个作家的书真的很一般,除非和她有交叉的经历,否则很难找到感觉.细腻是日本人的长处,但很平凡...\n", 57 | "22 这个书首位的排名言过其实,感觉比不上圈子全套。1)故事里面的故事很多牵强赴会,逻辑性差。2)...\n", 58 | "23 这本书总体来说还可以,但是有些地方自己不是很懂,比如按穴位都不知道大约位置在那里\n", 59 | "24 这本书完全是根据这部电影走红后出版的商业书。内容很不全。里面收集了张的很多长篇小说,但都简化了。\n", 60 | "25 这本书虽然内容不多,但是插图很大,很生动和形象,里面的故事都是小孩子碰到过的事情,并且很有启...\n", 61 | "26 这本书送到的时候就是全湿了,但是就给你们的客服打电话了,说会安排人过来取书.并办退款手续.可...\n", 62 | "27 这本书是本着是本热销书看的,但看完后觉得没什么意思,或者有些做作。。。\n", 63 | "28 这本书没有我想象的好,书本身的质量和印刷还可以 ,但内容与书名不符,应该改名字;而且有很多牵...\n", 64 | "29 这本书买之前是看了这么多的五星评价而才买来的,以为应该应值得拥有的一本书!看完这样书后,在我...\n", 65 | " ... 
\n", 66 | "4325 一直都知道LG-C960拥有目前手机业界中的较高像素,其突出想象的外形设计令人匪夷所思.虽然...\n", 67 | "4326 1、滑盖设计,纤薄机身,按键方式很有个性,电容触控式,只要用手指遮挡住相应功能键的红色灯即可...\n", 68 | "4327 这手机外观真的是太漂亮,如果你看到真机我觉得你一定会爱上她的美!而且屏幕我觉得颜色也很鲜艳!...\n", 69 | "4328 1.外形不错。我喜欢直板机,估计看我这个帖子的朋友也都是这样;我的是全黑的,很酷;还有一款...\n", 70 | "4329 屏做的真不错,跟同事一起对比了一下,我这个屏明显比他那个机子强,亮但是不刺眼,我同事调了亮度...\n", 71 | "4330 外观不错,感觉比A800那种正统外形显得活泼多了,喜欢!屏幕比较大,色彩也好,就是拍出照片效...\n", 72 | "4331 摄像头做的不错,片子出来很清楚而且颜色好,这机子内存也够大,我总爱带个机子逛街,淘到好东西先...\n", 73 | "4332 功能还算实在,尽管不是很好,但是,内外双屏的效果还不错,铃声也行了,手写笔慢是慢了点,但比起...\n", 74 | "4333 1,外观,还可以!比较好看2,屏幕,在阳光下不好看清楚!在其他的地方还是很好的!当然没有TF...\n", 75 | "4334 一直用的这只机子,虽然功能不多,但是手机本身应有的功能全有了,我感觉没什么不好。那些加了比如...\n", 76 | "4335 电池做的不错,连打电话加发短信,我能用一周,一般周6会充电,现在已成习惯了!机身够薄,机子也...\n", 77 | "4336 继承了西门子键盘快捷功能菜单的功能,总共有15个按键可以自定义功能,按照自己的习惯定义好后基...\n", 78 | "4337 元旦拿到了这款机子,在充了一夜的电,用了1天后,越来越发觉了这款机子的一些优越性能,在此简单...\n", 79 | "4338 M100不错的机子,整体设计简洁,直板机只有一个屏,所以省是不用说的了,6万5千色的屏,差不...\n", 80 | "4339 开机声音很响亮!这个“16和弦”声音特别响亮,表现效果很出众,有点Z2的感觉。铃声够大,放包...\n", 81 | "4340 1.书写功能方面:有很多像SonyEricsson的手写部分都会把用户局限在一个很小的区域内...\n", 82 | "4341 设计独特,纯粹的MP3造型,加上轻巧的机身,拿在手里有\"一见钟情\"的感觉;因为很想买一个MP...\n", 83 | "4342 用这只机子很舒服,信号方面不错,机子运行也稳定,通话是声音比较清楚,室外我也试过,听着没问题...\n", 84 | "4343 机子不大不小,我用很合适;外屏幕小了点,但是看来电号码还是够用,很方便;自动开关机功能我很喜...\n", 85 | "4344 由于是寄回老家的没有看到东西,但听家里人说还不错,还没有安装。有待后期追加评论!\n", 86 | "4345 双十一买的,还没安装,但是很满意!\n", 87 | "4346 商品不错 送货也快 但是服务员售后真的不行 快递哥哥3天从西安送到南宁并且送进家门\n", 88 | "4347 到货速度很快,宝贝包装完好。虽然还没来得及安装使用,但因为有朋友正使用该款,质量有保证。\n", 89 | "4348 还没拆开,但是包装的很好。第二次来买了,还送了赠品,nice。谢谢\n", 90 | "4349 价格实惠,快递也快,安装也快,虽然安装小师傅年轻是新手,但很耐心负责比较满意吧!保温效果很满意.\n", 91 | "4350 虽然家里用不到 但商家的服务态度 超级棒!赞!\n", 92 | "4351 已经安装上了,但没有试,我相信美的质量,应该不会有问题。先5分好评吧。\n", 93 | "4352 很不错,到货很快,当天给客服打电话,下午安装人员就上门了,速度啊!但但当时没有水,没有试试,...\n", 94 | "4353 买来放在出租房里的,所以自己也没试过,但是安装服务人员特别好,最大限度地给省钱,两套热水澡装...\n", 95 | "4354 买来放在出租房里的,所以自己也没试过,但是安装服务人员特别好,最大限度地给省钱,两套热水澡装...\n", 96 | "Name: 0, Length: 4355, dtype: object" 97 | ] 98 | }, 99 | "execution_count": 11, 100 | "metadata": {}, 101 | "output_type": "execute_result" 102 | } 103 | ], 104 | "source": [ 105 | "neu[0]" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 19, 111 | "metadata": {}, 112 | "outputs": [ 113 | { 114 | "data": { 115 | "text/plain": [ 116 | "(21088,)" 117 | ] 118 | }, 119 | "execution_count": 19, 120 | "metadata": {}, 121 | "output_type": "execute_result" 122 | } 123 | ], 124 | "source": [ 125 | "import numpy as np\n", 126 | "\n", 127 | "combined = np.concatenate((pos[0], neu[0], neg[0]))\n", 128 | "combined.shape" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 15, 134 | "metadata": {}, 135 | "outputs": [ 136 | { 137 | "data": { 138 | "text/plain": [ 139 | "(21088,)" 140 | ] 141 | }, 142 | "execution_count": 15, 143 | "metadata": {}, 144 | "output_type": "execute_result" 145 | } 146 | ], 147 | "source": [ 148 | "# pos -> 1; neu -> 0; neg -> -1\n", 149 | "y = np.concatenate((np.ones(len(pos), dtype=int), np.zeros(len(neu), dtype=int), -1*np.ones(len(neg),dtype=int)))\n", 150 | "y.shape" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 20, 156 | "metadata": { 157 | "collapsed": true 158 | }, 159 | "outputs": [], 160 | "source": [ 161 | "import jieba\n", 162 | "\n", 163 | "#对句子经行分词,并去掉换行符\n", 164 | "def tokenizer(text):\n", 165 | " ''' Simple Parser converting each document to lower-case, then\n", 166 | " removing the breaks for new lines and finally splitting on the\n", 167 | " whitespace\n", 168 | " '''\n", 169 | " text = [jieba.lcut(document.replace('\\n', '')) for document in text]\n", 170 | " 
return text\n", 171 | "\n", 172 | "combined = tokenizer(combined)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": 21, 178 | "metadata": {}, 179 | "outputs": [ 180 | { 181 | "name": "stdout", 182 | "output_type": "stream", 183 | "text": [ 184 | "Training a Word2vec model...\n" 185 | ] 186 | } 187 | ], 188 | "source": [ 189 | "from gensim.models.word2vec import Word2Vec\n", 190 | "from gensim.corpora.dictionary import Dictionary\n", 191 | "from keras.preprocessing import sequence\n", 192 | "import multiprocessing\n", 193 | "\n", 194 | "cpu_count = multiprocessing.cpu_count() # 4\n", 195 | "vocab_dim = 100\n", 196 | "n_iterations = 10 # ideally more..\n", 197 | "n_exposures = 10 # 所有频数超过10的词语\n", 198 | "window_size = 7\n", 199 | "n_epoch = 4\n", 200 | "input_length = 100\n", 201 | "maxlen = 100\n", 202 | "\n", 203 | "def create_dictionaries(model=None,\n", 204 | " combined=None):\n", 205 | " ''' Function does are number of Jobs:\n", 206 | " 1- Creates a word to index mapping\n", 207 | " 2- Creates a word to vector mapping\n", 208 | " 3- Transforms the Training and Testing Dictionaries\n", 209 | "\n", 210 | " '''\n", 211 | " if (combined is not None) and (model is not None):\n", 212 | " gensim_dict = Dictionary()\n", 213 | " gensim_dict.doc2bow(model.vocab.keys(),\n", 214 | " allow_update=True)\n", 215 | " # freqxiao10->0 所以k+1\n", 216 | " w2indx = {v: k+1 for k, v in gensim_dict.items()}#所有频数超过10的词语的索引,(k->v)=>(v->k)\n", 217 | " w2vec = {word: model[word] for word in w2indx.keys()}#所有频数超过10的词语的词向量, (word->model(word))\n", 218 | "\n", 219 | " def parse_dataset(combined): # 闭包-->临时使用\n", 220 | " ''' Words become integers\n", 221 | " '''\n", 222 | " data=[]\n", 223 | " for sentence in combined:\n", 224 | " new_txt = []\n", 225 | " for word in sentence:\n", 226 | " try:\n", 227 | " new_txt.append(w2indx[word])\n", 228 | " except:\n", 229 | " new_txt.append(0) # freqxiao10->0\n", 230 | " data.append(new_txt)\n", 231 | " return data # word=>index\n", 232 | " combined=parse_dataset(combined)\n", 233 | " combined= sequence.pad_sequences(combined, maxlen=maxlen)#每个句子所含词语对应的索引,所以句子中含有频数小于10的词语,索引为0\n", 234 | " return w2indx, w2vec,combined\n", 235 | " else:\n", 236 | " print 'No data provided...'\n", 237 | "\n", 238 | "\n", 239 | "#创建词语字典,并返回每个词语的索引,词向量,以及每个句子所对应的词语索引\n", 240 | "def word2vec_train(combined):\n", 241 | "\n", 242 | " model = Word2Vec(size=vocab_dim,\n", 243 | " min_count=n_exposures,\n", 244 | " window=window_size,\n", 245 | " workers=cpu_count,\n", 246 | " iter=n_iterations)\n", 247 | " model.build_vocab(combined) # input: list\n", 248 | " model.train(combined)\n", 249 | " model.save('../model/Word2vec_model.pkl')\n", 250 | " index_dict, word_vectors,combined = create_dictionaries(model=model,combined=combined)\n", 251 | " return index_dict, word_vectors,combined\n", 252 | "\n", 253 | "print 'Training a Word2vec model...'\n", 254 | "index_dict, word_vectors,combined=word2vec_train(combined)" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": 25, 260 | "metadata": {}, 261 | "outputs": [ 262 | { 263 | "name": "stdout", 264 | "output_type": "stream", 265 | "text": [ 266 | "Setting up Arrays for Keras Embedding Layer...\n", 267 | "x_train.shape and y_train.shape:\n", 268 | "(16870, 100) (16870, 3)\n", 269 | "Defining a Simple Keras Model...\n" 270 | ] 271 | }, 272 | { 273 | "name": "stderr", 274 | "output_type": "stream", 275 | "text": [ 276 | "/home/zcy/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:38: UserWarning: 
Update your `LSTM` call to the Keras 2 API: `LSTM(units=50, activation=\"tanh\", recurrent_activation=\"hard_sigmoid\")`\n" 277 | ] 278 | }, 279 | { 280 | "name": "stdout", 281 | "output_type": "stream", 282 | "text": [ 283 | "Compiling the Model...\n", 284 | "Train...\n", 285 | "Epoch 1/4\n", 286 | "16870/16870 [==============================] - 78s 5ms/step - loss: 0.9022 - acc: 0.6408\n", 287 | "Epoch 2/4\n", 288 | "16870/16870 [==============================] - 78s 5ms/step - loss: 0.7677 - acc: 0.7836\n", 289 | "Epoch 3/4\n", 290 | "16870/16870 [==============================] - 76s 4ms/step - loss: 0.6804 - acc: 0.8724\n", 291 | "Epoch 4/4\n", 292 | "16870/16870 [==============================] - 56s 3ms/step - loss: 0.6627 - acc: 0.8888\n", 293 | "Evaluate...\n", 294 | "4218/4218 [==============================] - 4s 947us/step\n", 295 | "Test score: [0.67400303481596235, 0.87624466577487403]\n" 296 | ] 297 | } 298 | ], 299 | "source": [ 300 | "from sklearn.cross_validation import train_test_split\n", 301 | "from keras.models import Sequential\n", 302 | "from keras.layers.embeddings import Embedding\n", 303 | "from keras.layers.recurrent import LSTM\n", 304 | "from keras.layers.core import Dense, Dropout,Activation\n", 305 | "from keras.models import model_from_yaml\n", 306 | "np.random.seed(1337) # For Reproducibility\n", 307 | "import sys\n", 308 | "sys.setrecursionlimit(1000000)\n", 309 | "import yaml\n", 310 | "import keras\n", 311 | "\n", 312 | "batch_size = 32\n", 313 | "\n", 314 | "\n", 315 | "def get_data(index_dict,word_vectors,combined,y):\n", 316 | "\n", 317 | " n_symbols = len(index_dict) + 1 # 所有单词的索引数,频数小于10的词语索引为0,所以加1\n", 318 | " embedding_weights = np.zeros((n_symbols, vocab_dim)) # 初始化 索引为0的词语,词向量全为0\n", 319 | " for word, index in index_dict.items(): # 从索引为1的词语开始,对每个词语对应其词向量\n", 320 | " embedding_weights[index, :] = word_vectors[word]\n", 321 | " x_train, x_test, y_train, y_test = train_test_split(combined, y, test_size=0.2)\n", 322 | " y_train = keras.utils.to_categorical(y_train,num_classes=3) \n", 323 | " y_test = keras.utils.to_categorical(y_test,num_classes=3)\n", 324 | " # print x_train.shape,y_train.shape\n", 325 | " return n_symbols,embedding_weights,x_train,y_train,x_test,y_test\n", 326 | "\n", 327 | "\n", 328 | "##定义网络结构\n", 329 | "def train_lstm(n_symbols,embedding_weights,x_train,y_train,x_test,y_test):\n", 330 | " print 'Defining a Simple Keras Model...'\n", 331 | " model = Sequential() # or Graph or whatever\n", 332 | " model.add(Embedding(output_dim=vocab_dim,\n", 333 | " input_dim=n_symbols,\n", 334 | " mask_zero=True,\n", 335 | " weights=[embedding_weights],\n", 336 | " input_length=input_length)) # Adding Input Length\n", 337 | " model.add(LSTM(output_dim=50, activation='tanh', inner_activation='hard_sigmoid'))\n", 338 | " model.add(Dropout(0.5))\n", 339 | " model.add(Dense(3, activation='softmax')) # Dense=>全连接层,输出维度=1\n", 340 | " model.add(Activation('softmax'))\n", 341 | "\n", 342 | " print 'Compiling the Model...'\n", 343 | " model.compile(loss='categorical_crossentropy',\n", 344 | " optimizer='adam',metrics=['accuracy'])\n", 345 | "\n", 346 | " print \"Train...\" # batch_size=32\n", 347 | " model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epoch,verbose=1)\n", 348 | "\n", 349 | " print \"Evaluate...\"\n", 350 | " score = model.evaluate(x_test, y_test,\n", 351 | " batch_size=batch_size)\n", 352 | "\n", 353 | " yaml_string = model.to_yaml()\n", 354 | " with open('../model/lstm.yml', 'w') as outfile:\n", 355 | " outfile.write( 
yaml.dump(yaml_string, default_flow_style=True) )\n", 356 | " model.save_weights('../model/lstm.h5')\n", 357 | " print 'Test score:', score\n", 358 | "\n", 359 | "print 'Setting up Arrays for Keras Embedding Layer...'\n", 360 | "n_symbols,embedding_weights,x_train,y_train,x_test,y_test=get_data(index_dict, word_vectors,combined,y)\n", 361 | "print \"x_train.shape and y_train.shape:\"\n", 362 | "print x_train.shape,y_train.shape\n", 363 | "train_lstm(n_symbols,embedding_weights,x_train,y_train,x_test,y_test)" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 31, 369 | "metadata": { 370 | "collapsed": true 371 | }, 372 | "outputs": [], 373 | "source": [ 374 | "\"\"\"\n", 375 | "预测\n", 376 | "\"\"\"\n", 377 | "import jieba\n", 378 | "import numpy as np\n", 379 | "from gensim.models.word2vec import Word2Vec\n", 380 | "from gensim.corpora.dictionary import Dictionary\n", 381 | "from keras.preprocessing import sequence\n", 382 | "\n", 383 | "import yaml\n", 384 | "from keras.models import model_from_yaml\n", 385 | "np.random.seed(1337) # For Reproducibility\n", 386 | "import sys\n", 387 | "sys.setrecursionlimit(1000000)\n", 388 | "\n", 389 | "# define parameters\n", 390 | "maxlen = 100\n", 391 | "\n", 392 | "def create_dictionaries(model=None,\n", 393 | " combined=None):\n", 394 | " ''' Function does are number of Jobs:\n", 395 | " 1- Creates a word to index mapping\n", 396 | " 2- Creates a word to vector mapping\n", 397 | " 3- Transforms the Training and Testing Dictionaries\n", 398 | "\n", 399 | " '''\n", 400 | " if (combined is not None) and (model is not None):\n", 401 | " gensim_dict = Dictionary()\n", 402 | " gensim_dict.doc2bow(model.vocab.keys(),\n", 403 | " allow_update=True)\n", 404 | " # freqxiao10->0 所以k+1\n", 405 | " w2indx = {v: k+1 for k, v in gensim_dict.items()}#所有频数超过10的词语的索引,(k->v)=>(v->k)\n", 406 | " w2vec = {word: model[word] for word in w2indx.keys()}#所有频数超过10的词语的词向量, (word->model(word))\n", 407 | "\n", 408 | " def parse_dataset(combined): # 闭包-->临时使用\n", 409 | " ''' Words become integers\n", 410 | " '''\n", 411 | " data=[]\n", 412 | " for sentence in combined:\n", 413 | " new_txt = []\n", 414 | " for word in sentence:\n", 415 | " try:\n", 416 | " new_txt.append(w2indx[word])\n", 417 | " except:\n", 418 | " new_txt.append(0) # freqxiao10->0\n", 419 | " data.append(new_txt)\n", 420 | " return data # word=>index\n", 421 | " combined=parse_dataset(combined)\n", 422 | " combined= sequence.pad_sequences(combined, maxlen=maxlen)#每个句子所含词语对应的索引,所以句子中含有频数小于10的词语,索引为0\n", 423 | " return w2indx, w2vec,combined\n", 424 | " else:\n", 425 | " print 'No data provided...'\n", 426 | "\n", 427 | "\n", 428 | "def input_transform(string):\n", 429 | " words=jieba.lcut(string)\n", 430 | " words=np.array(words).reshape(1,-1)\n", 431 | " model=Word2Vec.load('../model/Word2vec_model.pkl')\n", 432 | " _,_,combined=create_dictionaries(model,words)\n", 433 | " return combined\n", 434 | "\n", 435 | "\n", 436 | "def lstm_predict(string):\n", 437 | " print 'loading model......'\n", 438 | " with open('../model/lstm.yml', 'r') as f:\n", 439 | " yaml_string = yaml.load(f)\n", 440 | " model = model_from_yaml(yaml_string)\n", 441 | "\n", 442 | " print 'loading weights......'\n", 443 | " model.load_weights('../model/lstm.h5')\n", 444 | " model.compile(loss='categorical_crossentropy',\n", 445 | " optimizer='adam',metrics=['accuracy'])\n", 446 | " data=input_transform(string)\n", 447 | " data.reshape(1,-1)\n", 448 | " #print data\n", 449 | " result=model.predict_classes(data)\n", 450 | 
" print result # [[1]]\n", 451 | " if result[0]==1:\n", 452 | " print string,' positive'\n", 453 | " elif result[0]==0:\n", 454 | " print string,' neural'\n", 455 | " else:\n", 456 | " print string,' negative'" 457 | ] 458 | }, 459 | { 460 | "cell_type": "code", 461 | "execution_count": 51, 462 | "metadata": {}, 463 | "outputs": [ 464 | { 465 | "name": "stdout", 466 | "output_type": "stream", 467 | "text": [ 468 | "loading model......\n", 469 | "loading weights......\n", 470 | "[1]\n", 471 | "不错不错 positive\n" 472 | ] 473 | } 474 | ], 475 | "source": [ 476 | "# string='酒店的环境非常好,价格也便宜,值得推荐'\n", 477 | "# string='手机质量太差了,傻逼店家,赚黑心钱,以后再也不会买了'\n", 478 | "# string = \"这是我看过文字写得很糟糕的书,因为买了,还是耐着性子看完了,但是总体来说不好,文字、内容、结构都不好\"\n", 479 | "# string = \"虽说是职场指导书,但是写的有点干涩,我读一半就看不下去了!\"\n", 480 | "# string = \"书的质量还好,但是内容实在没意思。本以为会侧重心理方面的分析,但实际上是婚外恋内容。\"\n", 481 | "# string = \"不是太好\"\n", 482 | "# string = \"不错不错\"\n", 483 | "string = \"非常好非常好!!\"\n", 484 | "# string = \"真的一般,没什么可以学习的\"\n", 485 | "\n", 486 | "lstm_predict(string)" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": null, 492 | "metadata": { 493 | "collapsed": true 494 | }, 495 | "outputs": [], 496 | "source": [] 497 | } 498 | ], 499 | "metadata": { 500 | "kernelspec": { 501 | "display_name": "Python 2", 502 | "language": "python", 503 | "name": "python2" 504 | }, 505 | "language_info": { 506 | "codemirror_mode": { 507 | "name": "ipython", 508 | "version": 2 509 | }, 510 | "file_extension": ".py", 511 | "mimetype": "text/x-python", 512 | "name": "python", 513 | "nbconvert_exporter": "python", 514 | "pygments_lexer": "ipython2", 515 | "version": "2.7.13" 516 | } 517 | }, 518 | "nbformat": 4, 519 | "nbformat_minor": 2 520 | } 521 | -------------------------------------------------------------------------------- /model/Word2vec_model.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Edward1Chou/SentimentAnalysis/9f4f3ca0b77d68694c442b04bad7d44576811cb5/model/Word2vec_model.pkl -------------------------------------------------------------------------------- /model/lstm.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Edward1Chou/SentimentAnalysis/9f4f3ca0b77d68694c442b04bad7d44576811cb5/model/lstm.h5 -------------------------------------------------------------------------------- /model/lstm.yml: -------------------------------------------------------------------------------- 1 | "backend: !!python/unicode 'tensorflow'\nclass_name: Sequential\nconfig:\n- class_name:\ 2 | \ Embedding\n config:\n activity_regularizer: null\n batch_input_shape: !!python/tuple\ 3 | \ [null, 100]\n dtype: float32\n embeddings_constraint: null\n embeddings_initializer:\n\ 4 | \ class_name: RandomUniform\n config: {maxval: 0.05, minval: -0.05, seed:\ 5 | \ null}\n embeddings_regularizer: null\n input_dim: 8305\n input_length:\ 6 | \ 100\n mask_zero: true\n name: embedding_4\n output_dim: 100\n trainable:\ 7 | \ true\n- class_name: LSTM\n config:\n activation: tanh\n activity_regularizer:\ 8 | \ null\n bias_constraint: null\n bias_initializer:\n class_name: Zeros\n\ 9 | \ config: {}\n bias_regularizer: null\n dropout: 0.0\n go_backwards:\ 10 | \ false\n implementation: 1\n kernel_constraint: null\n kernel_initializer:\n\ 11 | \ class_name: VarianceScaling\n config: {distribution: uniform, mode:\ 12 | \ fan_avg, scale: 1.0, seed: null}\n kernel_regularizer: null\n name: lstm_4\n\ 13 | \ recurrent_activation: 
hard_sigmoid\n recurrent_constraint: null\n recurrent_dropout:\ 14 | \ 0.0\n recurrent_initializer:\n class_name: Orthogonal\n config: {gain:\ 15 | \ 1.0, seed: null}\n recurrent_regularizer: null\n return_sequences: false\n\ 16 | \ return_state: false\n stateful: false\n trainable: true\n unit_forget_bias:\ 17 | \ true\n units: 50\n unroll: false\n use_bias: true\n- class_name: Dropout\n\ 18 | \ config: {name: dropout_4, noise_shape: null, rate: 0.5, seed: null, trainable:\ 19 | \ true}\n- class_name: Dense\n config:\n activation: softmax\n activity_regularizer:\ 20 | \ null\n bias_constraint: null\n bias_initializer:\n class_name: Zeros\n\ 21 | \ config: {}\n bias_regularizer: null\n kernel_constraint: null\n \ 22 | \ kernel_initializer:\n class_name: VarianceScaling\n config: {distribution:\ 23 | \ uniform, mode: fan_avg, scale: 1.0, seed: null}\n kernel_regularizer: null\n\ 24 | \ name: dense_4\n trainable: true\n units: 3\n use_bias: true\n- class_name:\ 25 | \ Activation\n config: {activation: softmax, name: activation_4, trainable: true}\n\ 26 | keras_version: 2.1.1\n" 27 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | Keras==1.1.1 2 | gensim==0.13.3 3 | jieba==0.38 4 | sklearn==0.0 5 | --------------------------------------------------------------------------------