├── .gitignore
├── LICENSE
├── README.md
├── SentimentAnalysis
│   ├── SentimentAnalysis.py
│   ├── __init__.py
│   ├── creat_data
│   │   ├── __init__.py
│   │   ├── ali.py
│   │   ├── baidu.py
│   │   ├── bat.py
│   │   ├── config.py
│   │   └── tencent.py
│   ├── data
│   │   └── traindata.xlsx
│   ├── data_info.json
│   ├── flask_api.py
│   ├── models
│   │   ├── __init__.py
│   │   ├── classify.h5
│   │   ├── classify.model
│   │   ├── keras_log_plot.py
│   │   ├── neural_bulit.py
│   │   ├── parameter
│   │   │   ├── __init__.py
│   │   │   └── optimizers.py
│   │   ├── sklearn_config.py
│   │   ├── sklearn_supervised.py
│   │   └── vocab_word2vec.model
│   └── sentence_transform
│       ├── __init__.py
│       ├── creat_vocab_word2vec.py
│       ├── sentence_2_sparse.py
│       └── sentence_2_tokenizer.py
├── demo.py
└── picture
    ├── Conv1D.png
    ├── SVM.png
    ├── api1.png
    ├── api2.png
    ├── api3.png
    └── label.png
/.gitignore:
--------------------------------------------------------------------------------
1 |
2 | .idea/misc.xml
3 | .idea/modules.xml
4 | .idea/Sentiment-analysis.iml
5 | *.xml
6 | *.pyc
7 | try.py
8 | creat_data/config.py
9 | creat_label_mysql.py
10 | SentimentAnalysis/creat_data/bat_1.py
11 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 renjunxiang
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Sentiment-analysis: Sentiment Analysis
2 |
12 |
13 | ## Language
14 | Python 3.5
15 | ## Dependencies
16 | baidu-aip==2.1.0.0
17 | pandas==0.21.0
18 | numpy==1.13.1
19 | jieba==0.39
20 | gensim==3.2.0
21 | scikit-learn==0.19.1
22 | keras==2.1.1
23 | requests==2.18.4
24 |
25 |
26 |
27 | ## Project Overview
28 | * Train on posts that already carry sentiment labels in order to classify new posts; the SentimentAnalysis folder can be used directly as a module.
29 | * On the machine-learning side, KNN, SVM and Logistic regression are wrapped; on the neural-network side, 1D convolution and LSTM. With a training set of 10,000 records, SVM performs best, with accuracy around 87%.
30 | * ***PS: This project is packaged on top of the previous project Text-Classification. There is still plenty of room for improvement; beginners and experts alike are welcome to offer advice!***
31 |
32 | ## Usage
33 | * ### Import the module and create a model
34 | ``` python
35 | from SentimentAnalysis.SentimentAnalysis import SentimentAnalysis
36 | model = SentimentAnalysis()
37 | ```
38 |
39 | * ### Use third-party platforms to create sentiment labels
40 | ``` python
41 | # When no labels are available, build a training set via the Baidu/Ali/Tencent (BAT) APIs; about 45 minutes for 5,000 documents
42 | texts=['国王喜欢吃苹果',
43 | '国王非常喜欢吃苹果',
44 | '国王讨厌吃苹果',
45 | '国王非常讨厌吃苹果']
46 | texts_withlabel=model.creat_label(texts)
47 | ```
48 |
49 | * ### Build the word2vec vocabulary with gensim
50 | ``` python
51 | model.creat_vocab(texts=texts,
52 | sg=0,
53 | size=5,
54 | window=5,
55 | min_count=1,
56 | vocab_savepath=os.getcwd() + '/vocab_word2vec.model')
57 | # alternatively, load an existing word2vec vocabulary
58 | model.load_vocab_word2vec(os.getcwd() + '/models/vocab_word2vec.model')
59 | # the word2vec vocabulary model
60 | model.vocab_word2vec
61 | ```
62 |
63 | * ### Machine learning with scikit-learn
64 | ``` python
65 | model.train(texts=train_data,
66 | label=train_label,
67 | model_name='SVM',
68 | model_savepath=os.getcwd() + '/classify.model')
69 | # alternatively, load a saved scikit-learn model
70 | model.load_model(model_loadpath=os.getcwd() + '/classify.model')
71 | # the trained model
72 | model.model
73 | # training-set labels
74 | model.label
75 | ```
76 |
77 | * ### Deep learning with Keras (note the different model file extension)
78 | ``` python
79 | model.train(texts=train_data,
80 | label=train_label,
81 | model_name='Conv1D',
82 | batch_size=100,
83 | epochs=2,
84 | verbose=1,
85 | maxlen=None,
86 | model_savepath=os.getcwd() + '/classify.h5')
87 |
88 | # load the saved Keras model
89 | model.load_model(model_loadpath=os.getcwd() + '/classify.h5')
90 | # the trained model
91 | model.model
92 | # the training log
93 | model.train_log
94 | # visualize the training process
95 | from SentimentAnalysis.models.keras_log_plot import keras_log_plot
96 | keras_log_plot(model.train_log)
97 | # training-set labels
98 | model.label
99 | ```
100 |
101 | * ### Prediction
102 | ``` python
103 | # probabilities
104 | result_prob = model.predict_prob(texts=test_data)
105 | result_prob = pd.DataFrame(result_prob, columns=model.label)
106 | result_prob['predict'] = result_prob.idxmax(axis=1)
107 | result_prob['data'] = test_data
108 | result_prob = result_prob[['data'] + list(model.label) + ['predict']]
109 | print('prob:\n', result_prob)
110 |
111 | # classes
112 | result = model.predict(texts=test_data)
113 | print('score:', np.sum(result == np.array(test_label)) / len(result))
114 | result = pd.DataFrame({'data': test_data,
115 | 'label': test_label,
116 | 'predict': result},
117 | columns=['data', 'label', 'predict'])
118 | print('test\n', result)
119 | ```
120 |
121 | * ### Start the API
122 | ``` python
123 | # a trained model must exist first
124 | model.open_api()
125 | # http://0.0.0.0:5000/SentimentAnalyse/?model_name=<model name>&prob=<1 for probabilities, 0 for the class>&text=<text to classify>
126 | ```
127 |
128 | * ### Other notes
129 | When the training set is very small, sklearn's probability output (predict_proba) can be unreliable. In particular, SVM sometimes returns identical probabilities for every label. I have not read the source code yet; my guess is that probabilities are not computed properly for samples too close to the hyperplane. predict does not show this problem (see the sketch below).
130 |
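A minimal sketch of the effect, using scikit-learn's SVC directly rather than this project's wrapper (the two-dimensional "averaged word vectors" and labels below are made up purely for illustration). With `probability=True`, SVC fits Platt scaling through internal cross-validation, which needs more data than a toy training set provides, so `predict_proba` can come out near-uniform or disagree with `predict`:

``` python
import numpy as np
from sklearn.svm import SVC

# made-up averaged word vectors, four per class, for illustration only
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.8, 0.1],
              [0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.1, 0.8]])
y = np.array(['正面', '正面', '正面', '正面', '负面', '负面', '负面', '负面'])

model = SVC(kernel='linear', C=1.0, probability=True)
model.fit(X, y)

print(model.predict(X))        # labels come straight from the decision function
print(model.predict_proba(X))  # Platt-scaled probabilities; poorly calibrated on tiny data
```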
131 | ## A simple demo
132 | ``` python
133 | from SentimentAnalysis.SentimentAnalysis import SentimentAnalysis
134 | from SentimentAnalysis.models.keras_log_plot import keras_log_plot
135 | import os, numpy as np, pandas as pd
136 |
137 | train_data = ['国王喜欢吃苹果',
138 | '国王非常喜欢吃苹果',
139 | '国王讨厌吃苹果',
140 | '国王非常讨厌吃苹果']
141 | train_label = ['正面', '正面', '负面', '负面']
142 | # print('train data\n',
143 | # pd.DataFrame({'data': train_data,
144 | # 'label': train_label},
145 | # columns=['data', 'label']))
146 | test_data = ['涛哥喜欢吃苹果',
147 | '涛哥讨厌吃苹果',
148 | '涛哥非常喜欢吃苹果',
149 | '涛哥非常讨厌吃苹果']
150 | test_label = ['正面', '负面', '正面', '负面']
151 |
152 | # create the model
153 | model = SentimentAnalysis()
154 |
155 | # inspect the labels produced by the BAT APIs
156 | print(model.creat_label(test_data))
157 |
158 | # build the word2vec vocabulary
159 | model.creat_vocab(texts=train_data,
160 | sg=0,
161 | size=5,
162 | window=5,
163 | min_count=1,
164 | vocab_savepath=os.getcwd() + '/vocab_word2vec.model')
165 |
166 | # load the word2vec vocabulary
167 | # model.load_vocab_word2vec(vocab_loadpath=os.getcwd() + '/vocab_word2vec.model')
168 |
169 | ###################################################################################
170 | # machine learning
171 | model.train(texts=train_data,
172 | label=train_label,
173 | model_name='SVM',
174 | model_savepath=os.getcwd() + '/classify.model')
175 |
176 | # load the scikit-learn model
177 | # model.load_model(model_loadpath=os.getcwd() + '/classify.model')
178 |
179 | # prediction: probabilities
180 | result_prob = model.predict_prob(texts=test_data)
181 | result_prob = pd.DataFrame(result_prob, columns=model.label)
182 | result_prob['predict'] = result_prob.idxmax(axis=1)
183 | result_prob['data'] = test_data
184 | result_prob = result_prob[['data'] + list(model.label) + ['predict']]
185 | print('prob:\n', result_prob)
186 |
187 | # prediction: classes
188 | result = model.predict(texts=test_data)
189 | print('score:', np.sum(result == np.array(test_label)) / len(result))
190 | result = pd.DataFrame({'data': test_data,
191 | 'label': test_label,
192 | 'predict': result},
193 | columns=['data', 'label', 'predict'])
194 | print('test\n', result)
195 | ###################################################################################
196 | # deep learning
197 | model.train(texts=train_data,
198 | label=train_label,
199 | model_name='Conv1D',
200 | batch_size=100,
201 | epochs=2,
202 | verbose=1,
203 | maxlen=None,
204 | model_savepath=os.getcwd() + '/classify.h5')
205 |
206 | # load the Keras model
207 | # model.load_model(model_loadpath=os.getcwd() + '/classify.h5')
208 |
209 | # prediction: probabilities
210 | result_prob = model.predict_prob(texts=test_data)
211 | result_prob = pd.DataFrame(result_prob, columns=model.label)
212 | result_prob['predict'] = result_prob.idxmax(axis=1)
213 | print(result_prob)
214 |
215 | # prediction: classes
216 | result = model.predict(texts=test_data)
217 | print(result)
218 | print('score:', np.sum(result == np.array(test_label)) / len(result))
219 | result = pd.DataFrame({'data': test_data,
220 | 'label': test_label,
221 | 'predict': result},
222 | columns=['data', 'label', 'predict'])
223 | print('test\n', result)
224 |
225 | keras_log_plot(model.train_log)
226 |
227 | ```
228 | Labels from the BAT APIs
229 | 
230 | SVM
231 | 
232 | Conv1D
233 | 
234 |
235 | ## Simple API
236 | An API is provided: http://192.168.3.59:5000/SentimentAnalyse/?model_name=<model name>&prob=<return probabilities or not>&text=<text to classify>
237 | 192.168.3.59: the IP address, determined by the host running the server
238 | model_name: currently supports SVM and Conv1D
239 | prob: 0 returns the class, 1 returns the probabilities (a request example follows the snippet below)
240 | ``` python
241 | from SentimentAnalysis import SentimentAnalysis
242 | model = SentimentAnalysis()
243 | model.open_api()
244 | ```
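Once the server is up, the endpoint can be queried with a plain GET request. A minimal sketch, assuming the API is reachable on this machine at the default Flask port 5000 (adjust the host to wherever `open_api` is running):

``` python
import requests

# assumes model.open_api() is running and listening on port 5000 of this machine
resp = requests.get('http://127.0.0.1:5000/SentimentAnalyse/',
                    params={'model_name': 'SVM', 'prob': 1, 'text': '东西很不错'})
print(resp.json())  # per-label probabilities plus the predicted class
```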
245 |
246 | ### Examples
247 |
248 | * __SVM model, returning probabilities__
249 | url:http://192.168.3.59:5000/SentimentAnalyse/?model_name=SVM&prob=1&text=东西很不错
250 | 
251 |
252 | * __Conv1D model, returning the class__
253 | url:http://192.168.3.59:5000/SentimentAnalyse/?model_name=Conv1D&prob=0&text=东西很不错
254 | 
255 |
256 | * __None of the words in the text are in the word2vec vocabulary__
257 | url:http://192.168.3.59:5000/SentimentAnalyse/?model_name=SVM&prob=0&text=呜呜呜
258 | 
259 |
260 |
261 |
262 |
263 |
264 |
265 |
266 |
267 |
268 |
269 |
270 |
271 |
272 |
--------------------------------------------------------------------------------
/SentimentAnalysis/SentimentAnalysis.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import os
4 | from sklearn.externals import joblib
5 | from gensim.models import word2vec
6 | from keras.preprocessing.sequence import pad_sequences
7 | from keras.callbacks import History
8 | from keras.models import load_model
9 | import jieba
10 | import json
11 |
12 | from SentimentAnalysis.creat_data import bat
13 | from SentimentAnalysis.sentence_transform.creat_vocab_word2vec import creat_vocab_word2vec
14 | from SentimentAnalysis.models.sklearn_supervised import sklearn_supervised
15 | from SentimentAnalysis.models import sklearn_config
16 | from SentimentAnalysis.models.neural_bulit import neural_bulit
17 |
18 | jieba.setLogLevel('WARN')
19 | DIR = os.path.dirname(__file__)
20 |
21 | class SentimentAnalysis():
22 | # def __init__(self):
23 | # pass
24 |
25 | # open api
26 | def open_api(self):
27 | os.system("python %s" % (DIR + '/flask_api.py'))
28 | # os.system blocks until the Flask process exits
29 |
30 | # get labels from baidu,ali,tencent
31 | def creat_label(self, texts):
32 | results_dataframe = bat.creat_label(texts)
33 | return results_dataframe
34 |
35 | def creat_vocab(self,
36 | texts=None,
37 | sg=0,
38 | size=5,
39 | window=5,
40 | min_count=1,
41 | vocab_savepath=DIR + '/models/vocab_word2vec.model'):
42 | '''
43 | get dictionary by word2vec
44 | :param texts: list of text
45 | :param sg: 0 CBOW,1 skip-gram
46 | :param size: the dimensionality of the feature vectors
47 | :param window: the maximum distance between the current and predicted word within a sentence
48 | :param min_count: ignore all words with total frequency lower than this
49 | :param vocab_savepath: path to save word2vec dictionary
50 | :return: None
51 | '''
52 | # build the word2vec vocabulary
53 | self.vocab_word2vec = creat_vocab_word2vec(texts=texts,
54 | sg=sg,
55 | vocab_savepath=vocab_savepath,
56 | size=size,
57 | window=window,
58 | min_count=min_count)
59 |
60 | def load_vocab_word2vec(self,
61 | vocab_loadpath=DIR + '/models/vocab_word2vec.model'):
62 | '''
63 | load dictionary
64 | :param vocab_loadpath: path to load word2vec dictionary
65 | :return:
66 | '''
67 | self.vocab_word2vec = word2vec.Word2Vec.load(vocab_loadpath)
68 |
69 | def train(self,
70 | texts=None,
71 | label=None,
72 | model_name='SVM',
73 | model_savepath=DIR + '/models/classify.model',
74 | net_shape=None,
75 | batch_size=100,  # neural-network (Keras) parameter
76 | epochs=2,  # neural-network (Keras) parameter
77 | verbose=2,  # neural-network (Keras) parameter
78 | maxlen=None,  # neural-network (Keras) parameter
79 | **sklearn_param):
80 | '''
81 | use sklearn/keras to train model
82 | :param texts: x
83 | :param label: y
84 | :param model_name: name want to train
85 | :param model_savepath: model save path
86 | :param batch_size: for keras fit
87 | :param epochs: for keras fit
88 | :param verbose: for keras fit
89 | :param maxlen: for keras pad_sequences
90 | :param sklearn_param: param for sklearn
91 | :return: None
92 | '''
93 | self.model_name = model_name
94 | self.label = np.unique(np.array(label))
95 | # convert texts to word vectors
96 | vocab_word2vec = self.vocab_word2vec
97 | texts_cut = [[word for word in jieba.lcut(one_text) if word != ' '] for one_text in texts]  # word segmentation
98 | data = [[vocab_word2vec[word] for word in one_text if word in vocab_word2vec] for one_text in texts_cut]
99 | if maxlen is None:
100 | maxlen = max([len(i) for i in texts_cut])
101 | self.maxlen = maxlen
102 | # sklearn models: average the word vectors of each text
103 | if model_name in ['SVM', 'KNN', 'Logistic']:
104 | data = [sum(i) / len(i) for i in data]
105 | # configure the sklearn model parameters
106 | if model_name == 'SVM':
107 | if sklearn_param == {}:
108 | sklearn_param = sklearn_config.SVC
109 | elif model_name == 'KNN':
110 | if sklearn_param == {}:
111 | sklearn_param = sklearn_config.KNN
112 | elif model_name == 'Logistic':
113 | if sklearn_param == {}:
114 | sklearn_param = sklearn_config.Logistic
115 | # train and keep the sklearn model
116 | self.model = sklearn_supervised(data=data,
117 | label=label,
118 | model_savepath=model_savepath,
119 | model_name=model_name,
120 | **sklearn_param)
121 |
122 | # Keras neural-network models
123 | elif model_name in ['Conv1D_LSTM', 'Conv1D', 'LSTM']:
124 | data = pad_sequences(data, maxlen=maxlen, padding='post', value=0, dtype='float32')
125 | label_transform = np.array(pd.get_dummies(label))
126 | if net_shape is None:
127 | if model_name == 'Conv1D_LSTM':
128 | net_shape = [
129 | {'name': 'InputLayer', 'input_shape': data.shape[1:]},
130 | {'name': 'Conv1D', 'filters': 64, 'kernel_size': 3, 'strides': 1, 'padding': 'same',
131 | 'activation': 'relu'},
132 | {'name': 'MaxPooling1D', 'pool_size': 5, 'padding': 'same', 'strides': 2},
133 | {'name': 'LSTM', 'units': 16, 'activation': 'tanh', 'recurrent_activation': 'hard_sigmoid',
134 | 'dropout': 0., 'recurrent_dropout': 0.},
135 | {'name': 'Flatten'},
136 | {'name': 'Dense', 'activation': 'relu', 'units': 64},
137 | {'name': 'Dropout', 'rate': 0.2, },
138 | {'name': 'softmax', 'activation': 'softmax', 'units': len(np.unique(label))}
139 | ]
140 |
141 | elif model_name == 'LSTM':
142 | net_shape = [
143 | {'name': 'InputLayer', 'input_shape': data.shape[1:]},
144 | {'name': 'Masking'},
145 | {'name': 'LSTM', 'units': 16, 'activation': 'tanh', 'recurrent_activation': 'hard_sigmoid',
146 | 'dropout': 0., 'recurrent_dropout': 0.},
147 | {'name': 'Dense', 'activation': 'relu', 'units': 64},
148 | {'name': 'Dropout', 'rate': 0.2, },
149 | {'name': 'softmax', 'activation': 'softmax', 'units': len(np.unique(label))}
150 | ]
151 | elif model_name == 'Conv1D':
152 | net_shape = [
153 | {'name': 'InputLayer', 'input_shape': data.shape[1:]},
154 | {'name': 'Conv1D', 'filters': 64, 'kernel_size': 3, 'strides': 1, 'padding': 'same',
155 | 'activation': 'relu'},
156 | {'name': 'MaxPooling1D', 'pool_size': 5, 'padding': 'same', 'strides': 2},
157 | {'name': 'Flatten'},
158 | {'name': 'Dense', 'activation': 'relu', 'units': 64},
159 | {'name': 'Dropout', 'rate': 0.2, },
160 | {'name': 'softmax', 'activation': 'softmax', 'units': len(np.unique(label))}
161 | ]
162 |
163 | model = neural_bulit(net_shape=net_shape,
164 | optimizer_name='Adagrad',
165 | lr=0.001,
166 | loss='categorical_crossentropy')
167 | history = History()
168 | model.fit(data, label_transform,
169 | batch_size=batch_size, epochs=epochs, verbose=verbose, callbacks=[history])
170 | train_log = pd.DataFrame(history.history)
171 | self.model = model
172 | self.train_log = train_log
173 | if model_savepath != None:
174 | model.save(model_savepath)
175 | with open(DIR + '/data_info.json', mode='w', encoding='utf-8') as f:
176 | json.dump({'maxlen': maxlen, 'label': list(self.label)}, f)
177 |
178 | def load_model(self,
179 | model_loadpath=DIR + '/models/classify.model',
180 | model_name=None,
181 | data_info_path=DIR + '/data_info.json'):
182 | '''
183 | load sklearn/keras model
184 | :param model_loadpath: path to load sklearn/keras model
185 | :param model_name: load model name
186 | :param data_info_path: date information path
187 | :return: None
188 | '''
189 |
190 | with open(data_info_path, encoding='utf-8') as f:
191 | data_info = json.load(f)
192 | self.maxlen = data_info['maxlen']
193 | self.label = data_info['label']
194 | self.model_name = model_name
195 |
196 | if self.model_name in ['SVM', 'KNN', 'Logistic']:
197 | self.model = joblib.load(model_loadpath)
198 | elif self.model_name in ['Conv1D_LSTM', 'Conv1D', 'LSTM']:
199 | self.model = load_model(model_loadpath)
200 |
201 | def predict_prob(self,
202 | texts=None):
203 | '''
204 | predict probability
205 | :param texts: list of text
206 | :return: list of probability
207 | '''
208 | # convert texts to word vectors
209 | vocab_word2vec = self.vocab_word2vec
210 | if self.model_name in ['SVM', 'KNN', 'Logistic']:
211 | texts_cut = [[word for word in jieba.lcut(one_text) if word != ' '] for one_text in texts]  # word segmentation
212 | data = [[vocab_word2vec[word] for word in one_text if word in vocab_word2vec] for one_text in texts_cut]
213 | data = [sum(i) / len(i) for i in data]
214 | self.testdata = data
215 | results = self.model.predict_proba(data)
216 | elif self.model_name in ['Conv1D_LSTM', 'Conv1D', 'LSTM']:
217 | texts_cut = [[word for word in jieba.lcut(one_text) if word != ' '] for one_text in texts]  # word segmentation
218 | data = [[vocab_word2vec[word] for word in one_text if word in vocab_word2vec] for one_text in texts_cut]
219 | data = pad_sequences(data, maxlen=self.maxlen, padding='post', value=0, dtype='float32')
220 | self.testdata = data
221 | results = self.model.predict(data)
222 | return results
223 |
224 | def predict(self,
225 | texts=None):
226 | '''
227 | predict class
228 | :param texts: list of text
229 | :return: list of classify
230 | '''
231 | # convert texts to word vectors
232 | vocab_word2vec = self.vocab_word2vec
233 | if self.model_name in ['SVM', 'KNN', 'Logistic']:
234 | texts_cut = [[word for word in jieba.lcut(one_text) if word != ' '] for one_text in texts]  # word segmentation
235 | data = [[vocab_word2vec[word] for word in one_text if word in vocab_word2vec] for one_text in texts_cut]
236 | data = [sum(i) / len(i) for i in data]
237 | self.testdata = data
238 | results = self.model.predict(data)
239 | elif self.model_name in ['Conv1D_LSTM', 'Conv1D', 'LSTM']:
240 | texts_cut = [[word for word in jieba.lcut(one_text) if word != ' '] for one_text in texts]  # word segmentation
241 | data = [[vocab_word2vec[word] for word in one_text if word in vocab_word2vec] for one_text in texts_cut]
242 | data = pad_sequences(data, maxlen=self.maxlen, padding='post', value=0, dtype='float32')
243 | self.testdata = data
244 | results = self.model.predict(data)
245 | results = pd.DataFrame(results, columns=self.label)
246 | results = results.idxmax(axis=1)
247 | return results
248 |
249 |
250 | if __name__ == '__main__':
251 | train_data = ['国王喜欢吃苹果',
252 | '国王非常喜欢吃苹果',
253 | '国王讨厌吃苹果',
254 | '国王非常讨厌吃苹果']
255 | train_label = ['正面', '正面', '负面', '负面']
256 | # print('train data\n',
257 | # pd.DataFrame({'data': train_data,
258 | # 'label': train_label},
259 | # columns=['data', 'label']))
260 | test_data = ['涛哥喜欢吃苹果',
261 | '涛哥讨厌吃苹果',
262 | '涛哥非常喜欢吃苹果',
263 | '涛哥非常讨厌吃苹果']
264 | test_label = ['正面', '负面', '正面', '负面']
265 |
266 | # create the model
267 | model = SentimentAnalysis()
268 |
269 | # inspect the labels produced by the BAT APIs
270 | print(model.creat_label(test_data))
271 |
272 | model.creat_vocab(texts=train_data,
273 | sg=0,
274 | size=5,
275 | window=5,
276 | min_count=1,
277 | vocab_savepath=None)
278 |
279 | # load the word2vec vocabulary
280 | # model.load_vocab_word2vec(vocab_loadpath=DIR + '/vocab_word2vec.model')
281 |
282 | ###################################################################################
283 | # machine learning
284 | model.train(texts=train_data,
285 | label=train_label,
286 | model_name='SVM',
287 | model_savepath=DIR + '/models/classify.model')
288 |
289 | # load the scikit-learn model
290 | model.load_model(model_loadpath=DIR + '/models/classify.model',
291 | model_name='SVM',
292 | data_info_path=DIR + '/data_info.json')
293 |
294 | # prediction: probabilities
295 | result_prob = model.predict_prob(texts=test_data)
296 | result_prob = pd.DataFrame(result_prob, columns=model.label)
297 | result_prob['predict'] = result_prob.idxmax(axis=1)
298 | result_prob['data'] = test_data
299 | result_prob = result_prob[['data'] + list(model.label) + ['predict']]
300 | print('prob:\n', result_prob)
301 | print('score:', np.sum(result_prob['predict'] == np.array(test_label)) / len(result_prob['predict']))
302 |
--------------------------------------------------------------------------------
/SentimentAnalysis/__init__.py:
--------------------------------------------------------------------------------
1 | __all__ = ['models','sentence_transform','data','test','creat_data']
2 |
--------------------------------------------------------------------------------
/SentimentAnalysis/creat_data/__init__.py:
--------------------------------------------------------------------------------
1 | __all__ = ['baidu','ali','tencent','bat','config']
2 |
--------------------------------------------------------------------------------
/SentimentAnalysis/creat_data/ali.py:
--------------------------------------------------------------------------------
1 | from SentimentAnalysis.creat_data.config import ali
2 | import datetime
3 | import hashlib
4 | import base64
5 | from urllib.parse import urlparse
6 | import hmac
7 | import pandas as pd
8 | import numpy as np
9 | import requests
10 | import json
11 | import time
12 |
13 | org_code = ali['account']['id_1']['org_code']
14 | akID = ali['account']['id_1']['akID']
15 | akSecret = ali['account']['id_1']['akSecret']
16 |
17 | def creat_label(texts,
18 | org_code=org_code,
19 | akID=akID,
20 | akSecret=akSecret):
21 | '''
22 | :param texts: list of documents to label
23 | :param org_code: Aliyun account info, defaults to id_1 in the config file
24 | :param akID, akSecret: Aliyun account info, defaults to id_1 in the config file
25 | :return: list of labelled results: original document, label, confidence, status
26 | '''
27 | url = org_code.join(ali['api']['Sentiment']['url'].split('{org_code}'))
28 |
29 | results = []
30 |
31 | def to_sha1_base64(stringToSign, akSecret):
32 | hmacsha1 = hmac.new(akSecret.encode('utf-8'),
33 | stringToSign.encode('utf-8'),
34 | hashlib.sha1)
35 | return base64.b64encode(hmacsha1.digest()).decode('utf-8')
36 |
37 | # call the API for each sentence
38 | count_i=0
39 | for one_text in texts:
40 | # one_text = '喜欢'
41 | time_now = datetime.datetime.strftime(datetime.datetime.utcnow(), "%a, %d %b %Y %H:%M:%S GMT")
42 | # time_now = time.strftime("%a, %d %b %Y %H:%M:%S GMT", time.localtime())  # this also works
43 | options = {'url': url,
44 | 'method': 'POST',
45 | 'headers': {'accept': 'application/json',
46 | 'content-type': 'application/json',
47 | 'date': time_now,
48 | 'authorization': ''},
49 | 'body': json.dumps({'text': one_text}, separators=(',', ':'))}
50 |
51 | body = ''
52 | if 'body' in options:
53 | body = options['body']
54 | # print(body)
55 | bodymd5 = ''
56 | if not body == '':
57 | bodymd5 = base64.b64encode(
58 | hashlib.md5(json.dumps({'text': one_text}, separators=(',', ':')).encode('utf-8')).digest()).decode(
59 | 'utf-8')
60 |
61 | # print(bodymd5)
62 |
63 | urlPath = urlparse(url)
64 | if urlPath.query != '':
65 | urlPath = urlPath.path + "?" + urlPath.query
66 | else:
67 | urlPath = urlPath.path
68 | stringToSign = 'POST' + '\n' + \
69 | options['headers']['accept'] + '\n' + \
70 | bodymd5 + '\n' + \
71 | options['headers']['content-type'] + '\n' \
72 | + options['headers']['date'] + '\n' + urlPath
73 |
74 | # print(stringToSign)
75 | signature = to_sha1_base64(stringToSign=stringToSign,
76 | akSecret=akSecret)
77 | # print(signature)
78 | authHeader = 'Dataplus ' + akID + ':' + signature
79 | # print(authHeader)
80 | options['headers']['authorization'] = authHeader
81 | r = requests.post(url=url,
82 | headers={'accept': 'application/json',
83 | 'content-type': 'application/json',
84 | 'date': time_now,
85 | 'authorization': authHeader},
86 | data=json.dumps({'text': one_text}, separators=(',', ':')))  # get the analysis result
87 | try:
88 | result = json.loads(r.text)
89 | # print(result)
90 | results.append([one_text,
91 | result['data']['text_polarity'],
92 | 0,
93 | 'ok'
94 | ])
95 | except:
96 | results.append([one_text,
97 | -100,
98 | -100,
99 | 'error'
100 | ])
101 | count_i += 1
102 | if count_i % 50 == 0:
103 | print('ali finish:%d' % (count_i))
104 | r.close()
105 | return results
106 |
107 |
108 | if __name__ == '__main__':
109 | results = creat_label(texts=['价格便宜啦,比原来优惠多了',
110 | '壁挂效果差,果然一分价钱一分货',
111 | '东西一般般,诶呀',
112 | '快递非常快,电视很惊艳,非常喜欢',
113 | '到货很快,师傅很热情专业。',
114 | '讨厌你',
115 | '一般'
116 | ])
117 | results = pd.DataFrame(results, columns=['evaluation',
118 | 'label',
119 | 'ret',
120 | 'msg'])
121 | results['label'] = np.where(results['label'] == '1', '正面',
122 | np.where(results['label'] == '0', '中性',
123 | np.where(results['label'] == '-1', '负面', '非法')))
124 | print(results)
125 |
126 |
127 |
--------------------------------------------------------------------------------
/SentimentAnalysis/creat_data/baidu.py:
--------------------------------------------------------------------------------
1 | # from aip import AipNlp  # no longer usable
2 | from SentimentAnalysis.creat_data.config import baidu
3 | import pandas as pd
4 | import numpy as np
5 | import json
6 | import requests
7 |
8 | APP_ID = baidu['account']['id_1']['APP_ID']
9 | API_KEY = baidu['account']['id_1']['API_KEY']
10 | SECRET_KEY = baidu['account']['id_1']['SECRET_KEY']
11 |
12 |
13 | # call the API for each sentence
14 | def creat_label(texts,
15 | interface='SDK',
16 | APP_ID=APP_ID,
17 | API_KEY=API_KEY,
18 | SECRET_KEY=SECRET_KEY):
19 | '''
20 | :param texts: list of documents to label
21 | :param interface: access method, 'SDK' or 'API'
22 | :param APP_ID: Baidu AI account info, defaults to id_1 in the config file
23 | :param API_KEY: Baidu AI account info, defaults to id_1 in the config file
24 | :param SECRET_KEY: Baidu AI account info, defaults to id_1 in the config file
25 | :return: list of labelled results: original document, label, confidence, positive/negative probabilities, status
26 | '''
27 | # create the connection
28 |
29 | results = []
30 | if interface == 'SDK':
31 | pass
32 | # client = AipNlp(APP_ID=APP_ID,
33 | # API_KEY=API_KEY,
34 | # SECRET_KEY=SECRET_KEY)
35 | # for one_text in texts:
36 | # result = client.sentimentClassify(one_text)
37 | # if 'error_code' in result:
38 | # results.append([one_text,
39 | # 0,
40 | # 0,
41 | # 0,
42 | # 0,
43 | # result['error_code'],
44 | # result['error_msg']
45 | # ])
46 | # else:
47 | # results.append([one_text,
48 | # result['items'][0]['sentiment'],
49 | # result['items'][0]['confidence'],
50 | # result['items'][0]['positive_prob'],
51 | # result['items'][0]['negative_prob'],
52 | # 0,
53 | # 'ok'
54 | # ])
55 | elif interface == 'API':
56 | # obtain the access_token
57 | url = baidu['access_token_url']
58 | params = {'grant_type': 'client_credentials',
59 | 'client_id': baidu['account']['id_1']['API_KEY'],
60 | 'client_secret': baidu['account']['id_1']['SECRET_KEY']}
61 | r = requests.post(url, params=params)
62 | access_token = json.loads(r.text)['access_token']
63 | r.close()
64 |
65 | url = baidu['api']['sentiment_classify']['url']
66 | params = {'access_token': access_token}
67 | headers = {'Content-Type': baidu['api']['sentiment_classify']['Content-Type']}
68 | count_i=0
69 | for one_text in texts:
70 | data = json.dumps({'text': one_text})
71 | r = requests.post(url=url,
72 | params=params,
73 | headers=headers,
74 | data=data)
75 | result = json.loads(r.text)
76 | if 'error_code' in result:
77 | results.append([one_text,
78 | 0,
79 | 0,
80 | 0,
81 | 0,
82 | result['error_code'],
83 | result['error_msg']
84 | ])
85 | else:
86 | results.append([one_text,
87 | result['items'][0]['sentiment'],
88 | result['items'][0]['confidence'],
89 | result['items'][0]['positive_prob'],
90 | result['items'][0]['negative_prob'],
91 | 0,
92 | 'ok'
93 | ])
94 | count_i += 1
95 | if count_i % 50 == 0:
96 | print('baidu finish:%d' % (count_i))
97 | r.close()
98 | else:
99 | print('ERROR: No interface named %s' % (interface))
100 | return results
101 |
102 |
103 | if __name__ == '__main__':
104 | results = creat_label(texts=['价格便宜啦,比原来优惠多了',
105 | '壁挂效果差,果然一分价钱一分货',
106 | '东西一般般,诶呀',
107 | '讨厌你',
108 | '一般'],
109 | interface='API')
110 | results = pd.DataFrame(results, columns=['evaluation',
111 | 'label',
112 | 'confidence',
113 | 'positive_prob',
114 | 'negative_prob',
115 | 'ret',
116 | 'msg'])
117 | results['label'] = np.where(results['label'] == 2,
118 | '正面',
119 | np.where(results['label'] == 1, '中性', '负面'))
120 | print(results)
121 |
--------------------------------------------------------------------------------
/SentimentAnalysis/creat_data/bat.py:
--------------------------------------------------------------------------------
1 | from SentimentAnalysis.creat_data import baidu, ali, tencent
2 | import pandas as pd
3 | # from collections import OrderedDict
4 | import numpy as np
5 |
6 |
7 | def creat_label(texts):
8 | results = []
9 | count_i = 0
10 | for one_text in texts:
11 | result_baidu = baidu.creat_label([one_text], interface='API')
12 | result_ali = ali.creat_label([one_text])
13 | result_tencent = tencent.creat_label([one_text])
14 |
15 | result_all = [one_text,
16 | result_baidu[0][1], result_baidu[0][6],
17 | result_ali[0][1], result_ali[0][3],
18 | result_tencent[0][1], result_tencent[0][4]]
19 | results.append(result_all)
20 |
21 | # result = OrderedDict()
22 | # result['evaluation'] = result_all[0]
23 | # result['label_baidu'] = result_all[1]
24 | # result['msg_baidu'] = result_all[2]
25 | # result['label_ali'] = result_all[3]
26 | # result['msg_ali'] = result_all[4]
27 | # result['label_tencent'] = result_all[5]
28 | # result['msg_tencent'] = result_all[6]
29 |
30 | count_i += 1
31 | if count_i % 50 == 0:
32 | print('bat finish:%d' % (count_i))
33 |
34 | results_dataframe = pd.DataFrame(results,
35 | columns=['evaluation',
36 | 'label_baidu', 'msg_baidu',
37 | 'label_ali', 'msg_ali',
38 | 'label_tencent', 'msg_tencent'])
39 | results_dataframe['label_baidu'] = np.where(results_dataframe['label_baidu'] == 2,
40 | '正面',
41 | np.where(results_dataframe['label_baidu'] == 1, '中性', '负面'))
42 | results_dataframe['label_ali'] = np.where(results_dataframe['label_ali'] == '1', '正面',
43 | np.where(results_dataframe['label_ali'] == '0', '中性',
44 | np.where(results_dataframe['label_ali'] == '-1', '负面', '非法')))
45 | results_dataframe['label_tencent'] = np.where(results_dataframe['label_tencent'] == 1, '正面',
46 | np.where(results_dataframe['label_tencent'] == 0, '中性', '负面'))
47 | return results_dataframe
48 |
49 |
50 | if __name__ == '__main__':
51 | print(creat_label(['价格便宜啦,比原来优惠多了',
52 | '壁挂效果差,果然一分价钱一分货',
53 | '东西一般般,诶呀',
54 | '讨厌你',
55 | '一般'
56 | ]))
57 |
--------------------------------------------------------------------------------
/SentimentAnalysis/creat_data/config.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | baidu = {'access_token_url': 'https://aip.baidubce.com/oauth/2.0/token',
4 | 'api': {
5 | 'sentiment_classify': {
6 | 'Content-Type': 'application/json',
7 | 'url': 'https://aip.baidubce.com/rpc/2.0/nlp/v1/sentiment_classify'}},
8 | 'account': {
9 | 'id_1': {'APP_ID': '000',
10 | 'API_KEY': '000',
11 | 'SECRET_KEY': '000'},
12 | 'id_2': {'APP_ID': '000',
13 | 'API_KEY': '000',
14 | 'SECRET_KEY': '000'}}
15 | }
16 |
17 | tencent = {'api': {
18 | 'nlp_textpolar': {
19 | 'url': 'https://api.ai.qq.com/fcgi-bin/nlp/nlp_textpolar'}},
20 | 'account': {
21 | 'id_1': {'APP_ID': '000',
22 | 'AppKey': '000'},
23 | 'id_2': {'APP_ID': '000',
24 | 'API_KEY': '000'}}
25 | }
26 |
27 | ali = {'api': {
28 | 'Sentiment': {
29 | 'url': 'https://dtplus-cn-shanghai.data.aliyuncs.com/{org_code}/nlp/api/Sentiment/ecommerce'}},
30 | 'account': {
31 | 'id_1': {'org_code': '000',
32 | 'akID': '000',
33 | 'akSecret': '000'
34 | },
35 | 'id_2': {'org_code': '000',
36 | 'akID': '000',
37 | 'akSecret': '000'
38 | }}
39 | }
40 |
41 | label_path=os.getcwd()
42 |
43 |
--------------------------------------------------------------------------------
/SentimentAnalysis/creat_data/tencent.py:
--------------------------------------------------------------------------------
1 | from SentimentAnalysis.creat_data.config import tencent
2 | import pandas as pd
3 | import numpy as np
4 | import requests
5 | import json
6 | import time
7 | import random
8 | import hashlib
9 | from urllib import parse
10 | from collections import OrderedDict
11 |
12 | AppID = tencent['account']['id_1']['APP_ID']
13 | AppKey = tencent['account']['id_1']['AppKey']
14 |
15 | def cal_sign(params_raw,AppKey=AppKey):
16 | # the official docs give a PHP example; this is the Python equivalent
17 | # params_raw = {'app_id': '10000',
18 | # 'time_stamp': '1493449657',
19 | # 'nonce_str': '20e3408a79',
20 | # 'key1': '腾讯AI开放平台',
21 | # 'key2': '示例仅供参考',
22 | # 'sign': ''}
23 | # AppKey = 'a95eceb1ac8c24ee28b70f7dbba912bf'
24 | # cal_sign(params_raw=params_raw,
25 | # AppKey=AppKey)
26 | # returns: BE918C28827E0783D1E5F8E6D7C37A61
27 | params = OrderedDict()
28 | for i in sorted(params_raw):
29 | if params_raw[i] != '':
30 | params[i] = params_raw[i]
31 | newurl = parse.urlencode(params)
32 | newurl += ('&app_key=' + AppKey)
33 | sign = hashlib.md5(newurl.encode("latin1")).hexdigest().upper()
34 | return sign
35 |
36 |
37 | def creat_label(texts,
38 | AppID=AppID,
39 | AppKey=AppKey):
40 | '''
41 | :param texts: list of documents to label
42 | :param AppID: Tencent AI account info, defaults to id_1 in the config file
43 | :param AppKey: Tencent AI account info, defaults to id_1 in the config file
44 | :return: list of labelled results: original document, label, confidence, status
45 | '''
46 |
47 | url = tencent['api']['nlp_textpolar']['url']
48 | results = []
49 | # call the API for each sentence
50 | count_i=0
51 | for one_text in texts:
52 | params = {'app_id': AppID,
53 | 'time_stamp': int(time.time()),
54 | 'nonce_str': ''.join([random.choice('1234567890abcdefghijklmnopqrstuvwxyz') for i in range(10)]),
55 | 'sign': '',
56 | 'text': one_text}
57 | params['sign'] = cal_sign(params_raw=params,
58 | AppKey=AppKey)  # compute the signature
59 | r = requests.post(url=url,
60 | params=params)  # get the analysis result
61 | result = json.loads(r.text)
62 | # print(result)
63 | results.append([one_text,
64 | result['data']['polar'],
65 | result['data']['confd'],
66 | result['ret'],
67 | result['msg']
68 | ])
69 | r.close()
70 | count_i += 1
71 | if count_i % 50 == 0:
72 | print('tencent finish:%d' % (count_i))
73 | return results
74 |
75 |
76 | if __name__ == '__main__':
77 | results = creat_label(texts=['价格便宜啦,比原来优惠多了',
78 | '壁挂效果差,果然一分价钱一分货',
79 | '东西一般般,诶呀',
80 | '讨厌你',
81 | '一般'])
82 | results = pd.DataFrame(results, columns=['evaluation',
83 | 'label',
84 | 'confidence',
85 | 'ret',
86 | 'msg'])
87 | results['label'] = np.where(results['label'] == 1, '正面',
88 | np.where(results['label'] == 0, '中性', '负面'))
89 | print(results)
90 |
--------------------------------------------------------------------------------
/SentimentAnalysis/data/traindata.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/SentimentAnalysis/data/traindata.xlsx
--------------------------------------------------------------------------------
/SentimentAnalysis/data_info.json:
--------------------------------------------------------------------------------
1 | {"maxlen": 383, "label": ["\u4e2d\u6027", "\u6b63\u9762", "\u8d1f\u9762"]}
--------------------------------------------------------------------------------
/SentimentAnalysis/flask_api.py:
--------------------------------------------------------------------------------
1 | # -*- coding:utf-8 -*-
2 |
3 | import pandas as pd
4 | import numpy as np
5 | from SentimentAnalysis.SentimentAnalysis import SentimentAnalysis
6 | import os
7 | from flask import Flask, request
8 | from flask_restful import Resource, Api
9 |
10 | app = Flask(__name__)
11 | app.config.update(RESTFUL_JSON=dict(ensure_ascii=False))
12 | api = Api(app)
13 |
14 | DIR = os.path.dirname(__file__)
15 | class sentiment_analyse(Resource):
16 | def get(self):
17 | model_name = request.args.get('model_name')
18 | text = request.args.get('text')
19 | prob=request.args.get('prob')
20 | '''
21 | model_name='SVM'
22 | text='刚买就降价了桑心'
23 | '''
24 |
25 | model = SentimentAnalysis()
26 | # load the word2vec vocabulary
27 | model.load_vocab_word2vec(vocab_loadpath=DIR + '/models/vocab_word2vec.model')
28 |
29 | if model_name in ['SVM', 'KNN', 'Logistic']:
30 | # load the scikit-learn model
31 | model.load_model(model_loadpath=DIR + '/models/classify.model',
32 | model_name=model_name,
33 | data_info_path=DIR + '/data_info.json')
34 | elif model_name in ['Conv1D_LSTM', 'Conv1D', 'LSTM']:
35 | # load the Keras model
36 | model.load_model(model_loadpath=DIR + '/models/classify.h5',
37 | model_name=model_name,
38 | data_info_path=DIR + '/data_info.json')
39 |
40 | try:
41 | if prob == '1':
42 | # prediction: probabilities
43 | result_prob = model.predict_prob(texts=[text])
44 | result_prob = result_prob.astype(np.float64)
45 | result_prob = pd.DataFrame(result_prob, columns=model.label)
46 | result_prob['predict'] = result_prob.idxmax(axis=1)
47 | result_prob['text'] = [text]
48 | result_prob = result_prob[['text'] + list(model.label) + ['predict']]
49 | result = [{i: result_prob.loc[0, i]} for i in ['text'] + list(model.label) + ['predict']]
50 | else:
51 | # prediction: class
52 | result_classify = model.predict(texts=[text])
53 | result = [{'text': text},{'predict':result_classify[0]}]
54 |
55 | return result
56 | except Exception as e:
57 | return '该文本的词语均不在词库中,无法识别'+' (error: %s)'%e
58 |
59 | #http://192.168.3.59:5000/SentimentAnalyse/?model_name=Conv1D&prob=1&text=东西为什么这么烂
60 | api.add_resource(sentiment_analyse, '/SentimentAnalyse/')
61 |
62 | if __name__ == '__main__':
63 | app.run(debug=True, host='0.0.0.0')
64 |
65 |
66 |
67 |
--------------------------------------------------------------------------------
/SentimentAnalysis/models/__init__.py:
--------------------------------------------------------------------------------
1 | __all__ = ['keras_log_plot',
2 | 'neural_bulit',
3 | 'sklearn_supervised',
4 | 'sklearn_config',
5 | 'parameter']
6 |
--------------------------------------------------------------------------------
/SentimentAnalysis/models/classify.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/SentimentAnalysis/models/classify.h5
--------------------------------------------------------------------------------
/SentimentAnalysis/models/classify.model:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/SentimentAnalysis/models/classify.model
--------------------------------------------------------------------------------
/SentimentAnalysis/models/keras_log_plot.py:
--------------------------------------------------------------------------------
1 | import matplotlib.pyplot as plt
2 |
3 |
4 | def keras_log_plot(train_log=None):
5 | plt.plot(train_log['acc'], label='acc', color='red')
6 | plt.plot(train_log['loss'], label='loss', color='yellow')
7 | if 'val_acc' in train_log.columns:
8 | plt.plot(train_log['val_acc'], label='val_acc', color='green')
9 | if 'val_loss' in train_log.columns:
10 | plt.plot(train_log['val_loss'], label='val_loss', color='blue')
11 | plt.legend()
12 | plt.show()
13 |
--------------------------------------------------------------------------------
/SentimentAnalysis/models/neural_bulit.py:
--------------------------------------------------------------------------------
1 | from keras.models import Sequential
2 | from keras.layers.core import Dense, initializers, Flatten, Dropout, Masking
3 | from keras.layers import Conv1D, InputLayer
4 | from keras.layers.recurrent import LSTM
5 | from keras.layers.pooling import MaxPooling1D
6 | from SentimentAnalysis.models.parameter.optimizers import optimizers
7 |
8 | def neural_bulit(net_shape,
9 | optimizer_name='Adagrad',
10 | lr=0.001,
11 | loss='categorical_crossentropy'):
12 | '''
13 | :param net_shape: network architecture specification, e.g.
14 | net_shape = [
15 | {'name': 'InputLayer','input_shape': [10, 5]},
16 | {'name': 'Dropout','rate': 0.2,},
17 | {'name': 'Masking'},
18 | {'name': 'LSTM','units': 16,'activation': 'tanh','recurrent_activation': 'hard_sigmoid','dropout': 0.,'recurrent_dropout': 0.},
19 | {'name': 'Conv1D','filters': 64,'kernel_size': 3,'strides': 1,'padding': 'same','activation': 'relu'},
20 | {'name': 'MaxPooling1D','pool_size': 5,'padding': 'same','strides': 2},
21 | {'name': 'Flatten'},
22 | {'name': 'Dense','activation': 'relu','units': 64},
23 | {'name': 'softmax','activation': 'softmax','units': 2}
24 | ]
25 | :param optimizer_name: optimizer name
26 | :param lr: learning rate
27 | :param loss: loss function
28 | :return: the compiled Keras model
29 | '''
30 | model = Sequential()
31 |
32 | def add_InputLayer(input_shape,
33 | **param):
34 | model.add(InputLayer(input_shape=input_shape,
35 | **param))
36 |
37 | def add_Dropout(rate=0.2,
38 | **param):
39 | model.add(Dropout(rate=rate,
40 | **param))
41 |
42 | def add_Masking(mask_value=0,
43 | **param):
44 | model.add(Masking(mask_value=mask_value,
45 | **param))
46 |
47 | def add_LSTM(units=16,
48 | activation='tanh',
49 | recurrent_activation='hard_sigmoid',
50 | implementation=1,
51 | dropout=0,
52 | recurrent_dropout=0,
53 | **param):
54 | model.add(LSTM(units=units,
55 | activation=activation,
56 | recurrent_activation=recurrent_activation,
57 | implementation=implementation,
58 | dropout=dropout,
59 | recurrent_dropout=recurrent_dropout,
60 | **param))
61 |
62 | def add_Conv1D(filters=16,  # number of convolution filters
63 | kernel_size=3,  # kernel size, e.g. 3 or [3]
64 | strides=1,
65 | padding='same',
66 | activation='relu',
67 | kernel_initializer=initializers.normal(stddev=0.1),
68 | bias_initializer=initializers.normal(stddev=0.1),
69 | **param):
70 | model.add(Conv1D(filters=filters,
71 | kernel_size=kernel_size,
72 | strides=strides,
73 | padding=padding,
74 | activation=activation,
75 | kernel_initializer=kernel_initializer,
76 | bias_initializer=bias_initializer,
77 | **param))
78 |
79 | def add_MaxPooling1D(pool_size=3,  # pooling window size, e.g. 3 or [3]
80 | strides=1,
81 | padding='same',
82 | **param):
83 | model.add(MaxPooling1D(pool_size=pool_size,
84 | strides=strides,
85 | padding=padding,
86 | **param))
87 |
88 | def add_Flatten(**param):
89 | model.add(Flatten(**param))
90 |
91 | def add_Dense(units=16,
92 | activation='relu',
93 | kernel_initializer=initializers.normal(stddev=0.1),
94 | **param):
95 | model.add(Dense(units=units,
96 | activation=activation,
97 | kernel_initializer=kernel_initializer,
98 | **param))
99 |
100 | for n in range(len(net_shape)):
101 | if net_shape[n]['name'] == 'InputLayer':
102 | del net_shape[n]['name']
103 | add_InputLayer(name='num_' + str(n) + '_InputLayer',
104 | **net_shape[n])
105 | elif net_shape[n]['name'] == 'Dropout':
106 | del net_shape[n]['name']
107 | add_Dropout(name='num_' + str(n) + '_Dropout',
108 | **net_shape[n])
109 | elif net_shape[n]['name'] == 'Masking':
110 | del net_shape[n]['name']
111 | add_Masking(name='num_' + str(n) + '_Masking',
112 | **net_shape[n])
113 | elif net_shape[n]['name'] == 'LSTM':
114 | del net_shape[n]['name']
115 | add_LSTM(name='num_' + str(n) + '_LSTM',
116 | **net_shape[n])
117 | elif net_shape[n]['name'] == 'Conv1D':
118 | del net_shape[n]['name']
119 | add_Conv1D(name='num_' + str(n) + '_Conv1D',
120 | **net_shape[n])
121 | elif net_shape[n]['name'] == 'MaxPooling1D':
122 | del net_shape[n]['name']
123 | add_MaxPooling1D(name='num_' + str(n) + '_MaxPooling1D',
124 | **net_shape[n])
125 | elif net_shape[n]['name'] == 'Flatten':
126 | del net_shape[n]['name']
127 | add_Flatten(name='num_' + str(n) + '_Flatten',
128 | **net_shape[n])
129 | elif net_shape[n]['name'] == 'Dense':
130 | del net_shape[n]['name']
131 | add_Dense(name='num_' + str(n) + '_Dense',
132 | **net_shape[n])
133 | elif net_shape[n]['name'] == 'softmax':
134 | del net_shape[n]['name']
135 | add_Dense(name='num_' + str(n) + '_softmax',
136 | **net_shape[n])
137 |
138 | optimizer = optimizers(name=optimizer_name, lr=lr)
139 | model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
140 |
141 | return model
142 |
143 |
144 | if __name__ == '__main__':
145 | net_shape = [{'name': 'InputLayer',
146 | 'input_shape': [10, 5],
147 | },
148 | {'name': 'Conv1D'
149 | },
150 | {'name': 'MaxPooling1D'
151 | },
152 | {'name': 'Flatten'
153 | },
154 | {'name': 'Dense'
155 | },
156 | {'name': 'Dropout'
157 | },
158 | {'name': 'softmax',
159 | 'activation': 'softmax',
160 | 'units': 3
161 | }
162 | ]
163 | model = neural_bulit(net_shape=net_shape,
164 | optimizer_name='Adagrad',
165 | lr=0.001,
166 | loss='categorical_crossentropy')
167 | model.summary()
168 |
--------------------------------------------------------------------------------
/SentimentAnalysis/models/parameter/__init__.py:
--------------------------------------------------------------------------------
1 | __all__ = ['optimizers']
2 |
--------------------------------------------------------------------------------
/SentimentAnalysis/models/parameter/optimizers.py:
--------------------------------------------------------------------------------
1 | from keras.optimizers import SGD, Adagrad, Adam
2 |
3 |
4 | def optimizers(name='SGD', lr=0.001):
5 | if name == 'SGD':
6 | optimizers_fun = SGD(lr=lr)
7 | elif name == 'Adagrad':
8 | optimizers_fun = Adagrad(lr=lr)
9 | elif name == 'Adam':
10 | optimizers_fun = Adam(lr=lr)
11 |
12 | return optimizers_fun
13 |
--------------------------------------------------------------------------------
/SentimentAnalysis/models/sklearn_config.py:
--------------------------------------------------------------------------------
1 | # from sklearn.neighbors import KNeighborsClassifier
2 | # from sklearn.svm import SVC
3 | # from sklearn.linear_model import LogisticRegression
4 |
5 | SVC = {'kernel': 'linear', # 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'
6 | 'C': 1.0,
7 | 'probability': True}
8 |
9 | KNN = {'n_neighbors': 3} # Number of neighbors to use
10 |
11 | Logistic = {'solver': 'liblinear',
12 | 'penalty': 'l2', # 'l1' or 'l2'
13 | 'C': 1.0} # a positive float,Like in support vector machines, smaller values specify stronger
14 |
--------------------------------------------------------------------------------
/SentimentAnalysis/models/sklearn_supervised.py:
--------------------------------------------------------------------------------
1 | from sklearn.neighbors import KNeighborsClassifier
2 | from sklearn.svm import SVC
3 | from sklearn.linear_model import LogisticRegression
4 | from sklearn.externals import joblib
5 | import os
6 |
7 | DIR = os.path.dirname(__file__)
8 | def sklearn_supervised(data=None,
9 | label=None,
10 | model_savepath=DIR + '/sentence_transform/classify.model',
11 | model_name='SVM',
12 | **sklearn_param):
13 | '''
14 | :param data: training data (feature vectors)
15 | :param label: training labels
16 | :param model_savepath: path to save the trained model
17 | :param model_name: classifier to use: SVM, KNN or Logistic
18 | :return: the trained model
19 | '''
20 |
21 | if model_name == 'KNN':
22 | # KNN classifier; n_neighbors is taken from sklearn_param (default config: 3)
23 | model = KNeighborsClassifier(**sklearn_param).fit(data, label)
24 | elif model_name == 'SVM':
25 | # SVC; kernel and penalty C are taken from sklearn_param (default config: linear kernel, C=1.0)
26 | model = SVC(**sklearn_param)
27 | model.fit(data, label)
28 | elif model_name == 'Logistic':
29 | model = LogisticRegression(**sklearn_param)  # penalty and C are taken from sklearn_param
30 | model.fit(data, label)
31 |
32 | if model_savepath != None:
33 | joblib.dump(model, model_savepath)  # save the model
34 |
35 |
36 | return model
37 |
--------------------------------------------------------------------------------
/SentimentAnalysis/models/vocab_word2vec.model:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/SentimentAnalysis/models/vocab_word2vec.model
--------------------------------------------------------------------------------
/SentimentAnalysis/sentence_transform/__init__.py:
--------------------------------------------------------------------------------
1 | __all__ = ['sentence_2_sparse',
2 | 'creat_vocab_word2vec',
3 | 'sentence_2_tokenizer']
4 |
--------------------------------------------------------------------------------
/SentimentAnalysis/sentence_transform/creat_vocab_word2vec.py:
--------------------------------------------------------------------------------
1 | from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, HashingVectorizer
2 | import pandas as pd
3 | import jieba
4 | from gensim.models import word2vec, doc2vec
5 | import numpy as np
6 | import os
7 |
8 | jieba.setLogLevel('WARN')
9 | DIR = os.path.dirname(__file__)
10 |
11 | def creat_vocab_word2vec(texts=None,
12 | sg=0,
13 | vocab_savepath=DIR + '/vocab_word2vec.model',
14 | size=5,
15 | window=5,
16 | min_count=1):
17 | '''
18 |
19 | :param texts: list of text
20 | :param sg: 0 CBOW,1 skip-gram
21 | :param size: the dimensionality of the feature vectors
22 | :param window: the maximum distance between the current and predicted word within a sentence
23 | :param min_count: ignore all words with total frequency lower than this
24 | :param vocab_savepath: path to save word2vec dictionary
25 | :return: None
26 |
27 | '''
28 | texts_cut = [[word for word in jieba.lcut(one_text) if word != ' '] for one_text in texts]  # word segmentation
29 | # train the word2vec model
30 | model = word2vec.Word2Vec(texts_cut, sg=sg, size=size, window=window, min_count=min_count)
31 | if vocab_savepath != None:
32 | model.save(vocab_savepath)
33 |
34 | return model
35 |
36 |
37 | if __name__ == '__main__':
38 | texts = ['全面从严治党',
39 | '国际公约和国际法',
40 | '中国航天科技集团有限公司']
41 | vocab_word2vec = creat_vocab_word2vec(texts=texts,
42 | sg=0,
43 | vocab_savepath=DIR + '/vocab_word2vec.model',
44 | size=5,
45 | window=5,
46 | min_count=1)
47 |
--------------------------------------------------------------------------------
/SentimentAnalysis/sentence_transform/sentence_2_sparse.py:
--------------------------------------------------------------------------------
1 | from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, HashingVectorizer
2 | import pandas as pd
3 | import jieba
4 |
5 | jieba.setLogLevel('WARN')
6 |
7 |
8 | def sentence_2_sparse(train_data,
9 | test_data=None,
10 | language='Chinese',
11 | hash=True,
12 | hashmodel='CountVectorizer'):
13 | '''
14 |
15 | :param train_data: training set
16 | :param test_data: test set
17 | :param language: language of the texts
18 | :param hash: whether to convert to a sparse/hashed representation
19 | :param hashmodel: vectorizer used for counting
20 | :return: the encoded sparse matrix (train, plus test if provided)
21 | '''
22 | # segment words and convert to a one-hot/count DataFrame
23 | if test_data==None:
24 | if hash == False:
25 | train_data = pd.DataFrame([pd.Series([word for word in jieba.lcut(sample) if word != ' ']).value_counts()
26 | for sample in train_data]).fillna(0)
27 | # Chinese text must be segmented into space-separated words before building the sparse matrix
28 | else:
29 | if language == 'Chinese':
30 | train_data = [' '.join([word for word in jieba.lcut(sample) if word != ' ']) for sample in train_data]
31 | if hashmodel == 'CountVectorizer':  # raw term counts only
32 | count_train = CountVectorizer()
33 | train_data_hashcount = count_train.fit_transform(train_data)  # training data to count matrix
34 | elif hashmodel == 'TfidfTransformer':  # counts followed by tf-idf weighting
35 | count_train = CountVectorizer()
36 | train_data_hashcount = count_train.fit_transform(train_data)  # training data to count matrix
37 | tfidftransformer = TfidfTransformer()
38 | train_data_hashcount = tfidftransformer.fit(train_data_hashcount).transform(train_data_hashcount)
39 | elif hashmodel == 'HashingVectorizer':  # feature hashing
40 | vectorizer = HashingVectorizer(stop_words=None, n_features=10000)
41 | train_data_hashcount = vectorizer.fit_transform(train_data)  # hash training data into a fixed number of features to bound the vocabulary size
42 | return train_data_hashcount
43 | return train_data
44 | else:
45 | # Chinese text must be segmented into space-separated words first; when a test set is given, it must be encoded with the training set's vocabulary
46 | if language == 'Chinese':
47 | train_data = [' '.join([word for word in jieba.lcut(sample) if word != ' ']) for sample in train_data]
48 | test_data = [' '.join([word for word in jieba.lcut(sample) if word != ' ']) for sample in test_data]
49 | if hashmodel == 'CountVectorizer':  # raw term counts only
50 | count_train = CountVectorizer()
51 | train_data_hashcount = count_train.fit_transform(train_data)  # training data to count matrix
52 | count_test = CountVectorizer(vocabulary=count_train.vocabulary_)  # test data reuses the training vocabulary
53 | test_data_hashcount = count_test.fit_transform(test_data)  # test data to count matrix
54 | elif hashmodel == 'TfidfTransformer':  # counts followed by tf-idf weighting
55 | count_train = CountVectorizer()
56 | train_data_hashcount = count_train.fit_transform(train_data)  # training data to count matrix
57 | count_test = CountVectorizer(vocabulary=count_train.vocabulary_)  # test data reuses the training vocabulary
58 | test_data_hashcount = count_test.fit_transform(test_data)  # test data to count matrix
59 | tfidftransformer = TfidfTransformer()
60 | train_data_hashcount = tfidftransformer.fit(train_data_hashcount).transform(train_data_hashcount)
61 | test_data_hashcount = tfidftransformer.fit(test_data_hashcount).transform(test_data_hashcount)
62 | elif hashmodel == 'HashingVectorizer':  # feature hashing
63 | vectorizer = HashingVectorizer(stop_words=None, n_features=10000)
64 | train_data_hashcount = vectorizer.fit_transform(train_data)  # hash training data into a fixed number of features
65 | test_data_hashcount = vectorizer.fit_transform(test_data)  # hash test data the same way
66 | return train_data_hashcount, test_data_hashcount
67 |
68 |
69 | if __name__ == '__main__':
70 | train_data = ['全面从严治党',
71 | '国际公约和国际法',
72 | '中国航天科技集团有限公司']
73 | test_data = ['全面从严测试']
74 | print('train_data\n',train_data,'\ntest_data\n',test_data)
75 | print('sentence_2_sparse(train_data=train_data,hash=False)\n',
76 | sentence_2_sparse(train_data=train_data, hash=False))
77 | print('sentence_2_sparse(train_data=train_data,hash=True)\n',
78 | sentence_2_sparse(train_data=train_data, hash=True))
79 | m,n=sentence_2_sparse(train_data=train_data, test_data=test_data, hash=True)
80 | print('sentence_2_sparse(train_data=train_data,test_data=test_data,hash=True)\n',
81 | 'train_data\n',m,'\ntest_data\n',n)
82 |
--------------------------------------------------------------------------------
/SentimentAnalysis/sentence_transform/sentence_2_tokenizer.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import jieba
3 | from keras.preprocessing.text import Tokenizer
4 |
5 | jieba.setLogLevel('WARN')
6 |
7 |
8 | def sentence_2_tokenizer(train_data,
9 | test_data=None,
10 | num_words=None,
11 | word_index=False):
12 | '''
13 |
14 | :param train_data: training set
15 | :param test_data: test set
16 | :param num_words: vocabulary size; None determines it from the samples
17 | :param word_index: whether to also return the word index
18 | :return: the encoded sequences
19 | '''
20 | train_data = [' '.join([word for word in jieba.lcut(sample) if word != ' ']) for sample in train_data]
21 | test_data = None if test_data is None else [' '.join([word for word in jieba.lcut(sample) if word != ' ']) for sample in test_data]
22 | data = train_data if test_data is None else train_data + test_data
23 | tokenizer = Tokenizer(num_words=num_words)
24 | tokenizer.fit_on_texts(data)
25 | train_data = tokenizer.texts_to_sequences(train_data)
26 | test_data = None if test_data is None else tokenizer.texts_to_sequences(test_data)
27 |
28 | if word_index == False:
29 | if test_data == None:
30 | return train_data
31 |
32 | else:
33 | return train_data, test_data
34 | else:
35 | if test_data == None:
36 | return train_data, tokenizer.word_index
37 |
38 | else:
39 | return train_data, test_data, tokenizer.word_index
40 |
41 |
42 | if __name__ == '__main__':
43 | train_data = ['全面从严治党',
44 | '国际公约和国际法',
45 | '中国航天科技集团有限公司']
46 | test_data = ['全面从严测试']
47 | train_data_vec, test_data_vec, word_index = sentence_2_tokenizer(train_data=train_data,
48 | test_data=test_data,
49 | num_words=None,
50 | word_index=True)
51 | print(train_data, '\n', train_data_vec, '\n',
52 | test_data[0], '\n', test_data_vec, '\n',
53 | 'word_index\n',word_index)
54 |
--------------------------------------------------------------------------------
/demo.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from sklearn.model_selection import train_test_split
4 | import os
5 |
6 | from SentimentAnalysis.SentimentAnalysis import SentimentAnalysis
7 | from SentimentAnalysis.models.keras_log_plot import keras_log_plot
8 |
9 | model = SentimentAnalysis()
10 |
11 | dataset = pd.read_excel(os.getcwd() + '/data/traindata.xlsx', sheet_name=0)
12 | data = dataset['evaluation']
13 | label = dataset['label']
14 | train_data, test_data, train_label, test_label = train_test_split(data,
15 | label,
16 | test_size=0.1,
17 | random_state=42)
18 | test_data = test_data.reset_index(drop=True)
19 | test_label = test_label.reset_index(drop=True)
20 | # build the word2vec vocabulary
21 | model.creat_vocab(texts=train_data,
22 | sg=0,
23 | size=5,
24 | window=5,
25 | min_count=1,
26 | vocab_savepath=None)
27 |
28 | # load the word2vec vocabulary
29 | # model.load_vocab_word2vec(vocab_loadpath=os.getcwd() + '/vocab_word2vec.model')
30 |
31 | ###################################################################################
32 | # machine learning
33 | model.train(texts=train_data,
34 | label=train_label,
35 | model_name='SVM',
36 | model_savepath=os.getcwd() + '/models/classify.model')
37 |
38 | # load the scikit-learn model
39 | model.load_model(model_loadpath=os.getcwd() + '/models/classify.model',
40 | model_name='SVM',
41 | data_info_path=os.getcwd() + '/data_info.json')
42 |
43 | # prediction: probabilities
44 | result_prob = model.predict_prob(texts=test_data)
45 | result_prob = pd.DataFrame(result_prob, columns=model.label)
46 | result_prob['predict'] = result_prob.idxmax(axis=1)
47 | result_prob['data'] = test_data
48 | result_prob = result_prob[['data'] + list(model.label) + ['predict']]
49 | print('prob:\n', result_prob)
50 | print('score:', np.sum(result_prob['predict'] == np.array(test_label)) / len(result_prob['predict']))
51 |
52 | ###################################################################################
53 | # deep learning
54 | model.train(texts=train_data,
55 | label=train_label,
56 | model_name='Conv1D',
57 | batch_size=200,
58 | epochs=20,
59 | verbose=2,
60 | maxlen=None,
61 | model_savepath=os.getcwd() + '/models/classify.h5')
62 |
63 | # load the Keras model
64 | model.load_model(model_loadpath=os.getcwd() + '/models/classify.h5',
65 | model_name='Conv1D',
66 | data_info_path=os.getcwd() + '/data_info.json')
67 |
68 | # prediction: probabilities
69 | result_prob = model.predict_prob(texts=test_data)
70 | result_prob = pd.DataFrame(result_prob, columns=model.label)
71 | result_prob['predict'] = result_prob.idxmax(axis=1)
72 | result_prob['data'] = test_data
73 | result_prob = result_prob[['data'] + list(model.label) + ['predict']]
74 | print('prob:\n', result_prob)
75 | print('score:', np.sum(result_prob['predict'] == np.array(test_label)) / len(result_prob['predict']))
76 |
77 | keras_log_plot(model.train_log)
78 |
--------------------------------------------------------------------------------
/picture/Conv1D.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/picture/Conv1D.png
--------------------------------------------------------------------------------
/picture/SVM.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/picture/SVM.png
--------------------------------------------------------------------------------
/picture/api1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/picture/api1.png
--------------------------------------------------------------------------------
/picture/api2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/picture/api2.png
--------------------------------------------------------------------------------
/picture/api3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/picture/api3.png
--------------------------------------------------------------------------------
/picture/label.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/picture/label.png
--------------------------------------------------------------------------------