├── .gitignore
├── LICENSE
├── README.md
├── SentimentAnalysis
│   ├── SentimentAnalysis.py
│   ├── __init__.py
│   ├── creat_data
│   │   ├── __init__.py
│   │   ├── ali.py
│   │   ├── baidu.py
│   │   ├── bat.py
│   │   ├── config.py
│   │   └── tencent.py
│   ├── data
│   │   └── traindata.xlsx
│   ├── data_info.json
│   ├── flask_api.py
│   ├── models
│   │   ├── __init__.py
│   │   ├── classify.h5
│   │   ├── classify.model
│   │   ├── keras_log_plot.py
│   │   ├── neural_bulit.py
│   │   ├── parameter
│   │   │   ├── __init__.py
│   │   │   └── optimizers.py
│   │   ├── sklearn_config.py
│   │   ├── sklearn_supervised.py
│   │   └── vocab_word2vec.model
│   └── sentence_transform
│       ├── __init__.py
│       ├── creat_vocab_word2vec.py
│       ├── sentence_2_sparse.py
│       └── sentence_2_tokenizer.py
├── demo.py
└── picture
    ├── Conv1D.png
    ├── SVM.png
    ├── api1.png
    ├── api2.png
    ├── api3.png
    └── label.png
/.gitignore:
--------------------------------------------------------------------------------
1 |
2 | .idea/misc.xml
3 | .idea/modules.xml
4 | .idea/Sentiment-analysis.iml
5 | *.xml
6 | *.pyc
7 | try.py
8 | creat_data/config.py
9 | creat_label_mysql.py
10 | SentimentAnalysis/creat_data/bat_1.py
11 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 renjunxiang
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Sentiment-analysis: Sentiment Analysis
2 |
12 |
13 | ## Language
14 | Python 3.5
15 | ## Dependencies
16 | baidu-aip==2.1.0.0
17 | pandas==0.21.0
18 | numpy==1.13.1
19 | jieba==0.39
20 | gensim==3.2.0
21 | scikit-learn==0.19.1
22 | keras==2.1.1
23 | requests==2.18.4
24 |
25 |
26 |
27 | ## Project Overview
28 | * Train on posts that already carry sentiment labels in order to classify new posts; the SentimentAnalysis folder can be used directly as a module.
29 | * On the machine-learning side, KNN, SVM and Logistic regression are wrapped; on the neural-network side, 1D convolution and LSTM. With a training set of 10,000 records, SVM performs best, with accuracy around 87%.
30 | * ***PS: This project is packaged on top of the previous project Text-Classification. There is still plenty of room for improvement; beginners and experts alike are welcome to offer advice!***
31 |
32 | ## Usage
33 | * ### Import the module and create a model
34 | ``` python
35 | from SentimentAnalysis.SentimentAnalysis import SentimentAnalysis
36 | model = SentimentAnalysis()
37 | ```
38 |
39 | * ### Use third-party platforms to create sentiment labels
40 | ``` python
41 | # When no labels are available, build a training set via the Baidu/Ali/Tencent (BAT) APIs; about 45 minutes for 5,000 documents
42 | texts=['国王喜欢吃苹果',
43 | '国王非常喜欢吃苹果',
44 | '国王讨厌吃苹果',
45 | '国王非常讨厌吃苹果']
46 | texts_withlabel=model.creat_label(texts)
47 | ```
48 |
49 | * ### Build the word2vec vocabulary with gensim
50 | ``` python
51 | model.creat_vocab(texts=texts,
52 | sg=0,
53 | size=5,
54 | window=5,
55 | min_count=1,
56 | vocab_savepath=os.getcwd() + '/vocab_word2vec.model')
57 | # alternatively, load an existing word2vec vocabulary
58 | model.load_vocab_word2vec(os.getcwd() + '/models/vocab_word2vec.model')
59 | # the word2vec vocabulary model
60 | model.vocab_word2vec
61 | ```
62 |
63 | * ### Machine learning with scikit-learn
64 | ``` python
65 | model.train(texts=train_data,
66 | label=train_label,
67 | model_name='SVM',
68 | model_savepath=os.getcwd() + '/classify.model')
69 | # alternatively, load a saved scikit-learn model
70 | model.load_model(model_loadpath=os.getcwd() + '/classify.model')
71 | # the trained model
72 | model.model
73 | # training-set labels
74 | model.label
75 | ```
76 |
77 | * ### Deep learning with Keras (note the different model file extension)
78 | ``` python
79 | model.train(texts=train_data,
80 | label=train_label,
81 | model_name='Conv1D',
82 | batch_size=100,
83 | epochs=2,
84 | verbose=1,
85 | maxlen=None,
86 | model_savepath=os.getcwd() + '/classify.h5')
87 |
88 | # load the saved Keras model
89 | model.load_model(model_loadpath=os.getcwd() + '/classify.h5')
90 | # the trained model
91 | model.model
92 | # the training log
93 | model.train_log
94 | # visualize the training process
95 | from SentimentAnalysis.models.keras_log_plot import keras_log_plot
96 | keras_log_plot(model.train_log)
97 | # training-set labels
98 | model.label
99 | ```
100 |
101 | * ### Prediction
102 | ``` python
103 | # probabilities
104 | result_prob = model.predict_prob(texts=test_data)
105 | result_prob = pd.DataFrame(result_prob, columns=model.label)
106 | result_prob['predict'] = result_prob.idxmax(axis=1)
107 | result_prob['data'] = test_data
108 | result_prob = result_prob[['data'] + list(model.label) + ['predict']]
109 | print('prob:\n', result_prob)
110 |
111 | # classes
112 | result = model.predict(texts=test_data)
113 | print('score:', np.sum(result == np.array(test_label)) / len(result))
114 | result = pd.DataFrame({'data': test_data,
115 | 'label': test_label,
116 | 'predict': result},
117 | columns=['data', 'label', 'predict'])
118 | print('test\n', result)
119 | ```
120 |
121 | * ### Start the API
122 | ``` python
123 | # a trained model must exist first
124 | model.open_api()
125 | # http://0.0.0.0:5000/SentimentAnalyse/?model_name=<model name>&prob=<1 for probabilities, 0 for the class>&text=<text to classify>
126 | ```
127 |
128 | * ### Other notes
129 | When the training set is very small, sklearn's probability output (predict_proba) can be unreliable. In particular, SVM sometimes returns identical probabilities for every label. I have not read the source code yet; my guess is that probabilities are not computed properly for samples too close to the hyperplane. predict does not show this problem (see the sketch below).
130 |
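A minimal sketch of the effect, using scikit-learn's SVC directly rather than this project's wrapper (the two-dimensional "averaged word vectors" and labels below are made up purely for illustration). With `probability=True`, SVC fits Platt scaling through internal cross-validation, which needs more data than a toy training set provides, so `predict_proba` can come out near-uniform or disagree with `predict`:

``` python
import numpy as np
from sklearn.svm import SVC

# made-up averaged word vectors, four per class, for illustration only
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.8, 0.1],
              [0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.1, 0.8]])
y = np.array(['正面', '正面', '正面', '正面', '负面', '负面', '负面', '负面'])

model = SVC(kernel='linear', C=1.0, probability=True)
model.fit(X, y)

print(model.predict(X))        # labels come straight from the decision function
print(model.predict_proba(X))  # Platt-scaled probabilities; poorly calibrated on tiny data
```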
131 | ## A simple demo
132 | ``` python
133 | from SentimentAnalysis.SentimentAnalysis import SentimentAnalysis
134 | from SentimentAnalysis.models.keras_log_plot import keras_log_plot
135 | import os, numpy as np, pandas as pd
136 |
137 | train_data = ['国王喜欢吃苹果',
138 | '国王非常喜欢吃苹果',
139 | '国王讨厌吃苹果',
140 | '国王非常讨厌吃苹果']
141 | train_label = ['正面', '正面', '负面', '负面']
142 | # print('train data\n',
143 | # pd.DataFrame({'data': train_data,
144 | # 'label': train_label},
145 | # columns=['data', 'label']))
146 | test_data = ['涛哥喜欢吃苹果',
147 | '涛哥讨厌吃苹果',
148 | '涛哥非常喜欢吃苹果',
149 | '涛哥非常讨厌吃苹果']
150 | test_label = ['正面', '负面', '正面', '负面']
151 |
152 | # create the model
153 | model = SentimentAnalysis()
154 |
155 | # inspect the labels produced by the BAT APIs
156 | print(model.creat_label(test_data))
157 |
158 | # build the word2vec vocabulary
159 | model.creat_vocab(texts=train_data,
160 | sg=0,
161 | size=5,
162 | window=5,
163 | min_count=1,
164 | vocab_savepath=os.getcwd() + '/vocab_word2vec.model')
165 |
166 | # load the word2vec vocabulary
167 | # model.load_vocab_word2vec(vocab_loadpath=os.getcwd() + '/vocab_word2vec.model')
168 |
169 | ###################################################################################
170 | # machine learning
171 | model.train(texts=train_data,
172 | label=train_label,
173 | model_name='SVM',
174 | model_savepath=os.getcwd() + '/classify.model')
175 |
176 | # load the scikit-learn model
177 | # model.load_model(model_loadpath=os.getcwd() + '/classify.model')
178 |
179 | # prediction: probabilities
180 | result_prob = model.predict_prob(texts=test_data)
181 | result_prob = pd.DataFrame(result_prob, columns=model.label)
182 | result_prob['predict'] = result_prob.idxmax(axis=1)
183 | result_prob['data'] = test_data
184 | result_prob = result_prob[['data'] + list(model.label) + ['predict']]
185 | print('prob:\n', result_prob)
186 |
187 | # prediction: classes
188 | result = model.predict(texts=test_data)
189 | print('score:', np.sum(result == np.array(test_label)) / len(result))
190 | result = pd.DataFrame({'data': test_data,
191 | 'label': test_label,
192 | 'predict': result},
193 | columns=['data', 'label', 'predict'])
194 | print('test\n', result)
195 | ###################################################################################
196 | # deep learning
197 | model.train(texts=train_data,
198 | label=train_label,
199 | model_name='Conv1D',
200 | batch_size=100,
201 | epochs=2,
202 | verbose=1,
203 | maxlen=None,
204 | model_savepath=os.getcwd() + '/classify.h5')
205 |
206 | # load the Keras model
207 | # model.load_model(model_loadpath=os.getcwd() + '/classify.h5')
208 |
209 | # prediction: probabilities
210 | result_prob = model.predict_prob(texts=test_data)
211 | result_prob = pd.DataFrame(result_prob, columns=model.label)
212 | result_prob['predict'] = result_prob.idxmax(axis=1)
213 | print(result_prob)
214 |
215 | # prediction: classes
216 | result = model.predict(texts=test_data)
217 | print(result)
218 | print('score:', np.sum(result == np.array(test_label)) / len(result))
219 | result = pd.DataFrame({'data': test_data,
220 | 'label': test_label,
221 | 'predict': result},
222 | columns=['data', 'label', 'predict'])
223 | print('test\n', result)
224 |
225 | keras_log_plot(model.train_log)
226 |
227 | ```
228 | Labels from the BAT APIs
229 | 
230 | SVM
231 | 
232 | Conv1D
233 | 
234 |
235 | ## Simple API
236 | An API is provided: http://192.168.3.59:5000/SentimentAnalyse/?model_name=<model name>&prob=<return probabilities or not>&text=<text to classify>
237 | 192.168.3.59: the IP address, determined by the host running the server
238 | model_name: currently supports SVM and Conv1D
239 | prob: 0 returns the class, 1 returns the probabilities (a request example follows the snippet below)
240 | ``` python
241 | from SentimentAnalysis import SentimentAnalysis
242 | model = SentimentAnalysis()
243 | model.open_api()
244 | ```
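Once the server is up, the endpoint can be queried with a plain GET request. A minimal sketch, assuming the API is reachable on this machine at the default Flask port 5000 (adjust the host to wherever `open_api` is running):

``` python
import requests

# assumes model.open_api() is running and listening on port 5000 of this machine
resp = requests.get('http://127.0.0.1:5000/SentimentAnalyse/',
                    params={'model_name': 'SVM', 'prob': 1, 'text': '东西很不错'})
print(resp.json())  # per-label probabilities plus the predicted class
```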
245 |
246 | ### Examples
247 |
248 | * __SVM model, returning probabilities__
249 | url:http://192.168.3.59:5000/SentimentAnalyse/?model_name=SVM&prob=1&text=东西很不错
250 | 
251 |
252 | * __Conv1D model, returning the class__
253 | url:http://192.168.3.59:5000/SentimentAnalyse/?model_name=Conv1D&prob=0&text=东西很不错
254 | 
255 |
256 | * __None of the words in the text are in the word2vec vocabulary__
257 | url:http://192.168.3.59:5000/SentimentAnalyse/?model_name=SVM&prob=0&text=呜呜呜
258 | 
259 |
260 |
261 |
262 |
263 |
264 |
265 |
266 |
267 |
268 |
269 |
270 |
271 |
272 |
--------------------------------------------------------------------------------
/SentimentAnalysis/SentimentAnalysis.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | import os
4 | from sklearn.externals import joblib
5 | from gensim.models import word2vec
6 | from keras.preprocessing.sequence import pad_sequences
7 | from keras.callbacks import History
8 | from keras.models import load_model
9 | import jieba
10 | import json
11 |
12 | from SentimentAnalysis.creat_data import bat
13 | from SentimentAnalysis.sentence_transform.creat_vocab_word2vec import creat_vocab_word2vec
14 | from SentimentAnalysis.models.sklearn_supervised import sklearn_supervised
15 | from SentimentAnalysis.models import sklearn_config
16 | from SentimentAnalysis.models.neural_bulit import neural_bulit
17 |
18 | jieba.setLogLevel('WARN')
19 | DIR = os.path.dirname(__file__)
20 |
21 | class SentimentAnalysis():
22 | # def __init__(self):
23 | # pass
24 |
25 | # open api
26 | def open_api(self):
27 | os.system("python %s" % (DIR + '/flask_api.py'))
28 | # os.system blocks until the Flask process exits
29 |
30 | # get labels from baidu,ali,tencent
31 | def creat_label(self, texts):
32 | results_dataframe = bat.creat_label(texts)
33 | return results_dataframe
34 |
35 | def creat_vocab(self,
36 | texts=None,
37 | sg=0,
38 | size=5,
39 | window=5,
40 | min_count=1,
41 | vocab_savepath=DIR + '/models/vocab_word2vec.model'):
42 | '''
43 | get dictionary by word2vec
44 | :param texts: list of text
45 | :param sg: 0 CBOW,1 skip-gram
46 | :param size: the dimensionality of the feature vectors
47 | :param window: the maximum distance between the current and predicted word within a sentence
48 | :param min_count: ignore all words with total frequency lower than this
49 | :param vocab_savepath: path to save word2vec dictionary
50 | :return: None
51 | '''
52 | # build the word2vec vocabulary
53 | self.vocab_word2vec = creat_vocab_word2vec(texts=texts,
54 | sg=sg,
55 | vocab_savepath=vocab_savepath,
56 | size=size,
57 | window=window,
58 | min_count=min_count)
59 |
60 | def load_vocab_word2vec(self,
61 | vocab_loadpath=DIR + '/models/vocab_word2vec.model'):
62 | '''
63 | load dictionary
64 | :param vocab_loadpath: path to load word2vec dictionary
65 | :return:
66 | '''
67 | self.vocab_word2vec = word2vec.Word2Vec.load(vocab_loadpath)
68 |
69 | def train(self,
70 | texts=None,
71 | label=None,
72 | model_name='SVM',
73 | model_savepath=DIR + '/models/classify.model',
74 | net_shape=None,
75 | batch_size=100,  # neural-network (Keras) parameter
76 | epochs=2,  # neural-network (Keras) parameter
77 | verbose=2,  # neural-network (Keras) parameter
78 | maxlen=None,  # neural-network (Keras) parameter
79 | **sklearn_param):
80 | '''
81 | use sklearn/keras to train model
82 | :param texts: x
83 | :param label: y
84 | :param model_name: name want to train
85 | :param model_savepath: model save path
86 | :param batch_size: for keras fit
87 | :param epochs: for keras fit
88 | :param verbose: for keras fit
89 | :param maxlen: for keras pad_sequences
90 | :param sklearn_param: param for sklearn
91 | :return: None
92 | '''
93 | self.model_name = model_name
94 | self.label = np.unique(np.array(label))
95 | # convert texts to word vectors
96 | vocab_word2vec = self.vocab_word2vec
97 | texts_cut = [[word for word in jieba.lcut(one_text) if word != ' '] for one_text in texts]  # word segmentation
98 | data = [[vocab_word2vec[word] for word in one_text if word in vocab_word2vec] for one_text in texts_cut]
99 | if maxlen is None:
100 | maxlen = max([len(i) for i in texts_cut])
101 | self.maxlen = maxlen
102 | # sklearn models: average the word vectors of each text
103 | if model_name in ['SVM', 'KNN', 'Logistic']:
104 | data = [sum(i) / len(i) for i in data]
105 | # configure the sklearn model parameters
106 | if model_name == 'SVM':
107 | if sklearn_param == {}:
108 | sklearn_param = sklearn_config.SVC
109 | elif model_name == 'KNN':
110 | if sklearn_param == {}:
111 | sklearn_param = sklearn_config.KNN
112 | elif model_name == 'Logistic':
113 | if sklearn_param == {}:
114 | sklearn_param = sklearn_config.Logistic
115 | # train and keep the sklearn model
116 | self.model = sklearn_supervised(data=data,
117 | label=label,
118 | model_savepath=model_savepath,
119 | model_name=model_name,
120 | **sklearn_param)
121 |
122 | # Keras neural-network models
123 | elif model_name in ['Conv1D_LSTM', 'Conv1D', 'LSTM']:
124 | data = pad_sequences(data, maxlen=maxlen, padding='post', value=0, dtype='float32')
125 | label_transform = np.array(pd.get_dummies(label))
126 | if net_shape is None:
127 | if model_name == 'Conv1D_LSTM':
128 | net_shape = [
129 | {'name': 'InputLayer', 'input_shape': data.shape[1:]},
130 | {'name': 'Conv1D', 'filters': 64, 'kernel_size': 3, 'strides': 1, 'padding': 'same',
131 | 'activation': 'relu'},
132 | {'name': 'MaxPooling1D', 'pool_size': 5, 'padding': 'same', 'strides': 2},
133 | {'name': 'LSTM', 'units': 16, 'activation': 'tanh', 'recurrent_activation': 'hard_sigmoid',
134 | 'dropout': 0., 'recurrent_dropout': 0.},
135 | {'name': 'Flatten'},
136 | {'name': 'Dense', 'activation': 'relu', 'units': 64},
137 | {'name': 'Dropout', 'rate': 0.2, },
138 | {'name': 'softmax', 'activation': 'softmax', 'units': len(np.unique(label))}
139 | ]
140 |
141 | elif model_name == 'LSTM':
142 | net_shape = [
143 | {'name': 'InputLayer', 'input_shape': data.shape[1:]},
144 | {'name': 'Masking'},
145 | {'name': 'LSTM', 'units': 16, 'activation': 'tanh', 'recurrent_activation': 'hard_sigmoid',
146 | 'dropout': 0., 'recurrent_dropout': 0.},
147 | {'name': 'Dense', 'activation': 'relu', 'units': 64},
148 | {'name': 'Dropout', 'rate': 0.2, },
149 | {'name': 'softmax', 'activation': 'softmax', 'units': len(np.unique(label))}
150 | ]
151 | elif model_name == 'Conv1D':
152 | net_shape = [
153 | {'name': 'InputLayer', 'input_shape': data.shape[1:]},
154 | {'name': 'Conv1D', 'filters': 64, 'kernel_size': 3, 'strides': 1, 'padding': 'same',
155 | 'activation': 'relu'},
156 | {'name': 'MaxPooling1D', 'pool_size': 5, 'padding': 'same', 'strides': 2},
157 | {'name': 'Flatten'},
158 | {'name': 'Dense', 'activation': 'relu', 'units': 64},
159 | {'name': 'Dropout', 'rate': 0.2, },
160 | {'name': 'softmax', 'activation': 'softmax', 'units': len(np.unique(label))}
161 | ]
162 |
163 | model = neural_bulit(net_shape=net_shape,
164 | optimizer_name='Adagrad',
165 | lr=0.001,
166 | loss='categorical_crossentropy')
167 | history = History()
168 | model.fit(data, label_transform,
169 | batch_size=batch_size, epochs=epochs, verbose=verbose, callbacks=[history])
170 | train_log = pd.DataFrame(history.history)
171 | self.model = model
172 | self.train_log = train_log
173 | if model_savepath != None:
174 | model.save(model_savepath)
175 | with open(DIR + '/data_info.json', mode='w', encoding='utf-8') as f:
176 | json.dump({'maxlen': maxlen, 'label': list(self.label)}, f)
177 |
178 | def load_model(self,
179 | model_loadpath=DIR + '/models/classify.model',
180 | model_name=None,
181 | data_info_path=DIR + '/data_info.json'):
182 | '''
183 | load sklearn/keras model
184 | :param model_loadpath: path to load sklearn/keras model
185 | :param model_name: load model name
186 | :param data_info_path: date information path
187 | :return: None
188 | '''
189 |
190 | with open(data_info_path, encoding='utf-8') as f:
191 | data_info = json.load(f)
192 | self.maxlen = data_info['maxlen']
193 | self.label = data_info['label']
194 | self.model_name = model_name
195 |
196 | if self.model_name in ['SVM', 'KNN', 'Logistic']:
197 | self.model = joblib.load(model_loadpath)
198 | elif self.model_name in ['Conv1D_LSTM', 'Conv1D', 'LSTM']:
199 | self.model = load_model(model_loadpath)
200 |
201 | def predict_prob(self,
202 | texts=None):
203 | '''
204 | predict probability
205 | :param texts: list of text
206 | :return: list of probability
207 | '''
208 | # convert texts to word vectors
209 | vocab_word2vec = self.vocab_word2vec
210 | if self.model_name in ['SVM', 'KNN', 'Logistic']:
211 | texts_cut = [[word for word in jieba.lcut(one_text) if word != ' '] for one_text in texts]  # word segmentation
212 | data = [[vocab_word2vec[word] for word in one_text if word in vocab_word2vec] for one_text in texts_cut]
213 | data = [sum(i) / len(i) for i in data]
214 | self.testdata = data
215 | results = self.model.predict_proba(data)
216 | elif self.model_name in ['Conv1D_LSTM', 'Conv1D', 'LSTM']:
217 | texts_cut = [[word for word in jieba.lcut(one_text) if word != ' '] for one_text in texts]  # word segmentation
218 | data = [[vocab_word2vec[word] for word in one_text if word in vocab_word2vec] for one_text in texts_cut]
219 | data = pad_sequences(data, maxlen=self.maxlen, padding='post', value=0, dtype='float32')
220 | self.testdata = data
221 | results = self.model.predict(data)
222 | return results
223 |
224 | def predict(self,
225 | texts=None):
226 | '''
227 | predict class
228 | :param texts: list of text
229 | :return: list of classify
230 | '''
231 | # convert texts to word vectors
232 | vocab_word2vec = self.vocab_word2vec
233 | if self.model_name in ['SVM', 'KNN', 'Logistic']:
234 | texts_cut = [[word for word in jieba.lcut(one_text) if word != ' '] for one_text in texts]  # word segmentation
235 | data = [[vocab_word2vec[word] for word in one_text if word in vocab_word2vec] for one_text in texts_cut]
236 | data = [sum(i) / len(i) for i in data]
237 | self.testdata = data
238 | results = self.model.predict(data)
239 | elif self.model_name in ['Conv1D_LSTM', 'Conv1D', 'LSTM']:
240 | texts_cut = [[word for word in jieba.lcut(one_text) if word != ' '] for one_text in texts]  # word segmentation
241 | data = [[vocab_word2vec[word] for word in one_text if word in vocab_word2vec] for one_text in texts_cut]
242 | data = pad_sequences(data, maxlen=self.maxlen, padding='post', value=0, dtype='float32')
243 | self.testdata = data
244 | results = self.model.predict(data)
245 | results = pd.DataFrame(results, columns=self.label)
246 | results = results.idxmax(axis=1)
247 | return results
248 |
249 |
250 | if __name__ == '__main__':
251 | train_data = ['国王喜欢吃苹果',
252 | '国王非常喜欢吃苹果',
253 | '国王讨厌吃苹果',
254 | '国王非常讨厌吃苹果']
255 | train_label = ['正面', '正面', '负面', '负面']
256 | # print('train data\n',
257 | # pd.DataFrame({'data': train_data,
258 | # 'label': train_label},
259 | # columns=['data', 'label']))
260 | test_data = ['涛哥喜欢吃苹果',
261 | '涛哥讨厌吃苹果',
262 | '涛哥非常喜欢吃苹果',
263 | '涛哥非常讨厌吃苹果']
264 | test_label = ['正面', '负面', '正面', '负面']
265 |
266 | # create the model
267 | model = SentimentAnalysis()
268 |
269 | # inspect the labels produced by the BAT APIs
270 | print(model.creat_label(test_data))
271 |
272 | model.creat_vocab(texts=train_data,
273 | sg=0,
274 | size=5,
275 | window=5,
276 | min_count=1,
277 | vocab_savepath=None)
278 |
279 | # load the word2vec vocabulary
280 | # model.load_vocab_word2vec(vocab_loadpath=DIR + '/vocab_word2vec.model')
281 |
282 | ###################################################################################
283 | # machine learning
284 | model.train(texts=train_data,
285 | label=train_label,
286 | model_name='SVM',
287 | model_savepath=DIR + '/models/classify.model')
288 |
289 | # load the scikit-learn model
290 | model.load_model(model_loadpath=DIR + '/models/classify.model',
291 | model_name='SVM',
292 | data_info_path=DIR + '/data_info.json')
293 |
294 | # prediction: probabilities
295 | result_prob = model.predict_prob(texts=test_data)
296 | result_prob = pd.DataFrame(result_prob, columns=model.label)
297 | result_prob['predict'] = result_prob.idxmax(axis=1)
298 | result_prob['data'] = test_data
299 | result_prob = result_prob[['data'] + list(model.label) + ['predict']]
300 | print('prob:\n', result_prob)
301 | print('score:', np.sum(result_prob['predict'] == np.array(test_label)) / len(result_prob['predict']))
302 |
--------------------------------------------------------------------------------
/SentimentAnalysis/__init__.py:
--------------------------------------------------------------------------------
1 | __all__ = ['models','sentence_transform','data','test','creat_data']
2 |
--------------------------------------------------------------------------------
/SentimentAnalysis/creat_data/__init__.py:
--------------------------------------------------------------------------------
1 | __all__ = ['baidu','ali','tencent','bat','config']
2 |
--------------------------------------------------------------------------------
/SentimentAnalysis/creat_data/ali.py:
--------------------------------------------------------------------------------
1 | from SentimentAnalysis.creat_data.config import ali
2 | import datetime
3 | import hashlib
4 | import base64
5 | from urllib.parse import urlparse
6 | import hmac
7 | import pandas as pd
8 | import numpy as np
9 | import requests
10 | import json
11 | import time
12 |
13 | org_code = ali['account']['id_1']['org_code']
14 | akID = ali['account']['id_1']['akID']
15 | akSecret = ali['account']['id_1']['akSecret']
16 |
17 | def creat_label(texts,
18 | org_code=org_code,
19 | akID=akID,
20 | akSecret=akSecret):
21 | '''
22 | :param texts: list of documents to label
23 | :param org_code: Aliyun account info, defaults to id_1 in the config file
24 | :param akID, akSecret: Aliyun account info, defaults to id_1 in the config file
25 | :return: list of labelled results: original document, label, confidence, status
26 | '''
27 | url = org_code.join(ali['api']['Sentiment']['url'].split('{org_code}'))
28 |
29 | results = []
30 |
31 | def to_sha1_base64(stringToSign, akSecret):
32 | hmacsha1 = hmac.new(akSecret.encode('utf-8'),
33 | stringToSign.encode('utf-8'),
34 | hashlib.sha1)
35 | return base64.b64encode(hmacsha1.digest()).decode('utf-8')
36 |
37 | # call the API for each sentence
38 | count_i=0
39 | for one_text in texts:
40 | # one_text = '喜欢'
41 | time_now = datetime.datetime.strftime(datetime.datetime.utcnow(), "%a, %d %b %Y %H:%M:%S GMT")
42 | # time_now = time.strftime("%a, %d %b %Y %H:%M:%S GMT", time.localtime())  # this also works
43 | options = {'url': url,
44 | 'method': 'POST',
45 | 'headers': {'accept': 'application/json',
46 | 'content-type': 'application/json',
47 | 'date': time_now,
48 | 'authorization': ''},
49 | 'body': json.dumps({'text': one_text}, separators=(',', ':'))}
50 |
51 | body = ''
52 | if 'body' in options:
53 | body = options['body']
54 | # print(body)
55 | bodymd5 = ''
56 | if not body == '':
57 | bodymd5 = base64.b64encode(
58 | hashlib.md5(json.dumps({'text': one_text}, separators=(',', ':')).encode('utf-8')).digest()).decode(
59 | 'utf-8')
60 |
61 | # print(bodymd5)
62 |
63 | urlPath = urlparse(url)
64 | if urlPath.query != '':
65 | urlPath = urlPath.path + "?" + urlPath.query
66 | else:
67 | urlPath = urlPath.path
68 | stringToSign = 'POST' + '\n' + \
69 | options['headers']['accept'] + '\n' + \
70 | bodymd5 + '\n' + \
71 | options['headers']['content-type'] + '\n' \
72 | + options['headers']['date'] + '\n' + urlPath
73 |
74 | # print(stringToSign)
75 | signature = to_sha1_base64(stringToSign=stringToSign,
76 | akSecret=akSecret)
77 | # print(signature)
78 | authHeader = 'Dataplus ' + akID + ':' + signature
79 | # print(authHeader)
80 | options['headers']['authorization'] = authHeader
81 | r = requests.post(url=url,
82 | headers={'accept': 'application/json',
83 | 'content-type': 'application/json',
84 | 'date': time_now,
85 | 'authorization': authHeader},
86 | data=json.dumps({'text': one_text}, separators=(',', ':')))  # get the analysis result
87 | try:
88 | result = json.loads(r.text)
89 | # print(result)
90 | results.append([one_text,
91 | result['data']['text_polarity'],
92 | 0,
93 | 'ok'
94 | ])
95 | except:
96 | results.append([one_text,
97 | -100,
98 | -100,
99 | 'error'
100 | ])
101 | count_i += 1
102 | if count_i % 50 == 0:
103 | print('ali finish:%d' % (count_i))
104 | r.close()
105 | return results
106 |
107 |
108 | if __name__ == '__main__':
109 | results = creat_label(texts=['价格便宜啦,比原来优惠多了',
110 | '壁挂效果差,果然一分价钱一分货',
111 | '东西一般般,诶呀',
112 | '快递非常快,电视很惊艳,非常喜欢',
113 | '到货很快,师傅很热情专业。',
114 | '讨厌你',
115 | '一般'
116 | ])
117 | results = pd.DataFrame(results, columns=['evaluation',
118 | 'label',
119 | 'ret',
120 | 'msg'])
121 | results['label'] = np.where(results['label'] == '1', '正面',
122 | np.where(results['label'] == '0', '中性',
123 | np.where(results['label'] == '-1', '负面', '非法')))
124 | print(results)
125 |
126 |
127 |
--------------------------------------------------------------------------------
/SentimentAnalysis/creat_data/baidu.py:
--------------------------------------------------------------------------------
1 | # from aip import AipNlp  # no longer usable
2 | from SentimentAnalysis.creat_data.config import baidu
3 | import pandas as pd
4 | import numpy as np
5 | import json
6 | import requests
7 |
8 | APP_ID = baidu['account']['id_1']['APP_ID']
9 | API_KEY = baidu['account']['id_1']['API_KEY']
10 | SECRET_KEY = baidu['account']['id_1']['SECRET_KEY']
11 |
12 |
13 | # call the API for each sentence
14 | def creat_label(texts,
15 | interface='SDK',
16 | APP_ID=APP_ID,
17 | API_KEY=API_KEY,
18 | SECRET_KEY=SECRET_KEY):
19 | '''
20 | :param texts: list of documents to label
21 | :param interface: access method, 'SDK' or 'API'
22 | :param APP_ID: Baidu AI account info, defaults to id_1 in the config file
23 | :param API_KEY: Baidu AI account info, defaults to id_1 in the config file
24 | :param SECRET_KEY: Baidu AI account info, defaults to id_1 in the config file
25 | :return: list of labelled results: original document, label, confidence, positive/negative probabilities, status
26 | '''
27 | # create the connection
28 |
29 | results = []
30 | if interface == 'SDK':
31 | pass
32 | # client = AipNlp(APP_ID=APP_ID,
33 | # API_KEY=API_KEY,
34 | # SECRET_KEY=SECRET_KEY)
35 | # for one_text in texts:
36 | # result = client.sentimentClassify(one_text)
37 | # if 'error_code' in result:
38 | # results.append([one_text,
39 | # 0,
40 | # 0,
41 | # 0,
42 | # 0,
43 | # result['error_code'],
44 | # result['error_msg']
45 | # ])
46 | # else:
47 | # results.append([one_text,
48 | # result['items'][0]['sentiment'],
49 | # result['items'][0]['confidence'],
50 | # result['items'][0]['positive_prob'],
51 | # result['items'][0]['negative_prob'],
52 | # 0,
53 | # 'ok'
54 | # ])
55 | elif interface == 'API':
56 | # obtain the access_token
57 | url = baidu['access_token_url']
58 | params = {'grant_type': 'client_credentials',
59 | 'client_id': baidu['account']['id_1']['API_KEY'],
60 | 'client_secret': baidu['account']['id_1']['SECRET_KEY']}
61 | r = requests.post(url, params=params)
62 | access_token = json.loads(r.text)['access_token']
63 | r.close()
64 |
65 | url = baidu['api']['sentiment_classify']['url']
66 | params = {'access_token': access_token}
67 | headers = {'Content-Type': baidu['api']['sentiment_classify']['Content-Type']}
68 | count_i=0
69 | for one_text in texts:
70 | data = json.dumps({'text': one_text})
71 | r = requests.post(url=url,
72 | params=params,
73 | headers=headers,
74 | data=data)
75 | result = json.loads(r.text)
76 | if 'error_code' in result:
77 | results.append([one_text,
78 | 0,
79 | 0,
80 | 0,
81 | 0,
82 | result['error_code'],
83 | result['error_msg']
84 | ])
85 | else:
86 | results.append([one_text,
87 | result['items'][0]['sentiment'],
88 | result['items'][0]['confidence'],
89 | result['items'][0]['positive_prob'],
90 | result['items'][0]['negative_prob'],
91 | 0,
92 | 'ok'
93 | ])
94 | count_i += 1
95 | if count_i % 50 == 0:
96 | print('baidu finish:%d' % (count_i))
97 | r.close()
98 | else:
99 | print('ERROR: No interface named %s' % (interface))
100 | return results
101 |
102 |
103 | if __name__ == '__main__':
104 | results = creat_label(texts=['价格便宜啦,比原来优惠多了',
105 | '壁挂效果差,果然一分价钱一分货',
106 | '东西一般般,诶呀',
107 | '讨厌你',
108 | '一般'],
109 | interface='API')
110 | results = pd.DataFrame(results, columns=['evaluation',
111 | 'label',
112 | 'confidence',
113 | 'positive_prob',
114 | 'negative_prob',
115 | 'ret',
116 | 'msg'])
117 | results['label'] = np.where(results['label'] == 2,
118 | '正面',
119 | np.where(results['label'] == 1, '中性', '负面'))
120 | print(results)
121 |
--------------------------------------------------------------------------------
/SentimentAnalysis/creat_data/bat.py:
--------------------------------------------------------------------------------
1 | from SentimentAnalysis.creat_data import baidu, ali, tencent
2 | import pandas as pd
3 | # from collections import OrderedDict
4 | import numpy as np
5 |
6 |
7 | def creat_label(texts):
8 | results = []
9 | count_i = 0
10 | for one_text in texts:
11 | result_baidu = baidu.creat_label([one_text], interface='API')
12 | result_ali = ali.creat_label([one_text])
13 | result_tencent = tencent.creat_label([one_text])
14 |
15 | result_all = [one_text,
16 | result_baidu[0][1], result_baidu[0][6],
17 | result_ali[0][1], result_ali[0][3],
18 | result_tencent[0][1], result_tencent[0][4]]
19 | results.append(result_all)
20 |
21 | # result = OrderedDict()
22 | # result['evaluation'] = result_all[0]
23 | # result['label_baidu'] = result_all[1]
24 | # result['msg_baidu'] = result_all[2]
25 | # result['label_ali'] = result_all[3]
26 | # result['msg_ali'] = result_all[4]
27 | # result['label_tencent'] = result_all[5]
28 | # result['msg_tencent'] = result_all[6]
29 |
30 | count_i += 1
31 | if count_i % 50 == 0:
32 | print('bat finish:%d' % (count_i))
33 |
34 | results_dataframe = pd.DataFrame(results,
35 | columns=['evaluation',
36 | 'label_baidu', 'msg_baidu',
37 | 'label_ali', 'msg_ali',
38 | 'label_tencent', 'msg_tencent'])
39 | results_dataframe['label_baidu'] = np.where(results_dataframe['label_baidu'] == 2,
40 | '正面',
41 | np.where(results_dataframe['label_baidu'] == 1, '中性', '负面'))
42 | results_dataframe['label_ali'] = np.where(results_dataframe['label_ali'] == '1', '正面',
43 | np.where(results_dataframe['label_ali'] == '0', '中性',
44 | np.where(results_dataframe['label_ali'] == '-1', '负面', '非法')))
45 | results_dataframe['label_tencent'] = np.where(results_dataframe['label_tencent'] == 1, '正面',
46 | np.where(results_dataframe['label_tencent'] == 0, '中性', '负面'))
47 | return results_dataframe
48 |
49 |
50 | if __name__ == '__main__':
51 | print(creat_label(['价格便宜啦,比原来优惠多了',
52 | '壁挂效果差,果然一分价钱一分货',
53 | '东西一般般,诶呀',
54 | '讨厌你',
55 | '一般'
56 | ]))
57 |
--------------------------------------------------------------------------------
/SentimentAnalysis/creat_data/config.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 | baidu = {'access_token_url': 'https://aip.baidubce.com/oauth/2.0/token',
4 | 'api': {
5 | 'sentiment_classify': {
6 | 'Content-Type': 'application/json',
7 | 'url': 'https://aip.baidubce.com/rpc/2.0/nlp/v1/sentiment_classify'}},
8 | 'account': {
9 | 'id_1': {'APP_ID': '000',
10 | 'API_KEY': '000',
11 | 'SECRET_KEY': '000'},
12 | 'id_2': {'APP_ID': '000',
13 | 'API_KEY': '000',
14 | 'SECRET_KEY': '000'}}
15 | }
16 |
17 | tencent = {'api': {
18 | 'nlp_textpolar': {
19 | 'url': 'https://api.ai.qq.com/fcgi-bin/nlp/nlp_textpolar'}},
20 | 'account': {
21 | 'id_1': {'APP_ID': '000',
22 | 'AppKey': '000'},
23 | 'id_2': {'APP_ID': '000',
24 | 'API_KEY': '000'}}
25 | }
26 |
27 | ali = {'api': {
28 | 'Sentiment': {
29 | 'url': 'https://dtplus-cn-shanghai.data.aliyuncs.com/{org_code}/nlp/api/Sentiment/ecommerce'}},
30 | 'account': {
31 | 'id_1': {'org_code': '000',
32 | 'akID': '000',
33 | 'akSecret': '000'
34 | },
35 | 'id_2': {'org_code': '000',
36 | 'akID': '000',
37 | 'akSecret': '000'
38 | }}
39 | }
40 |
41 | label_path=os.getcwd()
42 |
43 |
--------------------------------------------------------------------------------
/SentimentAnalysis/creat_data/tencent.py:
--------------------------------------------------------------------------------
1 | from SentimentAnalysis.creat_data.config import tencent
2 | import pandas as pd
3 | import numpy as np
4 | import requests
5 | import json
6 | import time
7 | import random
8 | import hashlib
9 | from urllib import parse
10 | from collections import OrderedDict
11 |
12 | AppID = tencent['account']['id_1']['APP_ID']
13 | AppKey = tencent['account']['id_1']['AppKey']
14 |
15 | def cal_sign(params_raw,AppKey=AppKey):
16 | # the official docs give a PHP example; this is the Python equivalent
17 | # params_raw = {'app_id': '10000',
18 | # 'time_stamp': '1493449657',
19 | # 'nonce_str': '20e3408a79',
20 | # 'key1': '腾讯AI开放平台',
21 | # 'key2': '示例仅供参考',
22 | # 'sign': ''}
23 | # AppKey = 'a95eceb1ac8c24ee28b70f7dbba912bf'
24 | # cal_sign(params_raw=params_raw,
25 | # AppKey=AppKey)
26 | # returns: BE918C28827E0783D1E5F8E6D7C37A61
27 | params = OrderedDict()
28 | for i in sorted(params_raw):
29 | if params_raw[i] != '':
30 | params[i] = params_raw[i]
31 | newurl = parse.urlencode(params)
32 | newurl += ('&app_key=' + AppKey)
33 | sign = hashlib.md5(newurl.encode("latin1")).hexdigest().upper()
34 | return sign
35 |
36 |
37 | def creat_label(texts,
38 | AppID=AppID,
39 | AppKey=AppKey):
40 | '''
41 | :param texts: list of documents to label
42 | :param AppID: Tencent AI account info, defaults to id_1 in the config file
43 | :param AppKey: Tencent AI account info, defaults to id_1 in the config file
44 | :return: list of labelled results: original document, label, confidence, status
45 | '''
46 |
47 | url = tencent['api']['nlp_textpolar']['url']
48 | results = []
49 | # call the API for each sentence
50 | count_i=0
51 | for one_text in texts:
52 | params = {'app_id': AppID,
53 | 'time_stamp': int(time.time()),
54 | 'nonce_str': ''.join([random.choice('1234567890abcdefghijklmnopqrstuvwxyz') for i in range(10)]),
55 | 'sign': '',
56 | 'text': one_text}
57 | params['sign'] = cal_sign(params_raw=params,
58 | AppKey=AppKey)  # compute the signature
59 | r = requests.post(url=url,
60 | params=params)  # get the analysis result
61 | result = json.loads(r.text)
62 | # print(result)
63 | results.append([one_text,
64 | result['data']['polar'],
65 | result['data']['confd'],
66 | result['ret'],
67 | result['msg']
68 | ])
69 | r.close()
70 | count_i += 1
71 | if count_i % 50 == 0:
72 | print('tencent finish:%d' % (count_i))
73 | return results
74 |
75 |
76 | if __name__ == '__main__':
77 | results = creat_label(texts=['价格便宜啦,比原来优惠多了',
78 | '壁挂效果差,果然一分价钱一分货',
79 | '东西一般般,诶呀',
80 | '讨厌你',
81 | '一般'])
82 | results = pd.DataFrame(results, columns=['evaluation',
83 | 'label',
84 | 'confidence',
85 | 'ret',
86 | 'msg'])
87 | results['label'] = np.where(results['label'] == 1, '正面',
88 | np.where(results['label'] == 0, '中性', '负面'))
89 | print(results)
90 |
--------------------------------------------------------------------------------
/SentimentAnalysis/data/traindata.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/SentimentAnalysis/data/traindata.xlsx
--------------------------------------------------------------------------------
/SentimentAnalysis/data_info.json:
--------------------------------------------------------------------------------
1 | {"maxlen": 383, "label": ["\u4e2d\u6027", "\u6b63\u9762", "\u8d1f\u9762"]}
--------------------------------------------------------------------------------
/SentimentAnalysis/flask_api.py:
--------------------------------------------------------------------------------
1 | # -*- coding:utf-8 -*-
2 |
3 | import pandas as pd
4 | import numpy as np
5 | from SentimentAnalysis.SentimentAnalysis import SentimentAnalysis
6 | import os
7 | from flask import Flask, request
8 | from flask_restful import Resource, Api
9 |
10 | app = Flask(__name__)
11 | app.config.update(RESTFUL_JSON=dict(ensure_ascii=False))
12 | api = Api(app)
13 |
14 | DIR = os.path.dirname(__file__)
15 | class sentiment_analyse(Resource):
16 | def get(self):
17 | model_name = request.args.get('model_name')
18 | text = request.args.get('text')
19 | prob=request.args.get('prob')
20 | '''
21 | model_name='SVM'
22 | text='刚买就降价了桑心'
23 | '''
24 |
25 | model = SentimentAnalysis()
26 | # load the word2vec vocabulary
27 | model.load_vocab_word2vec(vocab_loadpath=DIR + '/models/vocab_word2vec.model')
28 |
29 | if model_name in ['SVM', 'KNN', 'Logistic']:
30 | # load the scikit-learn model
31 | model.load_model(model_loadpath=DIR + '/models/classify.model',
32 | model_name=model_name,
33 | data_info_path=DIR + '/data_info.json')
34 | elif model_name in ['Conv1D_LSTM', 'Conv1D', 'LSTM']:
35 | # load the Keras model
36 | model.load_model(model_loadpath=DIR + '/models/classify.h5',
37 | model_name=model_name,
38 | data_info_path=DIR + '/data_info.json')
39 |
40 | try:
41 | if prob == '1':
42 | # prediction: probabilities
43 | result_prob = model.predict_prob(texts=[text])
44 | result_prob = result_prob.astype(np.float64)
45 | result_prob = pd.DataFrame(result_prob, columns=model.label)
46 | result_prob['predict'] = result_prob.idxmax(axis=1)
47 | result_prob['text'] = [text]
48 | result_prob = result_prob[['text'] + list(model.label) + ['predict']]
49 | result = [{i: result_prob.loc[0, i]} for i in ['text'] + list(model.label) + ['predict']]
50 | else:
51 | # prediction: class
52 | result_classify = model.predict(texts=[text])
53 | result = [{'text': text},{'predict':result_classify[0]}]
54 |
55 | return result
56 | except Exception as e:
57 | return '该文本的词语均不在词库中,无法识别'+' (error: %s)'%e
58 |
59 | #http://192.168.3.59:5000/SentimentAnalyse/?model_name=Conv1D&prob=1&text=东西为什么这么烂
60 | api.add_resource(sentiment_analyse, '/SentimentAnalyse/')
61 |
62 | if __name__ == '__main__':
63 | app.run(debug=True, host='0.0.0.0')
64 |
65 |
66 |
67 |
--------------------------------------------------------------------------------
/SentimentAnalysis/models/__init__.py:
--------------------------------------------------------------------------------
1 | __all__ = ['keras_log_plot',
2 | 'neural_bulit',
3 | 'sklearn_supervised',
4 | 'sklearn_config',
5 | 'parameter']
6 |
--------------------------------------------------------------------------------
/SentimentAnalysis/models/classify.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/SentimentAnalysis/models/classify.h5
--------------------------------------------------------------------------------
/SentimentAnalysis/models/classify.model:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/SentimentAnalysis/models/classify.model
--------------------------------------------------------------------------------
/SentimentAnalysis/models/keras_log_plot.py:
--------------------------------------------------------------------------------
1 | import matplotlib.pyplot as plt
2 |
3 |
4 | def keras_log_plot(train_log=None):
5 | plt.plot(train_log['acc'], label='acc', color='red')
6 | plt.plot(train_log['loss'], label='loss', color='yellow')
7 | if 'val_acc' in train_log.columns:
8 | plt.plot(train_log['val_acc'], label='val_acc', color='green')
9 | if 'val_loss' in train_log.columns:
10 | plt.plot(train_log['val_loss'], label='val_loss', color='blue')
11 | plt.legend()
12 | plt.show()
13 |
--------------------------------------------------------------------------------
/SentimentAnalysis/models/neural_bulit.py:
--------------------------------------------------------------------------------
1 | from keras.models import Sequential
2 | from keras.layers.core import Dense, initializers, Flatten, Dropout, Masking
3 | from keras.layers import Conv1D, InputLayer
4 | from keras.layers.recurrent import LSTM
5 | from keras.layers.pooling import MaxPooling1D
6 | from SentimentAnalysis.models.parameter.optimizers import optimizers
7 |
8 | def neural_bulit(net_shape,
9 | optimizer_name='Adagrad',
10 | lr=0.001,
11 | loss='categorical_crossentropy'):
12 | '''
13 | :param net_shape: network architecture specification, e.g.
14 | net_shape = [
15 | {'name': 'InputLayer','input_shape': [10, 5]},
16 | {'name': 'Dropout','rate': 0.2,},
17 | {'name': 'Masking'},
18 | {'name': 'LSTM','units': 16,'activation': 'tanh','recurrent_activation': 'hard_sigmoid','dropout': 0.,'recurrent_dropout': 0.},
19 | {'name': 'Conv1D','filters': 64,'kernel_size': 3,'strides': 1,'padding': 'same','activation': 'relu'},
20 | {'name': 'MaxPooling1D','pool_size': 5,'padding': 'same','strides': 2},
21 | {'name': 'Flatten'},
22 | {'name': 'Dense','activation': 'relu','units': 64},
23 | {'name': 'softmax','activation': 'softmax','units': 2}
24 | ]
25 | :param optimizer_name: optimizer name
26 | :param lr: learning rate
27 | :param loss: loss function
28 | :return: the compiled Keras model
29 | '''
30 | model = Sequential()
31 |
32 | def add_InputLayer(input_shape,
33 | **param):
34 | model.add(InputLayer(input_shape=input_shape,
35 | **param))
36 |
37 | def add_Dropout(rate=0.2,
38 | **param):
39 | model.add(Dropout(rate=rate,
40 | **param))
41 |
42 | def add_Masking(mask_value=0,
43 | **param):
44 | model.add(Masking(mask_value=mask_value,
45 | **param))
46 |
47 | def add_LSTM(units=16,
48 | activation='tanh',
49 | recurrent_activation='hard_sigmoid',
50 | implementation=1,
51 | dropout=0,
52 | recurrent_dropout=0,
53 | **param):
54 | model.add(LSTM(units=units,
55 | activation=activation,
56 | recurrent_activation=recurrent_activation,
57 | implementation=implementation,
58 | dropout=dropout,
59 | recurrent_dropout=recurrent_dropout,
60 | **param))
61 |
62 | def add_Conv1D(filters=16,  # number of convolution filters
63 | kernel_size=3,  # kernel size, e.g. 3 or [3]
64 | strides=1,
65 | padding='same',
66 | activation='relu',
67 | kernel_initializer=initializers.normal(stddev=0.1),
68 | bias_initializer=initializers.normal(stddev=0.1),
69 | **param):
70 | model.add(Conv1D(filters=filters,
71 | kernel_size=kernel_size,
72 | strides=strides,
73 | padding=padding,
74 | activation=activation,
75 | kernel_initializer=kernel_initializer,
76 | bias_initializer=bias_initializer,
77 | **param))
78 |
79 | def add_MaxPooling1D(pool_size=3,  # pooling window size, e.g. 3 or [3]
80 | strides=1,
81 | padding='same',
82 | **param):
83 | model.add(MaxPooling1D(pool_size=pool_size,
84 | strides=strides,
85 | padding=padding,
86 | **param))
87 |
88 | def add_Flatten(**param):
89 | model.add(Flatten(**param))
90 |
91 | def add_Dense(units=16,
92 | activation='relu',
93 | kernel_initializer=initializers.normal(stddev=0.1),
94 | **param):
95 | model.add(Dense(units=units,
96 | activation=activation,
97 | kernel_initializer=kernel_initializer,
98 | **param))
99 |
100 | for n in range(len(net_shape)):
101 | if net_shape[n]['name'] == 'InputLayer':
102 | del net_shape[n]['name']
103 | add_InputLayer(name='num_' + str(n) + '_InputLayer',
104 | **net_shape[n])
105 | elif net_shape[n]['name'] == 'Dropout':
106 | del net_shape[n]['name']
107 | add_Dropout(name='num_' + str(n) + '_Dropout',
108 | **net_shape[n])
109 | elif net_shape[n]['name'] == 'Masking':
110 | del net_shape[n]['name']
111 | add_Masking(name='num_' + str(n) + '_Masking',
112 | **net_shape[n])
113 | elif net_shape[n]['name'] == 'LSTM':
114 | del net_shape[n]['name']
115 | add_LSTM(name='num_' + str(n) + '_LSTM',
116 | **net_shape[n])
117 | elif net_shape[n]['name'] == 'Conv1D':
118 | del net_shape[n]['name']
119 | add_Conv1D(name='num_' + str(n) + '_Conv1D',
120 | **net_shape[n])
121 | elif net_shape[n]['name'] == 'MaxPooling1D':
122 | del net_shape[n]['name']
123 | add_MaxPooling1D(name='num_' + str(n) + '_MaxPooling1D',
124 | **net_shape[n])
125 | elif net_shape[n]['name'] == 'Flatten':
126 | del net_shape[n]['name']
127 | add_Flatten(name='num_' + str(n) + '_Flatten',
128 | **net_shape[n])
129 | elif net_shape[n]['name'] == 'Dense':
130 | del net_shape[n]['name']
131 | add_Dense(name='num_' + str(n) + '_Dense',
132 | **net_shape[n])
133 | elif net_shape[n]['name'] == 'softmax':
134 | del net_shape[n]['name']
135 | add_Dense(name='num_' + str(n) + '_softmax',
136 | **net_shape[n])
137 |
138 | optimizer = optimizers(name=optimizer_name, lr=lr)
139 | model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
140 |
141 | return model
142 |
143 |
144 | if __name__ == '__main__':
145 | net_shape = [{'name': 'InputLayer',
146 | 'input_shape': [10, 5],
147 | },
148 | {'name': 'Conv1D'
149 | },
150 | {'name': 'MaxPooling1D'
151 | },
152 | {'name': 'Flatten'
153 | },
154 | {'name': 'Dense'
155 | },
156 | {'name': 'Dropout'
157 | },
158 | {'name': 'softmax',
159 | 'activation': 'softmax',
160 | 'units': 3
161 | }
162 | ]
163 | model = neural_bulit(net_shape=net_shape,
164 | optimizer_name='Adagrad',
165 | lr=0.001,
166 | loss='categorical_crossentropy')
167 | model.summary()
168 |
--------------------------------------------------------------------------------
/SentimentAnalysis/models/parameter/__init__.py:
--------------------------------------------------------------------------------
1 | __all__ = ['optimizers']
2 |
--------------------------------------------------------------------------------
/SentimentAnalysis/models/parameter/optimizers.py:
--------------------------------------------------------------------------------
1 | from keras.optimizers import SGD, Adagrad, Adam
2 |
3 |
4 | def optimizers(name='SGD', lr=0.001):
5 | if name == 'SGD':
6 | optimizers_fun = SGD(lr=lr)
7 | elif name == 'Adagrad':
8 | optimizers_fun = Adagrad(lr=lr)
9 | elif name == 'Adam':
10 | optimizers_fun = Adam(lr=lr)
11 |
12 | return optimizers_fun
13 |
--------------------------------------------------------------------------------
/SentimentAnalysis/models/sklearn_config.py:
--------------------------------------------------------------------------------
1 | # from sklearn.neighbors import KNeighborsClassifier
2 | # from sklearn.svm import SVC
3 | # from sklearn.linear_model import LogisticRegression
4 |
5 | SVC = {'kernel': 'linear', # 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'
6 | 'C': 1.0,
7 | 'probability': True}
8 |
9 | KNN = {'n_neighbors': 3} # Number of neighbors to use
10 |
11 | Logistic = {'solver': 'liblinear',
12 | 'penalty': 'l2', # 'l1' or 'l2'
13 | 'C': 1.0} # a positive float,Like in support vector machines, smaller values specify stronger
14 |
--------------------------------------------------------------------------------
/SentimentAnalysis/models/sklearn_supervised.py:
--------------------------------------------------------------------------------
1 | from sklearn.neighbors import KNeighborsClassifier
2 | from sklearn.svm import SVC
3 | from sklearn.linear_model import LogisticRegression
4 | from sklearn.externals import joblib
5 | import os
6 |
7 | DIR = os.path.dirname(__file__)
8 | def sklearn_supervised(data=None,
9 | label=None,
10 | model_savepath=DIR + '/sentence_transform/classify.model',
11 | model_name='SVM',
12 | **sklearn_param):
13 | '''
14 | :param data: training data (feature vectors)
15 | :param label: training labels
16 | :param model_savepath: path to save the trained model
17 | :param model_name: classifier to use: SVM, KNN or Logistic
18 | :return: the trained model
19 | '''
20 |
21 | if model_name == 'KNN':
22 | # KNN classifier; n_neighbors is taken from sklearn_param (default config: 3)
23 | model = KNeighborsClassifier(**sklearn_param).fit(data, label)
24 | elif model_name == 'SVM':
25 | # SVC; kernel and penalty C are taken from sklearn_param (default config: linear kernel, C=1.0)
26 | model = SVC(**sklearn_param)
27 | model.fit(data, label)
28 | elif model_name == 'Logistic':
29 | model = LogisticRegression(**sklearn_param)  # penalty and C are taken from sklearn_param
30 | model.fit(data, label)
31 |
32 | if model_savepath != None:
33 | joblib.dump(model, model_savepath)  # save the model
34 |
35 |
36 | return model
37 |
--------------------------------------------------------------------------------
/SentimentAnalysis/models/vocab_word2vec.model:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/SentimentAnalysis/models/vocab_word2vec.model
--------------------------------------------------------------------------------
/SentimentAnalysis/sentence_transform/__init__.py:
--------------------------------------------------------------------------------
1 | __all__ = ['sentence_2_sparse',
2 | 'creat_vocab_word2vec',
3 | 'sentence_2_tokenizer']
4 |
--------------------------------------------------------------------------------
/SentimentAnalysis/sentence_transform/creat_vocab_word2vec.py:
--------------------------------------------------------------------------------
1 | from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, HashingVectorizer
2 | import pandas as pd
3 | import jieba
4 | from gensim.models import word2vec, doc2vec
5 | import numpy as np
6 | import os
7 |
8 | jieba.setLogLevel('WARN')
9 | DIR = os.path.dirname(__file__)
10 |
11 | def creat_vocab_word2vec(texts=None,
12 | sg=0,
13 | vocab_savepath=DIR + '/vocab_word2vec.model',
14 | size=5,
15 | window=5,
16 | min_count=1):
17 | '''
18 |
19 | :param texts: list of text
20 | :param sg: 0 CBOW,1 skip-gram
21 | :param size: the dimensionality of the feature vectors
22 | :param window: the maximum distance between the current and predicted word within a sentence
23 | :param min_count: ignore all words with total frequency lower than this
24 | :param vocab_savepath: path to save word2vec dictionary
25 | :return: None
26 |
27 | '''
28 | texts_cut = [[word for word in jieba.lcut(one_text) if word != ' '] for one_text in texts]  # word segmentation
29 | # train the word2vec model
30 | model = word2vec.Word2Vec(texts_cut, sg=sg, size=size, window=window, min_count=min_count)
31 | if vocab_savepath != None:
32 | model.save(vocab_savepath)
33 |
34 | return model
35 |
36 |
37 | if __name__ == '__main__':
38 | texts = ['全面从严治党',
39 | '国际公约和国际法',
40 | '中国航天科技集团有限公司']
41 | vocab_word2vec = creat_vocab_word2vec(texts=texts,
42 | sg=0,
43 | vocab_savepath=DIR + '/vocab_word2vec.model',
44 | size=5,
45 | window=5,
46 | min_count=1)
47 |
--------------------------------------------------------------------------------
/SentimentAnalysis/sentence_transform/sentence_2_sparse.py:
--------------------------------------------------------------------------------
1 | from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, HashingVectorizer
2 | import pandas as pd
3 | import jieba
4 |
5 | jieba.setLogLevel('WARN')
6 |
7 |
8 | def sentence_2_sparse(train_data,
9 | test_data=None,
10 | language='Chinese',
11 | hash=True,
12 | hashmodel='CountVectorizer'):
13 | '''
14 |
15 | :param train_data: training set
16 | :param test_data: test set
17 | :param language: language of the texts
18 | :param hash: whether to convert to a sparse/hashed representation
19 | :param hashmodel: vectorizer used for counting
20 | :return: the encoded sparse matrix (train, plus test if provided)
21 | '''
22 | # segment words and convert to a one-hot/count DataFrame
23 | if test_data==None:
24 | if hash == False:
25 | train_data = pd.DataFrame([pd.Series([word for word in jieba.lcut(sample) if word != ' ']).value_counts()
26 | for sample in train_data]).fillna(0)
27 | # Chinese text must be segmented into space-separated words before building the sparse matrix
28 | else:
29 | if language == 'Chinese':
30 | train_data = [' '.join([word for word in jieba.lcut(sample) if word != ' ']) for sample in train_data]
31 | if hashmodel == 'CountVectorizer':  # raw term counts only
32 | count_train = CountVectorizer()
33 | train_data_hashcount = count_train.fit_transform(train_data)  # training data to count matrix
34 | elif hashmodel == 'TfidfTransformer':  # counts followed by tf-idf weighting
35 | count_train = CountVectorizer()
36 | train_data_hashcount = count_train.fit_transform(train_data)  # training data to count matrix
37 | tfidftransformer = TfidfTransformer()
38 | train_data_hashcount = tfidftransformer.fit(train_data_hashcount).transform(train_data_hashcount)
39 | elif hashmodel == 'HashingVectorizer':  # feature hashing
40 | vectorizer = HashingVectorizer(stop_words=None, n_features=10000)
41 | train_data_hashcount = vectorizer.fit_transform(train_data)  # hash training data into a fixed number of features to bound the vocabulary size
42 | return train_data_hashcount
43 | return train_data
44 | else:
45 | # Chinese text must be segmented into space-separated words first; when a test set is given, it must be encoded with the training set's vocabulary
46 | if language == 'Chinese':
47 | train_data = [' '.join([word for word in jieba.lcut(sample) if word != ' ']) for sample in train_data]
48 | test_data = [' '.join([word for word in jieba.lcut(sample) if word != ' ']) for sample in test_data]
49 | if hashmodel == 'CountVectorizer':  # raw term counts only
50 | count_train = CountVectorizer()
51 | train_data_hashcount = count_train.fit_transform(train_data)  # training data to count matrix
52 | count_test = CountVectorizer(vocabulary=count_train.vocabulary_)  # test data reuses the training vocabulary
53 | test_data_hashcount = count_test.fit_transform(test_data)  # test data to count matrix
54 | elif hashmodel == 'TfidfTransformer':  # counts followed by tf-idf weighting
55 | count_train = CountVectorizer()
56 | train_data_hashcount = count_train.fit_transform(train_data)  # training data to count matrix
57 | count_test = CountVectorizer(vocabulary=count_train.vocabulary_)  # test data reuses the training vocabulary
58 | test_data_hashcount = count_test.fit_transform(test_data)  # test data to count matrix
59 | tfidftransformer = TfidfTransformer()
60 | train_data_hashcount = tfidftransformer.fit(train_data_hashcount).transform(train_data_hashcount)
61 | test_data_hashcount = tfidftransformer.fit(test_data_hashcount).transform(test_data_hashcount)
62 | elif hashmodel == 'HashingVectorizer':  # feature hashing
63 | vectorizer = HashingVectorizer(stop_words=None, n_features=10000)
64 | train_data_hashcount = vectorizer.fit_transform(train_data)  # hash training data into a fixed number of features
65 | test_data_hashcount = vectorizer.fit_transform(test_data)  # hash test data the same way
66 | return train_data_hashcount, test_data_hashcount
67 |
68 |
69 | if __name__ == '__main__':
70 | train_data = ['全面从严治党',
71 | '国际公约和国际法',
72 | '中国航天科技集团有限公司']
73 | test_data = ['全面从严测试']
74 | print('train_data\n',train_data,'\ntest_data\n',test_data)
75 | print('sentence_2_sparse(train_data=train_data,hash=False)\n',
76 | sentence_2_sparse(train_data=train_data, hash=False))
77 | print('sentence_2_sparse(train_data=train_data,hash=True)\n',
78 | sentence_2_sparse(train_data=train_data, hash=True))
79 | m,n=sentence_2_sparse(train_data=train_data, test_data=test_data, hash=True)
80 | print('sentence_2_sparse(train_data=train_data,test_data=test_data,hash=True)\n',
81 | 'train_data\n',m,'\ntest_data\n',n)
82 |
--------------------------------------------------------------------------------
/SentimentAnalysis/sentence_transform/sentence_2_tokenizer.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import jieba
3 | from keras.preprocessing.text import Tokenizer
4 |
5 | jieba.setLogLevel('WARN')
6 |
7 |
8 | def sentence_2_tokenizer(train_data,
9 | test_data=None,
10 | num_words=None,
11 | word_index=False):
12 | '''
13 |
14 | :param train_data: training set
15 | :param test_data: test set
16 | :param num_words: vocabulary size; None determines it from the samples
17 | :param word_index: whether to also return the word index
18 | :return: the encoded sequences
19 | '''
20 | train_data = [' '.join([word for word in jieba.lcut(sample) if word != ' ']) for sample in train_data]
21 | test_data = None if test_data is None else [' '.join([word for word in jieba.lcut(sample) if word != ' ']) for sample in test_data]
22 | data = train_data if test_data is None else train_data + test_data
23 | tokenizer = Tokenizer(num_words=num_words)
24 | tokenizer.fit_on_texts(data)
25 | train_data = tokenizer.texts_to_sequences(train_data)
26 | test_data = None if test_data is None else tokenizer.texts_to_sequences(test_data)
27 |
28 | if word_index == False:
29 | if test_data == None:
30 | return train_data
31 |
32 | else:
33 | return train_data, test_data
34 | else:
35 | if test_data == None:
36 | return train_data, tokenizer.word_index
37 |
38 | else:
39 | return train_data, test_data, tokenizer.word_index
40 |
41 |
42 | if __name__ == '__main__':
43 | train_data = ['全面从严治党',
44 | '国际公约和国际法',
45 | '中国航天科技集团有限公司']
46 | test_data = ['全面从严测试']
47 | train_data_vec, test_data_vec, word_index = sentence_2_tokenizer(train_data=train_data,
48 | test_data=test_data,
49 | num_words=None,
50 | word_index=True)
51 | print(train_data, '\n', train_data_vec, '\n',
52 | test_data[0], '\n', test_data_vec, '\n',
53 | 'word_index\n',word_index)
54 |
--------------------------------------------------------------------------------
/demo.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 | from sklearn.model_selection import train_test_split
4 | import os
5 |
6 | from SentimentAnalysis.SentimentAnalysis import SentimentAnalysis
7 | from SentimentAnalysis.models.keras_log_plot import keras_log_plot
8 |
9 | model = SentimentAnalysis()
10 |
11 | dataset = pd.read_excel(os.getcwd() + '/data/traindata.xlsx', sheet_name=0)
12 | data = dataset['evaluation']
13 | label = dataset['label']
14 | train_data, test_data, train_label, test_label = train_test_split(data,
15 | label,
16 | test_size=0.1,
17 | random_state=42)
18 | test_data = test_data.reset_index(drop=True)
19 | test_label = test_label.reset_index(drop=True)
20 | # build the word2vec vocabulary
21 | model.creat_vocab(texts=train_data,
22 | sg=0,
23 | size=5,
24 | window=5,
25 | min_count=1,
26 | vocab_savepath=None)
27 |
28 | # load the word2vec vocabulary
29 | # model.load_vocab_word2vec(vocab_loadpath=os.getcwd() + '/vocab_word2vec.model')
30 |
31 | ###################################################################################
32 | # machine learning
33 | model.train(texts=train_data,
34 | label=train_label,
35 | model_name='SVM',
36 | model_savepath=os.getcwd() + '/models/classify.model')
37 |
38 | # load the scikit-learn model
39 | model.load_model(model_loadpath=os.getcwd() + '/models/classify.model',
40 | model_name='SVM',
41 | data_info_path=os.getcwd() + '/data_info.json')
42 |
43 | # prediction: probabilities
44 | result_prob = model.predict_prob(texts=test_data)
45 | result_prob = pd.DataFrame(result_prob, columns=model.label)
46 | result_prob['predict'] = result_prob.idxmax(axis=1)
47 | result_prob['data'] = test_data
48 | result_prob = result_prob[['data'] + list(model.label) + ['predict']]
49 | print('prob:\n', result_prob)
50 | print('score:', np.sum(result_prob['predict'] == np.array(test_label)) / len(result_prob['predict']))
51 |
52 | ###################################################################################
53 | # deep learning
54 | model.train(texts=train_data,
55 | label=train_label,
56 | model_name='Conv1D',
57 | batch_size=200,
58 | epochs=20,
59 | verbose=2,
60 | maxlen=None,
61 | model_savepath=os.getcwd() + '/models/classify.h5')
62 |
63 | # load the Keras model
64 | model.load_model(model_loadpath=os.getcwd() + '/models/classify.h5',
65 | model_name='Conv1D',
66 | data_info_path=os.getcwd() + '/data_info.json')
67 |
68 | # prediction: probabilities
69 | result_prob = model.predict_prob(texts=test_data)
70 | result_prob = pd.DataFrame(result_prob, columns=model.label)
71 | result_prob['predict'] = result_prob.idxmax(axis=1)
72 | result_prob['data'] = test_data
73 | result_prob = result_prob[['data'] + list(model.label) + ['predict']]
74 | print('prob:\n', result_prob)
75 | print('score:', np.sum(result_prob['predict'] == np.array(test_label)) / len(result_prob['predict']))
76 |
77 | keras_log_plot(model.train_log)
78 |
--------------------------------------------------------------------------------
/picture/Conv1D.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/picture/Conv1D.png
--------------------------------------------------------------------------------
/picture/SVM.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/picture/SVM.png
--------------------------------------------------------------------------------
/picture/api1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/picture/api1.png
--------------------------------------------------------------------------------
/picture/api2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/picture/api2.png
--------------------------------------------------------------------------------
/picture/api3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/picture/api3.png
--------------------------------------------------------------------------------
/picture/label.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Sentiment-analysis/c6cb5594d2784472f193a4b6633f155ae1919cf8/picture/label.png
--------------------------------------------------------------------------------