├── .gitignore
├── README.md
├── data
│   └── input
│       ├── training_set.txt
│       └── validation_set.txt
├── docs
│   └── images
│       ├── history_cnn_matrix.png
│       ├── history_lstm_avg.png
│       ├── history_lstm_fasttext.png
│       ├── history_lstm_matrix.png
│       ├── history_nn_avg.png
│       └── history_nn_fasttext.png
├── requirements.txt
└── src
    ├── preprocessing.py
    ├── sentiment_analysis_dl.py
    └── sentiment_analysis_ml.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
**/*.pyc
**/__pycache__
.idea/

data/input/models/
data/output/
src/scripts/

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Sentiment Analysis Implementations
**Chinese Sentiment Analysis** of movie-review data, implemented with both ML (Machine Learning) and DL (Deep Learning) algorithms: Naive Bayes, Decision Tree, KNN, SVM, NN (MLP), CNN, and RNN (LSTM).

## 0. Requirements
All code in this project is implemented in [Python 3.6+](https://www.python.org/downloads/).
All essential packages are listed in `requirements.txt`; you can install them with `pip install -r requirements.txt -i https://pypi.douban.com/simple/`.
[Anaconda](https://docs.anaconda.com/anaconda/) or [virtualenv + virtualenvwrapper](http://www.jianshu.com/p/44ab75fbaef2) is strongly recommended for managing your Python environments.

## 1. Data Preparation
**1) Dataset**
Movie reviews are used as the dataset: 20,000 training samples (10,000 positive, 10,000 negative) and 6,000 test samples (3,000 positive, 3,000 negative).

**2) Preprocessing**
1. Remove stop words and segment the text with jieba.
2. Vectorize each sentence with a pre-trained word vector model (a minimal sketch follows).
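The preprocessing script itself is not shipped in this repository, so here is a minimal sketch of what step 2 could look like (illustrative only: the stop-word list `stopwords.txt` is an assumed external file, and the model path mirrors the one hard-coded in `src/preprocessing.py`):

```python
import jieba
import numpy as np
from pyfasttext import FastText

# Assumed external stop-word list, one word per line.
stopwords = set(line.strip() for line in open("stopwords.txt", encoding="utf-8"))
model = FastText("../data/input/models/880w_fasttext_skip_gram.bin")

def sentence_to_avg_vector(text):
    """Segment with jieba, drop stop words, and average the word vectors."""
    words = [w for w in jieba.cut(text) if w.strip() and w not in stopwords]
    vectors = [model.get_numpy_vector(w) for w in words]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.args.get("dim"))

print(sentence_to_avg_vector("这部电影真的很好看").shape)  # (200,) with the bundled model
```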
## 2. Accuracy Comparison
| Algorithm | Accuracy (avg) | Accuracy (fasttext) | Accuracy (matrix) | Notes |
| :---: | :---: | :---: | :---: | :---: |
| Naive Bayes | 73.72% | 74.32% | 69.34% (concatenate & pad) | / |
| Decision Tree | 65.27% | 66.84% | 55.34% (concatenate & pad) | / |
| KNN | 76.69% ({'n_neighbors': 19}) | 77.43% ({'n_neighbors': 17}) | / (concatenate & pad) | Hyperparameters chosen with GridSearchCV |
| SVM | 79.29% ({'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}) | 78.93% ({'C': 1000, 'kernel': 'linear'}) | / (concatenate & pad) | Hyperparameters chosen with GridSearchCV |
| NN (MLP) | 80.24% | 80.41% | / | Uses EarlyStopping, ModelCheckpoint, ReduceLROnPlateau |
| CNN | / | / | 81.34% | Uses EarlyStopping, ModelCheckpoint, ReduceLROnPlateau |
| LSTM | 78.76% | 77.26% | 84.06% | Uses EarlyStopping, ModelCheckpoint, ReduceLROnPlateau |

## 3. Acc-Loss Curves
1) NN (MLP):
Sentences represented as the average of their word vectors:
![history_nn_avg.png](docs/images/history_nn_avg.png)
Sentences represented with `fasttext.get_numpy_sentence_vector()`:
![history_nn_fasttext.png](docs/images/history_nn_fasttext.png)
2) CNN:
Sentences represented as a matrix of `fasttext.get_numpy_vector()` word vectors:
![history_cnn_matrix.png](docs/images/history_cnn_matrix.png)
3) LSTM:
Sentences represented as the average of their word vectors:
![history_lstm_avg.png](docs/images/history_lstm_avg.png)
Sentences represented with `fasttext.get_numpy_sentence_vector()`:
![history_lstm_fasttext.png](docs/images/history_lstm_fasttext.png)
Sentences represented as a matrix of `fasttext.get_numpy_vector()` word vectors:
![history_lstm_matrix.png](docs/images/history_lstm_matrix.png)

--------------------------------------------------------------------------------
/docs/images/history_cnn_matrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxw0109/ChineseSentimentAnalysis/1084e8ab0060df9c389cd4586fc40527d09cecfd/docs/images/history_cnn_matrix.png

--------------------------------------------------------------------------------
/docs/images/history_lstm_avg.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxw0109/ChineseSentimentAnalysis/1084e8ab0060df9c389cd4586fc40527d09cecfd/docs/images/history_lstm_avg.png

--------------------------------------------------------------------------------
/docs/images/history_lstm_fasttext.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxw0109/ChineseSentimentAnalysis/1084e8ab0060df9c389cd4586fc40527d09cecfd/docs/images/history_lstm_fasttext.png

--------------------------------------------------------------------------------
/docs/images/history_lstm_matrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxw0109/ChineseSentimentAnalysis/1084e8ab0060df9c389cd4586fc40527d09cecfd/docs/images/history_lstm_matrix.png

--------------------------------------------------------------------------------
/docs/images/history_nn_avg.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxw0109/ChineseSentimentAnalysis/1084e8ab0060df9c389cd4586fc40527d09cecfd/docs/images/history_nn_avg.png

--------------------------------------------------------------------------------
/docs/images/history_nn_fasttext.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxw0109/ChineseSentimentAnalysis/1084e8ab0060df9c389cd4586fc40527d09cecfd/docs/images/history_nn_fasttext.png

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
absl-py==0.2.2
astor==0.6.2
backcall==0.1.0
bleach==1.5.0
boto==2.48.0
boto3==1.7.42
botocore==1.10.42
bz2file==0.98
certifi==2018.4.16
chardet==3.0.4
cycler==0.10.0
cysignals==1.7.1
Cython==0.28.3
decorator==4.3.0
docutils==0.14
future==0.16.0
gast==0.2.0
gensim==3.4.0
grpcio==1.12.1
h5py==2.8.0
html5lib==0.9999999
idna==2.7
ipython==6.4.0
ipython-genutils==0.2.0
jedi==0.12.0
jieba==0.39
jmespath==0.9.3
Keras==2.2.0
Keras-Applications==1.0.2
Keras-Preprocessing==1.0.1
kiwisolver==1.0.1
Markdown==2.6.11
matplotlib==2.2.2
numpy==1.14.5
pandas==0.23.1
parso==0.2.1
pexpect==4.6.0
pickleshare==0.7.4
prompt-toolkit==1.0.15
protobuf==3.6.0
ptyprocess==0.5.2
pyfasttext==0.4.5
Pygments==2.2.0
pyparsing==2.2.0
python-dateutil==2.7.3
pytz==2018.4
PyYAML==3.12
requests==2.19.1
s3transfer==0.1.13
scikit-learn==0.19.1
scipy==1.1.0
seaborn==0.8.1
simplegeneric==0.8.1
six==1.11.0
sklearn==0.0
smart-open==1.5.7
tensorboard==1.8.0
tensorflow==1.8.0
termcolor==1.1.0
traitlets==4.3.2
urllib3==1.23
wcwidth==0.1.7
Werkzeug==0.14.1
xgboost==0.71
--------------------------------------------------------------------------------
/src/preprocessing.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# coding: utf-8
# File: preprocessing.py
# Author: lxw
# Date: 12/20/17 8:10 AM


import numpy as np
import pandas as pd
import time

from keras.preprocessing import sequence
from collections import Counter
from pyfasttext import FastText


class Preprocessing:
    def __init__(self):
        start_time = time.time()
        # self.model = FastText("../data/input/models/sg_pyfasttext.bin")  # DEBUG
        self.model = FastText("../data/input/models/880w_fasttext_skip_gram.bin")
        end_time = time.time()
        print(f"Loading word vector model cost: {end_time - start_time:.2f}s")

        # self.vocab_size, self.vector_size = self.model.numpy_normalized_vectors.shape  # OK
        self.vocab_size = self.model.nwords
        self.vector_size = self.model.args.get("dim")
        # self.vector_size: 200, self.vocab_size: 925242
        print(f"self.vector_size:{self.vector_size}, self.vocab_size: {self.vocab_size}")

        # Sentence representations:
        # {"avg": mean of the word vectors, "fasttext": get_numpy_sentence_vector,
        #  "concatenate": word vectors concatenated and padded, "matrix": matrix of word vectors}
        self.sentence_vec_type = "matrix"

        self.MAX_SENT_LEN = 70  # DEBUG: hyperparameter, cf. self.get_sent_max_length()
        # For "concatenate", self.MAX_SENT_LEN = 30. Accuracy for other values -- 100: 50.22%, 80: 50.23%, 70: 50.33%, 60: 55.92%, 50: 69.11%, 40: 68.91%, 36: 69.34%, 30: 69.22%, 20: 69.17%, 10: 67.07%
        # For "matrix", self.MAX_SENT_LEN = 70. Accuracy for other values -- TODO:
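    # Resulting sentence-vector shapes for reference (assuming vector_size == 200 and
    # MAX_SENT_LEN == 70, as with the bundled model; not part of the original code):
    #   "avg"         -> (200,)             mean of the word vectors
    #   "fasttext"    -> (200,)             model.get_numpy_sentence_vector()
    #   "concatenate" -> (n_words * 200,)   variable length; padded later to (70 * 200,)
    #   "matrix"      -> (70, 200)          one row per word, padded with -1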

    @classmethod
    def data_analysis(cls):
        # Imported locally: matplotlib and seaborn are only needed for this optional analysis.
        import matplotlib.pyplot as plt
        import seaborn as sns

        train_df = pd.read_csv("../data/input/training_set.txt", sep="\t", header=None, names=["label", "sentence"])
        val_df = pd.read_csv("../data/input/validation_set.txt", sep="\t", header=None, names=["label", "sentence"])
        y_train = train_df["label"]
        y_val = val_df["label"]
        sns.set(style="white", context="notebook", palette="deep")
        # Inspect the sample distribution (whether the labels are balanced).
        sns.countplot(y_train)
        plt.show()
        sns.countplot(y_val)
        plt.show()
        print(y_train.value_counts())
        print(y_val.value_counts())

    def set_sent_vec_type(self, sentence_vec_type):
        assert sentence_vec_type in ["avg", "concatenate", "fasttext", "matrix"], \
            "sentence_vec_type must be in ['avg', 'fasttext', 'concatenate', 'matrix']"
        self.sentence_vec_type = sentence_vec_type

    def get_sent_max_length(self):  # NOT_USED
        sent_len_counter = Counter()
        max_length = 0
        with open("../data/input/training_set.txt") as f:
            for line in f:
                content = line.strip().split("\t")[1]
                content_list = content.split()
                length = len(content_list)
                sent_len_counter[length] += 1
                if max_length <= length:
                    max_length = length
        sent_len_counter = sorted(list(sent_len_counter.items()), key=lambda x: x[0])
        print(sent_len_counter)
        # [(31, 1145), (32, 1105), (33, 1017), (34, 938), (35, 839), (36, 830), (37, 775), (38, 737), (39, 720), (40, 643), (41, 575), (42, 584), (43, 517), (44, 547), (45, 514), (46, 514), (47, 480), (48, 460), (49, 470), (50, 444), (51, 484), (52, 432), (53, 462), (54, 495), (55, 487), (56, 500), (57, 496), (58, 489), (59, 419), (60, 387), (61, 348), (62, 265), (63, 222), (64, 153), (65, 127), (66, 103), (67, 67), (68, 34), (69, 21), (70, 22), (71, 8), (72, 6), (73, 4), (74, 10), (75, 2), (76, 4), (77, 2), (78, 1), (79, 2), (80, 4), (81, 2), (82, 3), (83, 1), (84, 5), (86, 4), (87, 3), (88, 3), (89, 2), (90, 2), (91, 3), (92, 5), (93, 2), (94, 4), (96, 1), (97, 5), (98, 1), (99, 2), (100, 2), (101, 2), (102, 1), (103, 2), (104, 2), (105, 2), (106, 5), (107, 3), (108, 2), (109, 3), (110, 4), (111, 1), (112, 2), (113, 3), (114, 1), (116, 1), (119, 3), (679, 1)]
        return max_length

    def gen_sentence_vec(self, sentence):
        """
        Convert a whitespace-segmented sentence into its vector representation.
        :param sentence: a segmented sentence (tokens separated by spaces)
        :return: the sentence vector (shape depends on self.sentence_vec_type)
        """
        sentence = sentence.strip()
        if self.sentence_vec_type == "fasttext":
            return self.model.get_numpy_sentence_vector(sentence)

        word_list = sentence.split(" ")
        if self.sentence_vec_type == "concatenate":
            sentence_vector = self.model.get_numpy_vector(word_list[0])
            for word in word_list[1:]:
                sentence_vector = np.hstack((sentence_vector, self.model.get_numpy_vector(word)))
            return sentence_vector  # NOTE: for "concatenate", sentence vectors vary in length from sentence to sentence
        elif self.sentence_vec_type == "matrix":  # for Deep Learning.
            sentence_matrix = []
            for word in word_list[-self.MAX_SENT_LEN:]:  # NOTE: keeping the tail of the sentence seems to work better (cf. https://github.com/lxw0109/SentimentClassification_UMICH_SI650/blob/master/src/LSTM_wo_pretrained_vector.py#L86)
                sentence_matrix.append(self.model.get_numpy_vector(word))
            length = len(sentence_matrix)
            # Always holds, because of the slicing above.
            assert length <= self.MAX_SENT_LEN, "CRITICAL ERROR: len(sentence_matrix) > self.MAX_SENT_LEN."
            # np.pad takes the list of ndarrays and returns a 2-D ndarray. Padding with -1
            # lets the LSTM's Masking layer (mask_value=-1) skip the padded rows.
            sentence_matrix = np.pad(sentence_matrix, pad_width=((0, self.MAX_SENT_LEN - length), (0, 0)),
                                     mode="constant", constant_values=-1)
            return sentence_matrix
        else:  # self.sentence_vec_type == "avg"
            sentence_vector = np.zeros(self.vector_size)
            # print(f"type(sentence_vector): {type(sentence_vector)}")
            for idx, word in enumerate(word_list):
                # print(f"type(self.model.get_numpy_vector(word)): {type(self.model.get_numpy_vector(word))}")
                sentence_vector += self.model.get_numpy_vector(word)
            return sentence_vector / len(word_list)

    def gen_train_val_data(self):
        # Build the training & validation data.
        train_df = pd.read_csv("../data/input/training_set.txt", sep="\t", header=None, names=["label", "sentence"])
        val_df = pd.read_csv("../data/input/validation_set.txt", sep="\t", header=None, names=["label", "sentence"])
        # Shuffle the training set. TODO: without shuffling, the trained model seems broken
        # (the positive demo sentence always predicts 1?).
        train_df = train_df.sample(frac=1, random_state=1)
        # val_df = val_df.sample(frac=1, random_state=1)  # no need to shuffle the validation set

        X_train = train_df["sentence"]
        X_train_vec = list()
        for sentence in X_train:
            sent_vector = self.gen_sentence_vec(sentence)
            X_train_vec.append(sent_vector)
        y_train = train_df["label"]

        X_val = val_df["sentence"]
        X_val_vec = list()
        for sentence in X_val:
            sent_vector = self.gen_sentence_vec(sentence)
            X_val_vec.append(sent_vector)
        y_val = val_df["label"]

        if self.sentence_vec_type == "concatenate":
            # NOTE: dtype is required here; the default is "int32", which would truncate
            # every word-vector value to 0.
            X_train_vec = sequence.pad_sequences(X_train_vec, maxlen=self.MAX_SENT_LEN * self.vector_size, value=0,
                                                 dtype=np.float)
            X_val_vec = sequence.pad_sequences(X_val_vec, maxlen=self.MAX_SENT_LEN * self.vector_size, value=0,
                                               dtype=np.float)

        return np.array(X_train_vec), np.array(X_val_vec), np.array(y_train), np.array(y_val)
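    # Quick illustration of the dtype pitfall noted above (hypothetical REPL session,
    # not part of the original module):
    #   >>> sequence.pad_sequences([[0.7, -0.2]], maxlen=4)
    #   array([[0, 0, 0, 0]], dtype=int32)   # default dtype="int32" truncates floats to 0
    #   >>> sequence.pad_sequences([[0.7, -0.2]], maxlen=4, dtype=np.float)
    #   array([[ 0. ,  0. ,  0.7, -0.2]])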

if __name__ == "__main__":
    preprocess_obj = Preprocessing()
    """
    preprocess_obj.get_sent_max_length()
    sentence = "刘晓伟 好人"  # Should the "fasttext" branch of gen_sentence_vec() also expect space-separated input like this?
    # sentence = "刘晓伟好人"  # NOTE: yields a different vector than the space-separated form
    preprocess_obj.set_sent_vec_type("fasttext")
    print(f'fasttext: {preprocess_obj.gen_sentence_vec(sentence)}')
    preprocess_obj.set_sent_vec_type("avg")
    print(f'avg: {preprocess_obj.gen_sentence_vec(sentence)}')
    preprocess_obj.set_sent_vec_type("concatenate")
    print(f'concatenate: {preprocess_obj.gen_sentence_vec(sentence)}')
    """

    X_train, X_val, y_train, y_val = preprocess_obj.gen_train_val_data()
    # print(f"X_train: {X_train}\nX_val: {X_val}\ny_train: {y_train}\ny_val: {y_val}")
    print(f"X_train.shape: {X_train.shape}\nX_val.shape: {X_val.shape}\n"
          f"y_train.shape: {y_train.shape}\ny_val.shape: {y_val.shape}")

    # Preprocessing.data_analysis()

--------------------------------------------------------------------------------
/src/sentiment_analysis_dl.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# coding: utf-8
# File: sentiment_analysis_dl.py
# Author: lxw
# Date: 12/21/17 9:23 AM

"""
Sentiment classification based on pre-trained word vectors and deep learning (Keras).
"""

import numpy as np
import pandas as pd
import pickle
import time

from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint
from keras.callbacks import ReduceLROnPlateau
from keras.layers import Conv1D
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import GlobalAveragePooling1D
from keras.layers import LSTM
from keras.layers import Masking
from keras.layers import MaxPooling1D
from keras.models import Sequential
from keras.preprocessing import sequence
from keras.utils import np_utils

from preprocessing import Preprocessing


class SentimentAnalysis:
    def __init__(self, preprocess_obj, sent_vec_type):
        self.model_path_prefix = "../data/output/models/"
        self.algorithm_name = "nn"
        self.model_path = f"{self.model_path_prefix}{self.algorithm_name}_{sent_vec_type}"
        self.preprocess_obj = preprocess_obj
        self.sent_vec_type = sent_vec_type
        self.batch_size = 512  # TODO
        self.epochs = 1000  # TODO

    def pick_algorithm(self, algorithm_name, sent_vec_type):
        assert algorithm_name in ["nn", "cnn", "lstm"], "algorithm_name must be in ['nn', 'cnn', 'lstm']"
        self.algorithm_name = algorithm_name
        self.model_path = f"{self.model_path_prefix}{self.algorithm_name}_{sent_vec_type}"
        self.sent_vec_type = sent_vec_type

    def model_build(self, input_shape):
        model_cls = Sequential()
        if self.algorithm_name == "nn":  # Neural Network (Multi-Layer Perceptron)
            # An explicit activation is essential for Dense; otherwise a linear activation is used.
            model_cls.add(Dense(64, input_shape=input_shape, activation="relu", name="dense1"))
            model_cls.add(Dropout(0.25, name="dropout2"))
            model_cls.add(Dense(64, activation="relu", name="dense3"))
            model_cls.add(Dropout(0.25, name="dropout4"))
            model_cls.add(Dense(2, activation="softmax", name="dense5"))
            # model_cls.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])  # TODO:
            model_cls.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
        elif self.algorithm_name == "cnn":
            # For Conv1D, input_shape is (steps, channels): steps corresponds to the sentence
            # length (MAX_SENT_LEN) and channels to the word-vector dimension.
            model_cls.add(Conv1D(64, 3, activation="relu", input_shape=input_shape))  # filters, kernel_size
            model_cls.add(Conv1D(64, 3, activation="relu"))
            model_cls.add(MaxPooling1D(3))
            model_cls.add(Conv1D(128, 3, activation="relu"))
            model_cls.add(Conv1D(128, 3, activation="relu"))
            model_cls.add(GlobalAveragePooling1D())
            model_cls.add(Dropout(0.25))
            model_cls.add(Dense(2, activation="sigmoid"))

            model_cls.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])  # TODO: categorical_crossentropy
        elif self.algorithm_name == "lstm":
            model_cls.add(Masking(mask_value=-1, input_shape=input_shape, name="masking_layer"))
            model_cls.add(LSTM(units=64, return_sequences=True, dropout=0.25, name="lstm1"))
            model_cls.add(LSTM(units=128, return_sequences=False, dropout=0.25, name="lstm2"))
            model_cls.add(Dense(units=2, activation="softmax", name="dense5"))

            model_cls.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

        return model_cls
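    # Input shapes expected by each branch above (reference derived from the __main__
    # flow at the bottom of this file; dim == 200 for the bundled fastText model):
    #   nn   + avg/fasttext: input_shape == (dim,)              X: (n_samples, dim)
    #   cnn  + matrix:       input_shape == (MAX_SENT_LEN, dim) X: (n_samples, 70, dim)
    #   lstm + matrix:       input_shape == (MAX_SENT_LEN, dim) X: (n_samples, 70, dim)
    #   lstm + avg/fasttext: input_shape == (1, dim)            X: (n_samples, 1, dim), one timestep per sentence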

    def model_train(self, model_cls, X_train, X_val, y_train, y_val):
        """
        Train the classifier.
        :param model_cls: the compiled but untrained model
        :param X_train:
        :param y_train:
        :param X_val:
        :param y_val:
        :return: the trained model
        """
        early_stopping = EarlyStopping(monitor="val_loss", patience=10)
        lr_reduction = ReduceLROnPlateau(monitor="val_loss", patience=5, verbose=1, factor=0.2, min_lr=1e-5)
        # Checkpoint the best model: save whenever val_loss improves, one file per improvement.
        # model_path = f"../data/output/models/{self.algorithm_name}_best_model_{epoch:02d}_{val_loss:.2f}.hdf5"  # NO: an f-string is evaluated immediately
        model_path = "../data/output/models/best_model_{epoch:02d}_{val_loss:.2f}.hdf5"  # OK: Keras fills in the template
        checkpoint = ModelCheckpoint(filepath=model_path, monitor="val_loss", verbose=1, save_best_only=True,
                                     mode="min")

        hist_obj = model_cls.fit(X_train, y_train, batch_size=self.batch_size, epochs=self.epochs, verbose=1,
                                 validation_data=(X_val, y_val), callbacks=[early_stopping, lr_reduction, checkpoint])
        with open(f"../data/output/history_{self.algorithm_name}_{self.sent_vec_type}.pkl", "wb") as f:
            pickle.dump(hist_obj.history, f)
        return model_cls  # model
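    # A saved checkpoint can later be restored with Keras' standard loader
    # (illustrative; the actual filename depends on the epoch/val_loss reached):
    #   from keras.models import load_model
    #   model = load_model("../data/output/models/best_model_07_0.42.hdf5")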

    def plot_hist(self, history_filename):
        import matplotlib.pyplot as plt

        history = None
        with open(f"../data/output/{history_filename}.pkl", "rb") as f:
            history = pickle.load(f)

        if not history:
            return
        # Plot the training and validation curves.
        plt.plot(history["acc"], label="Training Accuracy", color="green", linewidth=1)
        plt.plot(history["loss"], label="Training Loss", color="red", linewidth=1)
        plt.plot(history["val_acc"], label="Validation Accuracy", color="purple", linewidth=1)
        plt.plot(history["val_loss"], label="Validation Loss", color="blue", linewidth=1)
        plt.grid(True)
        plt.xlabel("epoch")
        plt.ylabel("acc-loss")
        plt.legend(loc="upper right")
        plt.show()

    def model_evaluate(self, model, X_val, y_val):
        """
        Evaluate the classifier.
        :param model: the trained model object
        :param X_val:
        :param y_val:
        :return: None
        """
        print("model.metrics:{0}, model.metrics_names:{1}".format(model.metrics, model.metrics_names))
        scores = model.evaluate(X_val, y_val)
        loss, accuracy = scores[0], scores[1] * 100
        print(f"Loss: {loss:.2f}, '{self.algorithm_name}' Classification Accuracy: {accuracy:.2f}%")

    def model_predict(self, model):
        """
        Test the model on a few samples.
        :param model: the trained model object
        :return: None
        """
        sentence = "这件 衣服 真的 太 好看 了 ! 好想 买 啊 "  # "This dress is so beautiful! I really want to buy it."  TODO:
        sent_vec = np.array(self.preprocess_obj.gen_sentence_vec(sentence))  # shape: (70, 200)
        if self.sent_vec_type == "matrix":  # cnn or lstm
            sent_vec = sent_vec.reshape(1, sent_vec.shape[0], sent_vec.shape[1])
        elif self.algorithm_name == "nn":  # nn
            sent_vec = sent_vec.reshape(1, sent_vec.shape[0])
        elif self.algorithm_name == "lstm" and (self.sent_vec_type == "avg" or self.sent_vec_type == "fasttext"):  # lstm
            sent_vec = sent_vec.reshape(1, 1, sent_vec.shape[0])
        print(f"'{sentence}': {np.argmax(model.predict(sent_vec))}")  # expected 0: positive

        sentence = "这 真的是 一部 非常 优秀 电影 作品"  # "This really is an excellent film."
        sent_vec = np.array(self.preprocess_obj.gen_sentence_vec(sentence))
        if self.sent_vec_type == "matrix":  # cnn or lstm
            sent_vec = sent_vec.reshape(1, sent_vec.shape[0], sent_vec.shape[1])
        elif self.algorithm_name == "nn":  # nn
            sent_vec = sent_vec.reshape(1, sent_vec.shape[0])
        elif self.algorithm_name == "lstm" and (self.sent_vec_type == "avg" or self.sent_vec_type == "fasttext"):  # lstm
            sent_vec = sent_vec.reshape(1, 1, sent_vec.shape[0])
        print(f"'{sentence}': {np.argmax(model.predict(sent_vec))}")  # expected 0: positive

        sentence = "这个 电视 真 尼玛 垃圾 , 老子 再也 不买 了"  # "This TV is utter garbage; I'm never buying one again."
        sent_vec = np.array(self.preprocess_obj.gen_sentence_vec(sentence))
        if self.sent_vec_type == "matrix":  # cnn or lstm
            sent_vec = sent_vec.reshape(1, sent_vec.shape[0], sent_vec.shape[1])
        elif self.algorithm_name == "nn":  # nn
            sent_vec = sent_vec.reshape(1, sent_vec.shape[0])
        elif self.algorithm_name == "lstm" and (self.sent_vec_type == "avg" or self.sent_vec_type == "fasttext"):  # lstm
            sent_vec = sent_vec.reshape(1, 1, sent_vec.shape[0])
        print(f"'{sentence}': {np.argmax(model.predict(sent_vec))}")  # expected 1: negative

        sentence_df = pd.read_csv("../data/input/training_set.txt", sep="\t", header=None, names=["label", "sentence"])
        sentence_df = sentence_df.sample(frac=1)
        sentence_series = sentence_df["sentence"]
        label_series = sentence_df["label"]
        print(f"label_series: {label_series.iloc[:11]}")
        count = 0
        for sentence in sentence_series:
            count += 1
            sentence = sentence.strip()
            sent_vec = np.array(self.preprocess_obj.gen_sentence_vec(sentence))
            if self.sent_vec_type == "matrix":  # cnn or lstm
                sent_vec = sent_vec.reshape(1, sent_vec.shape[0], sent_vec.shape[1])
            elif self.algorithm_name == "nn":  # nn
                sent_vec = sent_vec.reshape(1, sent_vec.shape[0])
            elif self.algorithm_name == "lstm" and (self.sent_vec_type == "avg" or self.sent_vec_type == "fasttext"):  # lstm
                sent_vec = sent_vec.reshape(1, 1, sent_vec.shape[0])
            print(f"'{sentence}': {np.argmax(model.predict(sent_vec))}")  # 0: positive, 1: negative
            if count > 10:
                break
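    # The reshape boilerplate above repeats for every sample; a hypothetical helper
    # (sketch only, not part of the original class) could centralize it:
    #
    #     def _reshape_for_model(self, sent_vec):
    #         """Reshape one sentence vector to the batch shape the model expects."""
    #         if self.sent_vec_type == "matrix":   # cnn or lstm, (70, 200) input
    #             return sent_vec.reshape(1, *sent_vec.shape)
    #         if self.algorithm_name == "nn":      # (200,) -> (1, 200)
    #             return sent_vec.reshape(1, -1)
    #         if self.algorithm_name == "lstm":    # avg/fasttext: (200,) -> (1, 1, 200)
    #             return sent_vec.reshape(1, 1, -1)
    #         return sent_vec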


if __name__ == "__main__":
    start_time = time.time()
    preprocess_obj = Preprocessing()
    # NOTE: the training pipeline below is currently toggled off with the triple-quoted
    # block; only the curve plotting at the bottom runs.
    '''
    sent_vec_type_list = ["avg", "fasttext", "matrix"]  # NN(MLP): avg or fasttext only. CNN: matrix only. LSTM: avg, fasttext or matrix.
    sent_vec_type = sent_vec_type_list[0]
    print(f"\n{sent_vec_type} and", end=" ")
    preprocess_obj.set_sent_vec_type(sent_vec_type)

    X_train, X_val, y_train, y_val = preprocess_obj.gen_train_val_data()
    y_train = np_utils.to_categorical(y_train)
    y_val = np_utils.to_categorical(y_val)

    sent_analyse = SentimentAnalysis(preprocess_obj, sent_vec_type)
    algorithm_list = ["nn", "cnn", "lstm"]
    algorithm_name = algorithm_list[0]

    if len(X_train.shape) == 2 and algorithm_name == "lstm":  # avg or fasttext
        X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1])
        X_val = X_val.reshape(X_val.shape[0], 1, X_val.shape[1])
        input_shape = (X_train.shape[1], X_train.shape[2])
    elif algorithm_name == "nn":
        input_shape = (X_train.shape[1],)
    else:  # algorithm_name == "cnn" or (algorithm_name == "lstm" and len(X_train.shape) == 3)
        input_shape = (X_train.shape[1], X_train.shape[2])
    print(X_train.shape, y_train.shape)  # (19998, 200)/(19998, 70, 200) (19998, 2)
    print(X_val.shape, y_val.shape)  # (5998, 200)/(5998, 70, 200) (5998, 2)

    print(f"{algorithm_name}:")
    sent_analyse.pick_algorithm(algorithm_name, sent_vec_type)
    # """
    model_cls = sent_analyse.model_build(input_shape=input_shape)
    model = sent_analyse.model_train(model_cls, X_train, X_val, y_train, y_val)
    # """
    sent_analyse.model_evaluate(model, X_val, y_val)
    sent_analyse.model_predict(model)
    '''
    sent_analyse = SentimentAnalysis(preprocess_obj, "")
    sent_analyse.plot_hist("history_nn_fasttext")
    end_time = time.time()
    print(f"\nProgram Running Cost {end_time - start_time:.2f}s")

--------------------------------------------------------------------------------
/src/sentiment_analysis_ml.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# coding: utf-8
# File: sentiment_analysis_ml.py
# Author: lxw
# Date: 12/21/17 8:13 AM

"""
Sentiment classification based on pre-trained word vectors and machine learning (scikit-learn).
"""

import numpy as np
import pandas as pd
import time

from keras.preprocessing import sequence
from sklearn.externals import joblib
from sklearn.model_selection import GridSearchCV

from preprocessing import Preprocessing


class SentimentAnalysis:
    def __init__(self, sent_vec_type):
        self.model_path_prefix = "../data/output/models/"
        self.algorithm_name = "nb"
        self.model_path = f"{self.model_path_prefix}{self.algorithm_name}_{sent_vec_type}.model"

    def pick_algorithm(self, algorithm_name, sent_vec_type):
        assert algorithm_name in ["nb", "dt", "knn", "svm"], "algorithm_name must be in ['nb', 'dt', 'knn', 'svm']"
        self.algorithm_name = algorithm_name
        self.model_path = f"{self.model_path_prefix}{self.algorithm_name}_{sent_vec_type}.model"

    def model_build(self):
        model_cls = None
        if self.algorithm_name == "nb":  # Naive Bayes
            from sklearn.naive_bayes import GaussianNB
            model_cls = GaussianNB()
        elif self.algorithm_name == "dt":  # Decision Tree
            from sklearn import tree
            model_cls = tree.DecisionTreeClassifier()
        elif self.algorithm_name == "knn":
            from sklearn.neighbors import KNeighborsClassifier
            model_cls = KNeighborsClassifier()
            tuned_parameters = [{"n_neighbors": range(1, 20)}]
            model_cls = GridSearchCV(model_cls, tuned_parameters, cv=5,
scoring="precision_weighted") 46 | elif self.algorithm_name == "svm": 47 | from sklearn.svm import SVC 48 | """ 49 | # OK 50 | model_cls = SVC(kernel="linear") 51 | tuned_parameters = [{"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]}, 52 | {"kernel": ["linear"], "C": [1, 10, 100, 1000]}] 53 | model_cls = GridSearchCV(model_cls, tuned_parameters, cv=5, scoring="precision_weighted") 54 | """ 55 | model_cls = SVC(C=1000, gamma=1e-3, kernel="rbf") # avg 56 | # model_cls = SVC(C=1000, kernel="linear") # fasttext 57 | 58 | return model_cls 59 | 60 | def model_train(self, model_cls, X_train, y_train): 61 | """ 62 | 分类器模型的训练 63 | :param model_cls: 所使用的算法的类的定义,尚未训练的模型 64 | :param X_train: 65 | :param y_train: 66 | :return: 训练好的模型 67 | """ 68 | model_cls.fit(X_train, y_train) 69 | # if self.algorithm_name in {"svm", "knn"}: 70 | # print(model_cls.best_params_) 71 | return model_cls # model 72 | 73 | def model_save(self, model): 74 | """ 75 | 分类器模型的保存 76 | :param model: 训练好的模型对象 77 | :return: None 78 | """ 79 | joblib.dump(model, self.model_path) 80 | 81 | def model_evaluate(self, model, X_val, y_val): 82 | """ 83 | 分类器模型的评估 84 | :param model: 训练好的模型对象 85 | :param X_val: 86 | :param y_val: 87 | :return: None 88 | """ 89 | # model = joblib.load(self.model_path) 90 | y_val = list(y_val) 91 | correct = 0 92 | y_predict = model.predict(X_val) 93 | print(f"len(y_predict): {len(y_predict)}, len(y_val): {len(y_val)}") 94 | assert len(y_predict) == len(y_val), "Unexpected Error: len(y_predict) != len(y_val), but it should be" 95 | for idx in range(len(y_predict)): 96 | if int(y_predict[idx]) == int(y_val[idx]): 97 | correct += 1 98 | score = correct / len(y_predict) 99 | print(f"'{self.algorithm_name}' Classification Accuray:{score*100:.2f}%") 100 | 101 | def model_predict(self, model, preprocess_obj): 102 | """ 103 | 模型测试 104 | :param model: 训练好的模型对象 105 | :param preprocess_obj: Preprocessing类对象 106 | :return: None 107 | """ 108 | sentence = "这 真的是 一部 非常 优秀 电影 作品" 109 | sent_vec = np.array(preprocess_obj.gen_sentence_vec(sentence)).reshape(1, -1) # shape: (1, 1000) 110 | # print(f"sent_vec: {sent_vec.tolist()}") 111 | if preprocess_obj.sentence_vec_type == "concatenate": 112 | # NOTE: 注意,这里的dtype是必须的,否则dtype默认值是'int32', 词向量所有的数值会被全部转换为0 113 | sent_vec = sequence.pad_sequences(sent_vec, maxlen=preprocess_obj.MAX_SENT_LEN * preprocess_obj.vector_size, 114 | value=0, dtype=np.float) 115 | # print(f"sent_vec: {sent_vec.tolist()}") 116 | print(f"'{sentence}': {model.predict(sent_vec)}") # 0: 正向 117 | 118 | sentence = "这个 电视 真 尼玛 垃圾 , 老子 再也 不买 了" 119 | sent_vec = np.array(preprocess_obj.gen_sentence_vec(sentence)).reshape(1, -1) 120 | # print(f"sent_vec: {sent_vec.tolist()}") 121 | if preprocess_obj.sentence_vec_type == "concatenate": 122 | sent_vec = sequence.pad_sequences(sent_vec, maxlen=preprocess_obj.MAX_SENT_LEN * preprocess_obj.vector_size, 123 | value=0, dtype=np.float) 124 | # print(f"sent_vec: {sent_vec.tolist()}") 125 | print(f"'{sentence}': {model.predict(sent_vec)}") # 1: 负向 126 | 127 | sentence_df = pd.read_csv("../data/input/validation_set.txt", sep="\t", header=None, names=["label", "sentence"]) 128 | sentence_df = sentence_df.sample(frac=1) 129 | sentence_series = sentence_df["sentence"] 130 | label_series = sentence_df["label"] 131 | print(f"label_series: {label_series.iloc[:11]}") 132 | count = 0 133 | for sentence in sentence_series: 134 | count += 1 135 | sentence = sentence.strip() 136 | sent_vec = np.array(preprocess_obj.gen_sentence_vec(sentence)).reshape(1, -1) 137 | # 
print(f"sent_vec: {sent_vec.tolist()}") 138 | if preprocess_obj.sentence_vec_type == "concatenate": 139 | sent_vec = sequence.pad_sequences(sent_vec, maxlen=preprocess_obj.MAX_SENT_LEN * preprocess_obj.vector_size, 140 | value=0, dtype=np.float) 141 | # print(f"sent_vec: {sent_vec.tolist()}") 142 | print(f"'{sentence}': {model.predict(sent_vec)}") # 0: 正向, 1: 负向 143 | if count > 10: 144 | break 145 | 146 | 147 | if __name__ == "__main__": 148 | start_time = time.time() 149 | preprocess_obj = Preprocessing() 150 | 151 | sent_vec_type_list = ["avg", "fasttext", "concatenate"] 152 | sent_vec_type = sent_vec_type_list[0] 153 | print(f"\n{sent_vec_type} and", end=" ") 154 | preprocess_obj.set_sent_vec_type(sent_vec_type) 155 | 156 | X_train, X_val, y_train, y_val = preprocess_obj.gen_train_val_data() 157 | # print(X_train.shape, y_train.shape) # (19998, 100) (19998,) 158 | # print(X_val.shape, y_val.shape) # (5998, 100) (5998,) 159 | 160 | sent_analyse = SentimentAnalysis(sent_vec_type) 161 | algorithm_list = ["nb", "dt", "knn", "svm"] 162 | algorithm_name = algorithm_list[2] 163 | print(f"{algorithm_name}:") 164 | sent_analyse.pick_algorithm(algorithm_name, sent_vec_type) 165 | model_cls = sent_analyse.model_build() 166 | model = sent_analyse.model_train(model_cls, X_train, y_train) 167 | sent_analyse.model_save(model) 168 | """ 169 | """ 170 | model = joblib.load(sent_analyse.model_path) 171 | sent_analyse.model_evaluate(model, X_val, y_val) 172 | sent_analyse.model_predict(model, preprocess_obj) 173 | end_time = time.time() 174 | print(f"\nProgram Running Cost {end_time -start_time:.2f}s") 175 | 176 | --------------------------------------------------------------------------------