├── .gitignore
├── README.md
├── data
│   └── input
│       ├── training_set.txt
│       └── validation_set.txt
├── docs
│   └── images
│       ├── history_cnn_matrix.png
│       ├── history_lstm_avg.png
│       ├── history_lstm_fasttext.png
│       ├── history_lstm_matrix.png
│       ├── history_nn_avg.png
│       └── history_nn_fasttext.png
├── requirements.txt
└── src
    ├── preprocessing.py
    ├── sentiment_analysis_dl.py
    └── sentiment_analysis_ml.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
**/*.pyc
**/__pycache__
.idea/

data/input/models/
data/output/
src/scripts/

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Sentiment Analysis Implementations
**Chinese Sentiment Analysis** of movie-review data, implemented with both ML (Machine Learning) and DL (Deep Learning) algorithms: Naive Bayes, Decision Tree, KNN, SVM, NN (MLP), CNN, and RNN (LSTM).

## 0. Requirements
All code in this project is implemented in [Python 3.6+](https://www.python.org/downloads/).
All essential packages are listed in `requirements.txt`; you can install them with `pip install -r requirements.txt -i https://pypi.douban.com/simple/`.
[Anaconda](https://docs.anaconda.com/anaconda/) or [virtualenv + virtualenvwrapper](http://www.jianshu.com/p/44ab75fbaef2) is strongly recommended for managing your Python environments.

## 1. Data Preparation
**1) Dataset**
Movie reviews are used as the dataset: 20,000 training samples (10,000 positive, 10,000 negative) and 6,000 test samples (3,000 positive, 3,000 negative).

**2) Preprocessing**
1. Remove stop words and segment the text with jieba.
2. Vectorize each sentence with a pre-trained word vector model (a minimal sketch follows).
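The preprocessing script itself is not shipped in this repository, so here is a minimal sketch of what step 2 could look like (illustrative only: the stop-word list `stopwords.txt` is an assumed external file, and the model path mirrors the one hard-coded in `src/preprocessing.py`):

```python
import jieba
import numpy as np
from pyfasttext import FastText

# Assumed external stop-word list, one word per line.
stopwords = set(line.strip() for line in open("stopwords.txt", encoding="utf-8"))
model = FastText("../data/input/models/880w_fasttext_skip_gram.bin")

def sentence_to_avg_vector(text):
    """Segment with jieba, drop stop words, and average the word vectors."""
    words = [w for w in jieba.cut(text) if w.strip() and w not in stopwords]
    vectors = [model.get_numpy_vector(w) for w in words]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.args.get("dim"))

print(sentence_to_avg_vector("这部电影真的很好看").shape)  # (200,) with the bundled model
```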
## 2. Accuracy Comparison
| Algorithm | Accuracy (avg) | Accuracy (fasttext) | Accuracy (matrix) | Notes |
| :---: | :---: | :---: | :---: | :---: |
| Naive Bayes | 73.72% | 74.32% | 69.34% (concatenate & pad) | / |
| Decision Tree | 65.27% | 66.84% | 55.34% (concatenate & pad) | / |
| KNN | 76.69% ({'n_neighbors': 19}) | 77.43% ({'n_neighbors': 17}) | / (concatenate & pad) | Hyperparameters chosen with GridSearchCV |
| SVM | 79.29% ({'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}) | 78.93% ({'C': 1000, 'kernel': 'linear'}) | / (concatenate & pad) | Hyperparameters chosen with GridSearchCV |
| NN (MLP) | 80.24% | 80.41% | / | Uses EarlyStopping, ModelCheckpoint, ReduceLROnPlateau |
| CNN | / | / | 81.34% | Uses EarlyStopping, ModelCheckpoint, ReduceLROnPlateau |
| LSTM | 78.76% | 77.26% | 84.06% | Uses EarlyStopping, ModelCheckpoint, ReduceLROnPlateau |

## 3. Acc-Loss Curves
1) NN (MLP):
Sentences represented as the average of their word vectors:
![history_nn_avg.png](docs/images/history_nn_avg.png)
Sentences represented with `fasttext.get_numpy_sentence_vector()`:
![history_nn_fasttext.png](docs/images/history_nn_fasttext.png)
2) CNN:
Sentences represented as a matrix of `fasttext.get_numpy_vector()` word vectors:
![history_cnn_matrix.png](docs/images/history_cnn_matrix.png)
3) LSTM:
Sentences represented as the average of their word vectors:
![history_lstm_avg.png](docs/images/history_lstm_avg.png)
Sentences represented with `fasttext.get_numpy_sentence_vector()`:
![history_lstm_fasttext.png](docs/images/history_lstm_fasttext.png)
Sentences represented as a matrix of `fasttext.get_numpy_vector()` word vectors:
![history_lstm_matrix.png](docs/images/history_lstm_matrix.png)

--------------------------------------------------------------------------------
/docs/images/history_cnn_matrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxw0109/ChineseSentimentAnalysis/1084e8ab0060df9c389cd4586fc40527d09cecfd/docs/images/history_cnn_matrix.png

--------------------------------------------------------------------------------
/docs/images/history_lstm_avg.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxw0109/ChineseSentimentAnalysis/1084e8ab0060df9c389cd4586fc40527d09cecfd/docs/images/history_lstm_avg.png

--------------------------------------------------------------------------------
/docs/images/history_lstm_fasttext.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxw0109/ChineseSentimentAnalysis/1084e8ab0060df9c389cd4586fc40527d09cecfd/docs/images/history_lstm_fasttext.png

--------------------------------------------------------------------------------
/docs/images/history_lstm_matrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxw0109/ChineseSentimentAnalysis/1084e8ab0060df9c389cd4586fc40527d09cecfd/docs/images/history_lstm_matrix.png

--------------------------------------------------------------------------------
/docs/images/history_nn_avg.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxw0109/ChineseSentimentAnalysis/1084e8ab0060df9c389cd4586fc40527d09cecfd/docs/images/history_nn_avg.png

--------------------------------------------------------------------------------
/docs/images/history_nn_fasttext.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lxw0109/ChineseSentimentAnalysis/1084e8ab0060df9c389cd4586fc40527d09cecfd/docs/images/history_nn_fasttext.png

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
absl-py==0.2.2
astor==0.6.2
backcall==0.1.0
bleach==1.5.0
boto==2.48.0
boto3==1.7.42
botocore==1.10.42
bz2file==0.98
certifi==2018.4.16
chardet==3.0.4
cycler==0.10.0
cysignals==1.7.1
Cython==0.28.3
decorator==4.3.0
docutils==0.14
future==0.16.0
gast==0.2.0
gensim==3.4.0
grpcio==1.12.1
h5py==2.8.0
html5lib==0.9999999
idna==2.7
ipython==6.4.0
ipython-genutils==0.2.0
jedi==0.12.0
jieba==0.39
jmespath==0.9.3
Keras==2.2.0
Keras-Applications==1.0.2
Keras-Preprocessing==1.0.1
kiwisolver==1.0.1
Markdown==2.6.11
matplotlib==2.2.2
numpy==1.14.5
pandas==0.23.1
parso==0.2.1
pexpect==4.6.0
pickleshare==0.7.4
prompt-toolkit==1.0.15
protobuf==3.6.0
ptyprocess==0.5.2
pyfasttext==0.4.5
Pygments==2.2.0
pyparsing==2.2.0
python-dateutil==2.7.3
pytz==2018.4
PyYAML==3.12
requests==2.19.1
s3transfer==0.1.13
scikit-learn==0.19.1
scipy==1.1.0
seaborn==0.8.1
simplegeneric==0.8.1
six==1.11.0
sklearn==0.0
smart-open==1.5.7
tensorboard==1.8.0
tensorflow==1.8.0
termcolor==1.1.0
traitlets==4.3.2
urllib3==1.23
wcwidth==0.1.7
Werkzeug==0.14.1
xgboost==0.71
--------------------------------------------------------------------------------
/src/preprocessing.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# coding: utf-8
# File: preprocessing.py
# Author: lxw
# Date: 12/20/17 8:10 AM


import numpy as np
import pandas as pd
import time

from keras.preprocessing import sequence
from collections import Counter
from pyfasttext import FastText


class Preprocessing:
    def __init__(self):
        start_time = time.time()
        # self.model = FastText("../data/input/models/sg_pyfasttext.bin")  # DEBUG
        self.model = FastText("../data/input/models/880w_fasttext_skip_gram.bin")
        end_time = time.time()
        print(f"Loading word vector model cost: {end_time - start_time:.2f}s")

        # self.vocab_size, self.vector_size = self.model.numpy_normalized_vectors.shape  # OK
        self.vocab_size = self.model.nwords
        self.vector_size = self.model.args.get("dim")
        # self.vector_size: 200, self.vocab_size: 925242
        print(f"self.vector_size:{self.vector_size}, self.vocab_size: {self.vocab_size}")

        # Sentence representations:
        # {"avg": mean of the word vectors, "fasttext": get_numpy_sentence_vector,
        #  "concatenate": word vectors concatenated and padded, "matrix": matrix of word vectors}
        self.sentence_vec_type = "matrix"

        self.MAX_SENT_LEN = 70  # DEBUG: hyperparameter, cf. self.get_sent_max_length()
        # For "concatenate", self.MAX_SENT_LEN = 30. Accuracy for other values -- 100: 50.22%, 80: 50.23%, 70: 50.33%, 60: 55.92%, 50: 69.11%, 40: 68.91%, 36: 69.34%, 30: 69.22%, 20: 69.17%, 10: 67.07%
        # For "matrix", self.MAX_SENT_LEN = 70. Accuracy for other values -- TODO:
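    # Resulting sentence-vector shapes for reference (assuming vector_size == 200 and
    # MAX_SENT_LEN == 70, as with the bundled model; not part of the original code):
    #   "avg"         -> (200,)             mean of the word vectors
    #   "fasttext"    -> (200,)             model.get_numpy_sentence_vector()
    #   "concatenate" -> (n_words * 200,)   variable length; padded later to (70 * 200,)
    #   "matrix"      -> (70, 200)          one row per word, padded with -1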

    @classmethod
    def data_analysis(cls):
        # Imported locally: matplotlib and seaborn are only needed for this optional analysis.
        import matplotlib.pyplot as plt
        import seaborn as sns

        train_df = pd.read_csv("../data/input/training_set.txt", sep="\t", header=None, names=["label", "sentence"])
        val_df = pd.read_csv("../data/input/validation_set.txt", sep="\t", header=None, names=["label", "sentence"])
        y_train = train_df["label"]
        y_val = val_df["label"]
        sns.set(style="white", context="notebook", palette="deep")
        # Inspect the sample distribution (whether the labels are balanced).
        sns.countplot(y_train)
        plt.show()
        sns.countplot(y_val)
        plt.show()
        print(y_train.value_counts())
        print(y_val.value_counts())

    def set_sent_vec_type(self, sentence_vec_type):
        assert sentence_vec_type in ["avg", "concatenate", "fasttext", "matrix"], \
            "sentence_vec_type must be in ['avg', 'fasttext', 'concatenate', 'matrix']"
        self.sentence_vec_type = sentence_vec_type

    def get_sent_max_length(self):  # NOT_USED
        sent_len_counter = Counter()
        max_length = 0
        with open("../data/input/training_set.txt") as f:
            for line in f:
                content = line.strip().split("\t")[1]
                content_list = content.split()
                length = len(content_list)
                sent_len_counter[length] += 1
                if max_length <= length:
                    max_length = length
        sent_len_counter = sorted(list(sent_len_counter.items()), key=lambda x: x[0])
        print(sent_len_counter)
        # [(31, 1145), (32, 1105), (33, 1017), (34, 938), (35, 839), (36, 830), (37, 775), (38, 737), (39, 720), (40, 643), (41, 575), (42, 584), (43, 517), (44, 547), (45, 514), (46, 514), (47, 480), (48, 460), (49, 470), (50, 444), (51, 484), (52, 432), (53, 462), (54, 495), (55, 487), (56, 500), (57, 496), (58, 489), (59, 419), (60, 387), (61, 348), (62, 265), (63, 222), (64, 153), (65, 127), (66, 103), (67, 67), (68, 34), (69, 21), (70, 22), (71, 8), (72, 6), (73, 4), (74, 10), (75, 2), (76, 4), (77, 2), (78, 1), (79, 2), (80, 4), (81, 2), (82, 3), (83, 1), (84, 5), (86, 4), (87, 3), (88, 3), (89, 2), (90, 2), (91, 3), (92, 5), (93, 2), (94, 4), (96, 1), (97, 5), (98, 1), (99, 2), (100, 2), (101, 2), (102, 1), (103, 2), (104, 2), (105, 2), (106, 5), (107, 3), (108, 2), (109, 3), (110, 4), (111, 1), (112, 2), (113, 3), (114, 1), (116, 1), (119, 3), (679, 1)]
        return max_length

    def gen_sentence_vec(self, sentence):
        """
        Convert a whitespace-segmented sentence into its vector representation.
        :param sentence: a segmented sentence (tokens separated by spaces)
        :return: the sentence vector (shape depends on self.sentence_vec_type)
        """
        sentence = sentence.strip()
        if self.sentence_vec_type == "fasttext":
            return self.model.get_numpy_sentence_vector(sentence)

        word_list = sentence.split(" ")
        if self.sentence_vec_type == "concatenate":
            sentence_vector = self.model.get_numpy_vector(word_list[0])
            for word in word_list[1:]:
                sentence_vector = np.hstack((sentence_vector, self.model.get_numpy_vector(word)))
            return sentence_vector  # NOTE: for "concatenate", sentence vectors vary in length from sentence to sentence
        elif self.sentence_vec_type == "matrix":  # for Deep Learning.
            sentence_matrix = []
            for word in word_list[-self.MAX_SENT_LEN:]:  # NOTE: keeping the tail of the sentence seems to work better (cf. https://github.com/lxw0109/SentimentClassification_UMICH_SI650/blob/master/src/LSTM_wo_pretrained_vector.py#L86)
                sentence_matrix.append(self.model.get_numpy_vector(word))
            length = len(sentence_matrix)
            # Always holds, because of the slicing above.
            assert length <= self.MAX_SENT_LEN, "CRITICAL ERROR: len(sentence_matrix) > self.MAX_SENT_LEN."
            # np.pad takes the list of ndarrays and returns a 2-D ndarray. Padding with -1
            # lets the LSTM's Masking layer (mask_value=-1) skip the padded rows.
            sentence_matrix = np.pad(sentence_matrix, pad_width=((0, self.MAX_SENT_LEN - length), (0, 0)),
                                     mode="constant", constant_values=-1)
            return sentence_matrix
        else:  # self.sentence_vec_type == "avg"
            sentence_vector = np.zeros(self.vector_size)
            # print(f"type(sentence_vector): {type(sentence_vector)}")
            for idx, word in enumerate(word_list):
                # print(f"type(self.model.get_numpy_vector(word)): {type(self.model.get_numpy_vector(word))}")
                sentence_vector += self.model.get_numpy_vector(word)
            return sentence_vector / len(word_list)

    def gen_train_val_data(self):
        # Build the training & validation data.
        train_df = pd.read_csv("../data/input/training_set.txt", sep="\t", header=None, names=["label", "sentence"])
        val_df = pd.read_csv("../data/input/validation_set.txt", sep="\t", header=None, names=["label", "sentence"])
        # Shuffle the training set. TODO: without shuffling, the trained model seems broken
        # (the positive demo sentence always predicts 1?).
        train_df = train_df.sample(frac=1, random_state=1)
        # val_df = val_df.sample(frac=1, random_state=1)  # no need to shuffle the validation set

        X_train = train_df["sentence"]
        X_train_vec = list()
        for sentence in X_train:
            sent_vector = self.gen_sentence_vec(sentence)
            X_train_vec.append(sent_vector)
        y_train = train_df["label"]

        X_val = val_df["sentence"]
        X_val_vec = list()
        for sentence in X_val:
            sent_vector = self.gen_sentence_vec(sentence)
            X_val_vec.append(sent_vector)
        y_val = val_df["label"]

        if self.sentence_vec_type == "concatenate":
            # NOTE: dtype is required here; the default is "int32", which would truncate
            # every word-vector value to 0.
            X_train_vec = sequence.pad_sequences(X_train_vec, maxlen=self.MAX_SENT_LEN * self.vector_size, value=0,
                                                 dtype=np.float)
            X_val_vec = sequence.pad_sequences(X_val_vec, maxlen=self.MAX_SENT_LEN * self.vector_size, value=0,
                                               dtype=np.float)

        return np.array(X_train_vec), np.array(X_val_vec), np.array(y_train), np.array(y_val)
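    # Quick illustration of the dtype pitfall noted above (hypothetical REPL session,
    # not part of the original module):
    #   >>> sequence.pad_sequences([[0.7, -0.2]], maxlen=4)
    #   array([[0, 0, 0, 0]], dtype=int32)   # default dtype="int32" truncates floats to 0
    #   >>> sequence.pad_sequences([[0.7, -0.2]], maxlen=4, dtype=np.float)
    #   array([[ 0. ,  0. ,  0.7, -0.2]])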

if __name__ == "__main__":
    preprocess_obj = Preprocessing()
    """
    preprocess_obj.get_sent_max_length()
    sentence = "刘晓伟 好人"  # Should the "fasttext" branch of gen_sentence_vec() also expect space-separated input like this?
    # sentence = "刘晓伟好人"  # NOTE: yields a different vector than the space-separated form
    preprocess_obj.set_sent_vec_type("fasttext")
    print(f'fasttext: {preprocess_obj.gen_sentence_vec(sentence)}')
    preprocess_obj.set_sent_vec_type("avg")
    print(f'avg: {preprocess_obj.gen_sentence_vec(sentence)}')
    preprocess_obj.set_sent_vec_type("concatenate")
    print(f'concatenate: {preprocess_obj.gen_sentence_vec(sentence)}')
    """

    X_train, X_val, y_train, y_val = preprocess_obj.gen_train_val_data()
    # print(f"X_train: {X_train}\nX_val: {X_val}\ny_train: {y_train}\ny_val: {y_val}")
    print(f"X_train.shape: {X_train.shape}\nX_val.shape: {X_val.shape}\n"
          f"y_train.shape: {y_train.shape}\ny_val.shape: {y_val.shape}")

    # Preprocessing.data_analysis()

--------------------------------------------------------------------------------
/src/sentiment_analysis_dl.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# coding: utf-8
# File: sentiment_analysis_dl.py
# Author: lxw
# Date: 12/21/17 9:23 AM

"""
Sentiment classification based on pre-trained word vectors and deep learning (Keras).
"""

import numpy as np
import pandas as pd
import pickle
import time

from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint
from keras.callbacks import ReduceLROnPlateau
from keras.layers import Conv1D
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import GlobalAveragePooling1D
from keras.layers import LSTM
from keras.layers import Masking
from keras.layers import MaxPooling1D
from keras.models import Sequential
from keras.preprocessing import sequence
from keras.utils import np_utils

from preprocessing import Preprocessing


class SentimentAnalysis:
    def __init__(self, preprocess_obj, sent_vec_type):
        self.model_path_prefix = "../data/output/models/"
        self.algorithm_name = "nn"
        self.model_path = f"{self.model_path_prefix}{self.algorithm_name}_{sent_vec_type}"
        self.preprocess_obj = preprocess_obj
        self.sent_vec_type = sent_vec_type
        self.batch_size = 512  # TODO
        self.epochs = 1000  # TODO

    def pick_algorithm(self, algorithm_name, sent_vec_type):
        assert algorithm_name in ["nn", "cnn", "lstm"], "algorithm_name must be in ['nn', 'cnn', 'lstm']"
        self.algorithm_name = algorithm_name
        self.model_path = f"{self.model_path_prefix}{self.algorithm_name}_{sent_vec_type}"
        self.sent_vec_type = sent_vec_type

    def model_build(self, input_shape):
        model_cls = Sequential()
        if self.algorithm_name == "nn":  # Neural Network (Multi-Layer Perceptron)
            # An explicit activation is essential for Dense; otherwise a linear activation is used.
            model_cls.add(Dense(64, input_shape=input_shape, activation="relu", name="dense1"))
            model_cls.add(Dropout(0.25, name="dropout2"))
            model_cls.add(Dense(64, activation="relu", name="dense3"))
            model_cls.add(Dropout(0.25, name="dropout4"))
            model_cls.add(Dense(2, activation="softmax", name="dense5"))
            # model_cls.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])  # TODO:
            model_cls.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
        elif self.algorithm_name == "cnn":
            # For Conv1D, input_shape is (steps, channels): steps corresponds to the sentence
            # length (MAX_SENT_LEN) and channels to the word-vector dimension.
            model_cls.add(Conv1D(64, 3, activation="relu", input_shape=input_shape))  # filters, kernel_size
            model_cls.add(Conv1D(64, 3, activation="relu"))
            model_cls.add(MaxPooling1D(3))
            model_cls.add(Conv1D(128, 3, activation="relu"))
            model_cls.add(Conv1D(128, 3, activation="relu"))
            model_cls.add(GlobalAveragePooling1D())
            model_cls.add(Dropout(0.25))
            model_cls.add(Dense(2, activation="sigmoid"))

            model_cls.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])  # TODO: categorical_crossentropy
        elif self.algorithm_name == "lstm":
            model_cls.add(Masking(mask_value=-1, input_shape=input_shape, name="masking_layer"))
            model_cls.add(LSTM(units=64, return_sequences=True, dropout=0.25, name="lstm1"))
            model_cls.add(LSTM(units=128, return_sequences=False, dropout=0.25, name="lstm2"))
            model_cls.add(Dense(units=2, activation="softmax", name="dense5"))

            model_cls.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

        return model_cls
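    # Input shapes expected by each branch above (reference derived from the __main__
    # flow at the bottom of this file; dim == 200 for the bundled fastText model):
    #   nn   + avg/fasttext: input_shape == (dim,)              X: (n_samples, dim)
    #   cnn  + matrix:       input_shape == (MAX_SENT_LEN, dim) X: (n_samples, 70, dim)
    #   lstm + matrix:       input_shape == (MAX_SENT_LEN, dim) X: (n_samples, 70, dim)
    #   lstm + avg/fasttext: input_shape == (1, dim)            X: (n_samples, 1, dim), one timestep per sentence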

    def model_train(self, model_cls, X_train, X_val, y_train, y_val):
        """
        Train the classifier.
        :param model_cls: the compiled but untrained model
        :param X_train:
        :param y_train:
        :param X_val:
        :param y_val:
        :return: the trained model
        """
        early_stopping = EarlyStopping(monitor="val_loss", patience=10)
        lr_reduction = ReduceLROnPlateau(monitor="val_loss", patience=5, verbose=1, factor=0.2, min_lr=1e-5)
        # Checkpoint the best model: save whenever val_loss improves, one file per improvement.
        # model_path = f"../data/output/models/{self.algorithm_name}_best_model_{epoch:02d}_{val_loss:.2f}.hdf5"  # NO: an f-string is evaluated immediately
        model_path = "../data/output/models/best_model_{epoch:02d}_{val_loss:.2f}.hdf5"  # OK: Keras fills in the template
        checkpoint = ModelCheckpoint(filepath=model_path, monitor="val_loss", verbose=1, save_best_only=True,
                                     mode="min")

        hist_obj = model_cls.fit(X_train, y_train, batch_size=self.batch_size, epochs=self.epochs, verbose=1,
                                 validation_data=(X_val, y_val), callbacks=[early_stopping, lr_reduction, checkpoint])
        with open(f"../data/output/history_{self.algorithm_name}_{self.sent_vec_type}.pkl", "wb") as f:
            pickle.dump(hist_obj.history, f)
        return model_cls  # model
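    # A saved checkpoint can later be restored with Keras' standard loader
    # (illustrative; the actual filename depends on the epoch/val_loss reached):
    #   from keras.models import load_model
    #   model = load_model("../data/output/models/best_model_07_0.42.hdf5")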

    def plot_hist(self, history_filename):
        import matplotlib.pyplot as plt

        history = None
        with open(f"../data/output/{history_filename}.pkl", "rb") as f:
            history = pickle.load(f)

        if not history:
            return
        # Plot the training and validation curves.
        plt.plot(history["acc"], label="Training Accuracy", color="green", linewidth=1)
        plt.plot(history["loss"], label="Training Loss", color="red", linewidth=1)
        plt.plot(history["val_acc"], label="Validation Accuracy", color="purple", linewidth=1)
        plt.plot(history["val_loss"], label="Validation Loss", color="blue", linewidth=1)
        plt.grid(True)
        plt.xlabel("epoch")
        plt.ylabel("acc-loss")
        plt.legend(loc="upper right")
        plt.show()

    def model_evaluate(self, model, X_val, y_val):
        """
        Evaluate the classifier.
        :param model: the trained model object
        :param X_val:
        :param y_val:
        :return: None
        """
        print("model.metrics:{0}, model.metrics_names:{1}".format(model.metrics, model.metrics_names))
        scores = model.evaluate(X_val, y_val)
        loss, accuracy = scores[0], scores[1] * 100
        print(f"Loss: {loss:.2f}, '{self.algorithm_name}' Classification Accuracy: {accuracy:.2f}%")

    def model_predict(self, model):
        """
        Test the model on a few samples.
        :param model: the trained model object
        :return: None
        """
        sentence = "这件 衣服 真的 太 好看 了 ! 好想 买 啊 "  # "This dress is so beautiful! I really want to buy it."  TODO:
        sent_vec = np.array(self.preprocess_obj.gen_sentence_vec(sentence))  # shape: (70, 200)
        if self.sent_vec_type == "matrix":  # cnn or lstm
            sent_vec = sent_vec.reshape(1, sent_vec.shape[0], sent_vec.shape[1])
        elif self.algorithm_name == "nn":  # nn
            sent_vec = sent_vec.reshape(1, sent_vec.shape[0])
        elif self.algorithm_name == "lstm" and (self.sent_vec_type == "avg" or self.sent_vec_type == "fasttext"):  # lstm
            sent_vec = sent_vec.reshape(1, 1, sent_vec.shape[0])
        print(f"'{sentence}': {np.argmax(model.predict(sent_vec))}")  # expected 0: positive

        sentence = "这 真的是 一部 非常 优秀 电影 作品"  # "This really is an excellent film."
        sent_vec = np.array(self.preprocess_obj.gen_sentence_vec(sentence))
        if self.sent_vec_type == "matrix":  # cnn or lstm
            sent_vec = sent_vec.reshape(1, sent_vec.shape[0], sent_vec.shape[1])
        elif self.algorithm_name == "nn":  # nn
            sent_vec = sent_vec.reshape(1, sent_vec.shape[0])
        elif self.algorithm_name == "lstm" and (self.sent_vec_type == "avg" or self.sent_vec_type == "fasttext"):  # lstm
            sent_vec = sent_vec.reshape(1, 1, sent_vec.shape[0])
        print(f"'{sentence}': {np.argmax(model.predict(sent_vec))}")  # expected 0: positive

        sentence = "这个 电视 真 尼玛 垃圾 , 老子 再也 不买 了"  # "This TV is utter garbage; I'm never buying one again."
        sent_vec = np.array(self.preprocess_obj.gen_sentence_vec(sentence))
        if self.sent_vec_type == "matrix":  # cnn or lstm
            sent_vec = sent_vec.reshape(1, sent_vec.shape[0], sent_vec.shape[1])
        elif self.algorithm_name == "nn":  # nn
            sent_vec = sent_vec.reshape(1, sent_vec.shape[0])
        elif self.algorithm_name == "lstm" and (self.sent_vec_type == "avg" or self.sent_vec_type == "fasttext"):  # lstm
            sent_vec = sent_vec.reshape(1, 1, sent_vec.shape[0])
        print(f"'{sentence}': {np.argmax(model.predict(sent_vec))}")  # expected 1: negative

        sentence_df = pd.read_csv("../data/input/training_set.txt", sep="\t", header=None, names=["label", "sentence"])
        sentence_df = sentence_df.sample(frac=1)
        sentence_series = sentence_df["sentence"]
        label_series = sentence_df["label"]
        print(f"label_series: {label_series.iloc[:11]}")
        count = 0
        for sentence in sentence_series:
            count += 1
            sentence = sentence.strip()
            sent_vec = np.array(self.preprocess_obj.gen_sentence_vec(sentence))
            if self.sent_vec_type == "matrix":  # cnn or lstm
                sent_vec = sent_vec.reshape(1, sent_vec.shape[0], sent_vec.shape[1])
            elif self.algorithm_name == "nn":  # nn
                sent_vec = sent_vec.reshape(1, sent_vec.shape[0])
            elif self.algorithm_name == "lstm" and (self.sent_vec_type == "avg" or self.sent_vec_type == "fasttext"):  # lstm
                sent_vec = sent_vec.reshape(1, 1, sent_vec.shape[0])
            print(f"'{sentence}': {np.argmax(model.predict(sent_vec))}")  # 0: positive, 1: negative
            if count > 10:
                break
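    # The reshape boilerplate above repeats for every sample; a hypothetical helper
    # (sketch only, not part of the original class) could centralize it:
    #
    #     def _reshape_for_model(self, sent_vec):
    #         """Reshape one sentence vector to the batch shape the model expects."""
    #         if self.sent_vec_type == "matrix":   # cnn or lstm, (70, 200) input
    #             return sent_vec.reshape(1, *sent_vec.shape)
    #         if self.algorithm_name == "nn":      # (200,) -> (1, 200)
    #             return sent_vec.reshape(1, -1)
    #         if self.algorithm_name == "lstm":    # avg/fasttext: (200,) -> (1, 1, 200)
    #             return sent_vec.reshape(1, 1, -1)
    #         return sent_vec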


if __name__ == "__main__":
    start_time = time.time()
    preprocess_obj = Preprocessing()
    # NOTE: the training pipeline below is currently toggled off with the triple-quoted
    # block; only the curve plotting at the bottom runs.
    '''
    sent_vec_type_list = ["avg", "fasttext", "matrix"]  # NN(MLP): avg or fasttext only. CNN: matrix only. LSTM: avg, fasttext or matrix.
    sent_vec_type = sent_vec_type_list[0]
    print(f"\n{sent_vec_type} and", end=" ")
    preprocess_obj.set_sent_vec_type(sent_vec_type)

    X_train, X_val, y_train, y_val = preprocess_obj.gen_train_val_data()
    y_train = np_utils.to_categorical(y_train)
    y_val = np_utils.to_categorical(y_val)

    sent_analyse = SentimentAnalysis(preprocess_obj, sent_vec_type)
    algorithm_list = ["nn", "cnn", "lstm"]
    algorithm_name = algorithm_list[0]

    if len(X_train.shape) == 2 and algorithm_name == "lstm":  # avg or fasttext
        X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1])
        X_val = X_val.reshape(X_val.shape[0], 1, X_val.shape[1])
        input_shape = (X_train.shape[1], X_train.shape[2])
    elif algorithm_name == "nn":
        input_shape = (X_train.shape[1],)
    else:  # algorithm_name == "cnn" or (algorithm_name == "lstm" and len(X_train.shape) == 3)
        input_shape = (X_train.shape[1], X_train.shape[2])
    print(X_train.shape, y_train.shape)  # (19998, 200)/(19998, 70, 200) (19998, 2)
    print(X_val.shape, y_val.shape)  # (5998, 200)/(5998, 70, 200) (5998, 2)

    print(f"{algorithm_name}:")
    sent_analyse.pick_algorithm(algorithm_name, sent_vec_type)
    # """
    model_cls = sent_analyse.model_build(input_shape=input_shape)
    model = sent_analyse.model_train(model_cls, X_train, X_val, y_train, y_val)
    # """
    sent_analyse.model_evaluate(model, X_val, y_val)
    sent_analyse.model_predict(model)
    '''
    sent_analyse = SentimentAnalysis(preprocess_obj, "")
    sent_analyse.plot_hist("history_nn_fasttext")
    end_time = time.time()
    print(f"\nProgram Running Cost {end_time - start_time:.2f}s")

--------------------------------------------------------------------------------
/src/sentiment_analysis_ml.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# coding: utf-8
# File: sentiment_analysis_ml.py
# Author: lxw
# Date: 12/21/17 8:13 AM

"""
Sentiment classification based on pre-trained word vectors and machine learning (scikit-learn).
"""

import numpy as np
import pandas as pd
import time

from keras.preprocessing import sequence
from sklearn.externals import joblib
from sklearn.model_selection import GridSearchCV

from preprocessing import Preprocessing


class SentimentAnalysis:
    def __init__(self, sent_vec_type):
        self.model_path_prefix = "../data/output/models/"
        self.algorithm_name = "nb"
        self.model_path = f"{self.model_path_prefix}{self.algorithm_name}_{sent_vec_type}.model"

    def pick_algorithm(self, algorithm_name, sent_vec_type):
        assert algorithm_name in ["nb", "dt", "knn", "svm"], "algorithm_name must be in ['nb', 'dt', 'knn', 'svm']"
        self.algorithm_name = algorithm_name
        self.model_path = f"{self.model_path_prefix}{self.algorithm_name}_{sent_vec_type}.model"

    def model_build(self):
        model_cls = None
        if self.algorithm_name == "nb":  # Naive Bayes
            from sklearn.naive_bayes import GaussianNB
            model_cls = GaussianNB()
        elif self.algorithm_name == "dt":  # Decision Tree
            from sklearn import tree
            model_cls = tree.DecisionTreeClassifier()
        elif self.algorithm_name == "knn":
            from sklearn.neighbors import KNeighborsClassifier
            model_cls = KNeighborsClassifier()
            tuned_parameters = [{"n_neighbors": range(1, 20)}]
            model_cls = GridSearchCV(model_cls, tuned_parameters, cv=5,
scoring="precision_weighted") 46 | elif self.algorithm_name == "svm": 47 | from sklearn.svm import SVC 48 | """ 49 | # OK 50 | model_cls = SVC(kernel="linear") 51 | tuned_parameters = [{"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]}, 52 | {"kernel": ["linear"], "C": [1, 10, 100, 1000]}] 53 | model_cls = GridSearchCV(model_cls, tuned_parameters, cv=5, scoring="precision_weighted") 54 | """ 55 | model_cls = SVC(C=1000, gamma=1e-3, kernel="rbf") # avg 56 | # model_cls = SVC(C=1000, kernel="linear") # fasttext 57 | 58 | return model_cls 59 | 60 | def model_train(self, model_cls, X_train, y_train): 61 | """ 62 | 分类器模型的训练 63 | :param model_cls: 所使用的算法的类的定义,尚未训练的模型 64 | :param X_train: 65 | :param y_train: 66 | :return: 训练好的模型 67 | """ 68 | model_cls.fit(X_train, y_train) 69 | # if self.algorithm_name in {"svm", "knn"}: 70 | # print(model_cls.best_params_) 71 | return model_cls # model 72 | 73 | def model_save(self, model): 74 | """ 75 | 分类器模型的保存 76 | :param model: 训练好的模型对象 77 | :return: None 78 | """ 79 | joblib.dump(model, self.model_path) 80 | 81 | def model_evaluate(self, model, X_val, y_val): 82 | """ 83 | 分类器模型的评估 84 | :param model: 训练好的模型对象 85 | :param X_val: 86 | :param y_val: 87 | :return: None 88 | """ 89 | # model = joblib.load(self.model_path) 90 | y_val = list(y_val) 91 | correct = 0 92 | y_predict = model.predict(X_val) 93 | print(f"len(y_predict): {len(y_predict)}, len(y_val): {len(y_val)}") 94 | assert len(y_predict) == len(y_val), "Unexpected Error: len(y_predict) != len(y_val), but it should be" 95 | for idx in range(len(y_predict)): 96 | if int(y_predict[idx]) == int(y_val[idx]): 97 | correct += 1 98 | score = correct / len(y_predict) 99 | print(f"'{self.algorithm_name}' Classification Accuray:{score*100:.2f}%") 100 | 101 | def model_predict(self, model, preprocess_obj): 102 | """ 103 | 模型测试 104 | :param model: 训练好的模型对象 105 | :param preprocess_obj: Preprocessing类对象 106 | :return: None 107 | """ 108 | sentence = "这 真的是 一部 非常 优秀 电影 作品" 109 | sent_vec = np.array(preprocess_obj.gen_sentence_vec(sentence)).reshape(1, -1) # shape: (1, 1000) 110 | # print(f"sent_vec: {sent_vec.tolist()}") 111 | if preprocess_obj.sentence_vec_type == "concatenate": 112 | # NOTE: 注意,这里的dtype是必须的,否则dtype默认值是'int32', 词向量所有的数值会被全部转换为0 113 | sent_vec = sequence.pad_sequences(sent_vec, maxlen=preprocess_obj.MAX_SENT_LEN * preprocess_obj.vector_size, 114 | value=0, dtype=np.float) 115 | # print(f"sent_vec: {sent_vec.tolist()}") 116 | print(f"'{sentence}': {model.predict(sent_vec)}") # 0: 正向 117 | 118 | sentence = "这个 电视 真 尼玛 垃圾 , 老子 再也 不买 了" 119 | sent_vec = np.array(preprocess_obj.gen_sentence_vec(sentence)).reshape(1, -1) 120 | # print(f"sent_vec: {sent_vec.tolist()}") 121 | if preprocess_obj.sentence_vec_type == "concatenate": 122 | sent_vec = sequence.pad_sequences(sent_vec, maxlen=preprocess_obj.MAX_SENT_LEN * preprocess_obj.vector_size, 123 | value=0, dtype=np.float) 124 | # print(f"sent_vec: {sent_vec.tolist()}") 125 | print(f"'{sentence}': {model.predict(sent_vec)}") # 1: 负向 126 | 127 | sentence_df = pd.read_csv("../data/input/validation_set.txt", sep="\t", header=None, names=["label", "sentence"]) 128 | sentence_df = sentence_df.sample(frac=1) 129 | sentence_series = sentence_df["sentence"] 130 | label_series = sentence_df["label"] 131 | print(f"label_series: {label_series.iloc[:11]}") 132 | count = 0 133 | for sentence in sentence_series: 134 | count += 1 135 | sentence = sentence.strip() 136 | sent_vec = np.array(preprocess_obj.gen_sentence_vec(sentence)).reshape(1, -1) 137 | # 
print(f"sent_vec: {sent_vec.tolist()}") 138 | if preprocess_obj.sentence_vec_type == "concatenate": 139 | sent_vec = sequence.pad_sequences(sent_vec, maxlen=preprocess_obj.MAX_SENT_LEN * preprocess_obj.vector_size, 140 | value=0, dtype=np.float) 141 | # print(f"sent_vec: {sent_vec.tolist()}") 142 | print(f"'{sentence}': {model.predict(sent_vec)}") # 0: 正向, 1: 负向 143 | if count > 10: 144 | break 145 | 146 | 147 | if __name__ == "__main__": 148 | start_time = time.time() 149 | preprocess_obj = Preprocessing() 150 | 151 | sent_vec_type_list = ["avg", "fasttext", "concatenate"] 152 | sent_vec_type = sent_vec_type_list[0] 153 | print(f"\n{sent_vec_type} and", end=" ") 154 | preprocess_obj.set_sent_vec_type(sent_vec_type) 155 | 156 | X_train, X_val, y_train, y_val = preprocess_obj.gen_train_val_data() 157 | # print(X_train.shape, y_train.shape) # (19998, 100) (19998,) 158 | # print(X_val.shape, y_val.shape) # (5998, 100) (5998,) 159 | 160 | sent_analyse = SentimentAnalysis(sent_vec_type) 161 | algorithm_list = ["nb", "dt", "knn", "svm"] 162 | algorithm_name = algorithm_list[2] 163 | print(f"{algorithm_name}:") 164 | sent_analyse.pick_algorithm(algorithm_name, sent_vec_type) 165 | model_cls = sent_analyse.model_build() 166 | model = sent_analyse.model_train(model_cls, X_train, y_train) 167 | sent_analyse.model_save(model) 168 | """ 169 | """ 170 | model = joblib.load(sent_analyse.model_path) 171 | sent_analyse.model_evaluate(model, X_val, y_val) 172 | sent_analyse.model_predict(model, preprocess_obj) 173 | end_time = time.time() 174 | print(f"\nProgram Running Cost {end_time -start_time:.2f}s") 175 | 176 | --------------------------------------------------------------------------------