├── .gitattributes
├── .gitignore
├── README.md
├── TextClassification
│   ├── DataPreprocess.py
│   ├── TextClassification.py
│   ├── __init__.py
│   ├── data
│   │   ├── data_multiple.json
│   │   └── data_single.csv
│   ├── load_data.py
│   └── net.py
├── demo_multiple.py
├── demo_single.py
└── picture
    ├── data_multiple.png
    └── data_single.png

/.gitattributes:
--------------------------------------------------------------------------------
# Auto detect text files and perform LF normalization
* text=auto

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.idea/machine-learning.iml
.idea/misc.xml
.idea/vcs.xml
.idea/modules.xml
.idea/workspace.xml
.idea/NLP.iml
.idea/Text-Classification.iml

try.py
__pycache__/sentence_transform.cpython-35.pyc
__pycache__/__init__.cpython-35.pyc
*.pyc
data/demo_score/智能电视_电商_正负面.zip
data/demo_score/电商_智能电视_正面帖_2500条.xlsx
data/demo_score/电商_智能电视_负面帖_2500条.xlsx
data/demo_score/data.xlsx
data/demo_score/~$data.xlsx
data/demo_score/~$predict.xlsx
data/demo_score/predict.xlsx
demo/demo_score - 副本.py
demo_score - 副本.py
testdata/6000document/20180125/predict_v1.xlsx
testdata/6000document/20180125/test_v1.xlsx
testdata/6000document/20180126/predict_v1.xlsx
testdata/6000document/20180126/test_v1.xlsx
testdata/6000document/20180126/update_v1.xlsx
testdata/6000document/20180126/~$update_v1.xlsx
test/6000条原帖.20180125.xlsx
creat_data/config.py
creat_data/try - 副本.py
*.xml

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Text-Classification
[![](https://img.shields.io/badge/Python-3.6-blue.svg)](https://www.python.org/)
[![](https://img.shields.io/badge/pandas-0.21.0-brightgreen.svg)](https://pypi.python.org/pypi/pandas/0.21.0)
[![](https://img.shields.io/badge/numpy-1.13.1-brightgreen.svg)](https://pypi.python.org/pypi/numpy/1.13.1)
[![](https://img.shields.io/badge/jieba-0.39-brightgreen.svg)](https://pypi.python.org/pypi/jieba/0.39)
[![](https://img.shields.io/badge/Keras-2.2.4-brightgreen.svg)](https://pypi.python.org/pypi/Keras/2.2.4)


## Introduction
Train on text that already has labels, then classify new text.

## Updates
2019.3.25: The project started as a public-opinion analysis job at my company, and some small features were added later while taking part in competitions. At the time I just wanted to bundle a few simple machine-learning and deep-learning models together as an engineering exercise. After chatting with other users I decided there was no need for a general-purpose module (nobody was using it anyway, haha~). Having some free time recently, and in the spirit of "the simpler the better", I deleted the flashy but rarely used parameters and functions and kept only the preprocessing and the convolutional network.

## Loading the data: load_data
**Two datasets are provided: 4,000+ single-label e-commerce reviews and 15,000+ multi-label judicial-charge records. The data is for academic research only; commercial distribution is prohibited.**
* The single-label e-commerce dataset (about 4,000 reviews, .csv) comes from real e-commerce comments and has two fields, 'evaluation' and 'label', holding the user review and its positive/negative label. Read it with pandas; it loads as a DataFrame.
* The multi-label judicial-charge dataset (about 15,000 records, .json) comes from the 2018 Chinese AI and Law Challenge (CAIL2018) and has two fields, 'fact' and 'accusation', holding the statement of facts and the charges; it loads as a list.
``` python
from TextClassification.load_data import load_data

# single-label
data = load_data('single')
x = data['evaluation']
y = [[i] for i in data['label']]

# multi-label
data = load_data('multiple')
x = [i['fact'] for i in data]
y = [i['accusation'] for i in data]
```
![](https://github.com/renjunxiang/Text-Classification/blob/master/picture/data_single.png)
![](https://github.com/renjunxiang/Text-Classification/blob/master/picture/data_multiple.png)

## Text preprocessing: DataPreprocess.py
**Preprocesses the raw text: word segmentation, token-to-index encoding, and padding to a uniform length. Already wrapped inside TextClassification.py.**

``` python
preprocess = DataPreprocess()

# process the texts
texts_cut = preprocess.cut_texts(texts, word_len)
preprocess.train_tokenizer(texts_cut, num_words)
texts_seq = preprocess.text2seq(texts_cut, sentence_len)

# build the labels
preprocess.creat_label_set(labels)
labels = preprocess.creat_labels(labels)
```

## Training and prediction: TextClassification.py
**Ties together preprocessing, network training and prediction; see the two demo scripts for usage.**

**Methods:**
* fit: takes raw text and labels; pass an existing model to continue training from it, or leave model=None to train from scratch;
* predict: takes raw text and returns the predicted label probabilities;

``` python
from TextClassification import TextClassification

clf = TextClassification()
texts_seq, texts_labels = clf.get_preprocess(x_train, y_train,
                                             word_len=1,
                                             num_words=2000,
                                             sentence_len=50)
clf.fit(texts_seq=texts_seq,
        texts_labels=texts_labels,
        output_type=data_type,
        epochs=10,
        batch_size=64,
        model=None)

# save the whole module, preprocessing and network included
with open('./%s.pkl' % data_type, 'wb') as f:
    pickle.dump(clf, f)

# load the model saved above
with open('./%s.pkl' % data_type, 'rb') as f:
    clf = pickle.load(f)
y_predict = clf.predict(x_test)
y_predict = [[clf.preprocess.label_set[i.argmax()]] for i in y_predict]
score = sum(y_predict == np.array(y_test)) / len(y_test)
print(score)  # 0.9288
```
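
For the multi-label data, the sigmoid outputs are decoded with label2tag: every tag scoring above 0.5 is kept, and when none pass the threshold the single top-scoring tag is used instead (this is what demo_multiple.py does). A minimal sketch — the label set and scores below are made up for illustration:

``` python
import numpy as np
from TextClassification import TextClassification

clf = TextClassification()
labelset = np.array(['theft', 'fraud', 'assault'])
predictions = np.array([[0.9, 0.7, 0.1],   # two tags above 0.5 -> both kept
                        [0.4, 0.3, 0.2]])  # none above 0.5 -> top tag kept
print(clf.label2tag(predictions, labelset))
# [['theft', 'fraud'], ['theft']]
```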

--------------------------------------------------------------------------------
/TextClassification/DataPreprocess.py:
--------------------------------------------------------------------------------
import jieba
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

jieba.setLogLevel('WARN')


class DataPreprocess():
    def __init__(self, tokenizer=None,
                 label_set=None):
        self.tokenizer = tokenizer
        self.num_words = None
        self.label_set = label_set
        self.sentence_len = None
        self.word_len = None

    def cut_texts(self, texts=None, word_len=1):
        """
        Segment texts with jieba.
        :param texts: list of raw texts
        :param word_len: keep only words at least this long
        :return: list of token lists
        """
        if word_len > 1:
            texts_cut = [[word for word in jieba.lcut(text) if len(word) >= word_len] for text in texts]
        else:
            texts_cut = [jieba.lcut(one_text) for one_text in texts]

        self.word_len = word_len

        return texts_cut

    def train_tokenizer(self,
                        texts_cut=None,
                        num_words=2000):
        """
        Build the token-to-index dictionary.
        :param texts_cut: list of token lists
        :param num_words: keep this many tokens, ranked by frequency
        :return:
        """
        tokenizer = Tokenizer(num_words=num_words)
        tokenizer.fit_on_texts(texts=texts_cut)
        num_words = min(num_words, len(tokenizer.word_index) + 1)
        self.tokenizer = tokenizer
        self.num_words = num_words

    def text2seq(self,
                 texts_cut,
                 sentence_len=30):
        """
        Convert texts to index sequences for the network's embedding layer.
        :param texts_cut: list of token lists
        :param sentence_len: pad or truncate every sequence to this length
        :return: padded sequence array
        """
        tokenizer = self.tokenizer
        texts_seq = tokenizer.texts_to_sequences(texts=texts_cut)
        del texts_cut

        texts_pad_seq = pad_sequences(texts_seq,
                                      maxlen=sentence_len,
                                      padding='post',
                                      truncating='post')
        self.sentence_len = sentence_len
        return texts_pad_seq

    def creat_label_set(self, labels):
        '''
        Collect the set of all labels, used for one-hot encoding.
        :param labels: raw label lists
        :return:
        '''
        label_set = set()
        for i in labels:
            label_set = label_set.union(set(i))

        self.label_set = np.array(list(label_set))

    def creat_label(self, label):
        '''
        One-hot encode one sample's labels.
        :param label: raw labels of one sample
        :return: one-hot array
        eg. creat_label(label=data_valid_accusations[12])
        '''
        label_set = self.label_set
        label_zero = np.zeros(len(label_set))
        label_zero[np.in1d(label_set, label)] = 1
        return label_zero

    def creat_labels(self, labels=None):
        '''
        Apply creat_label to every sample to build a 2-D one-hot array.
        :param labels: raw label lists
        :return:
        '''
        labels_one_hot = [self.creat_label(label) for label in labels]

        return np.array(labels_one_hot)
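
A minimal sketch (not from the repo) of how the one-hot helpers above behave, using a hypothetical two-label set fixed by hand so the output is deterministic:

``` python
import numpy as np
from TextClassification.DataPreprocess import DataPreprocess

# hypothetical label set, normally built by creat_label_set
preprocess = DataPreprocess(label_set=np.array(['negative', 'positive']))
print(preprocess.creat_label(['positive']))
# [0. 1.]
print(preprocess.creat_labels([['negative'], ['positive']]))
# [[1. 0.]
#  [0. 1.]]
```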

--------------------------------------------------------------------------------
/TextClassification/TextClassification.py:
--------------------------------------------------------------------------------
from .DataPreprocess import DataPreprocess
from .net import CNN


class TextClassification():
    def __init__(self):
        self.preprocess = None
        self.model = None

    def get_preprocess(self, texts, labels, word_len=1, num_words=2000, sentence_len=30):
        # preprocess the data
        preprocess = DataPreprocess()

        # process the texts
        texts_cut = preprocess.cut_texts(texts, word_len)
        preprocess.train_tokenizer(texts_cut, num_words)
        texts_seq = preprocess.text2seq(texts_cut, sentence_len)

        # build the labels
        preprocess.creat_label_set(labels)
        labels = preprocess.creat_labels(labels)
        self.preprocess = preprocess

        return texts_seq, labels

    def fit(self, texts_seq, texts_labels, output_type, epochs, batch_size, model=None):
        # build a fresh CNN unless an existing model was passed in
        if model is None:
            preprocess = self.preprocess
            model = CNN(preprocess.num_words,
                        preprocess.sentence_len,
                        128,
                        len(preprocess.label_set),
                        output_type)
        # train the network
        model.fit(texts_seq,
                  texts_labels,
                  epochs=epochs,
                  batch_size=batch_size)
        self.model = model

    def predict(self, texts):
        preprocess = self.preprocess
        word_len = preprocess.word_len
        sentence_len = preprocess.sentence_len

        # preprocess the texts with the fitted tokenizer
        texts_cut = preprocess.cut_texts(texts, word_len)
        texts_seq = preprocess.text2seq(texts_cut, sentence_len)

        return self.model.predict(texts_seq)

    def label2toptag(self, predictions, labelset):
        # keep only the highest-scoring tag of each sample
        labels = []
        for prediction in predictions:
            label = labelset[prediction == prediction.max()]
            labels.append(label.tolist())
        return labels

    def label2half(self, predictions, labelset):
        # keep every tag scoring above 0.5
        labels = []
        for prediction in predictions:
            label = labelset[prediction > 0.5]
            labels.append(label.tolist())
        return labels

    def label2tag(self, predictions, labelset):
        # prefer the tags above 0.5; fall back to the top tag when none qualify
        labels1 = self.label2toptag(predictions, labelset)
        labels2 = self.label2half(predictions, labelset)
        labels = []
        for i in range(len(predictions)):
            if len(labels2[i]) == 0:
                labels.append(labels1[i])
            else:
                labels.append(labels2[i])
        return labels

--------------------------------------------------------------------------------
/TextClassification/__init__.py:
--------------------------------------------------------------------------------
from .DataPreprocess import DataPreprocess
from .net import CNN
from .TextClassification import TextClassification
from .load_data import load_data

--------------------------------------------------------------------------------
/TextClassification/load_data.py:
--------------------------------------------------------------------------------
import json
import os

import pandas as pd

localpath = os.path.dirname(__file__)


def load_data(data_type='single'):
    if data_type == 'single':
        data = pd.read_csv(localpath + "/data/data_single.csv", encoding='utf8')
    elif data_type == 'multiple':
        with open(localpath + '/data/data_multiple.json', mode='r', encoding='utf8') as f:
            data_raw = f.readlines()
        data = [json.loads(i) for i in data_raw]
    else:
        raise ValueError("data_type should be 'single' or 'multiple'")
    return data

--------------------------------------------------------------------------------
/TextClassification/net.py:
--------------------------------------------------------------------------------
from keras.models import Model
from keras.layers import Dense, Embedding, Input
from keras.layers import Conv1D, GlobalMaxPool1D, Dropout


def CNN(input_dim,
        input_length,
        vec_size,
        output_shape,
        output_type='multiple'):
    '''
    Create a CNN: Embedding + Conv1D + GlobalMaxPool1D + Dense.
    Change the filter count and dropout rate in the code if needed.

    :param input_dim: size of the vocabulary
    :param input_length: length of the input sequences
    :param vec_size: dimension of the dense embedding
    :param output_shape: target shape; targets should be one-hot encoded
    :param output_type: last-layer type, multiple (activation="sigmoid") or single (activation="softmax")
    :return: compiled keras model
    '''
    data_input = Input(shape=[input_length])
    word_vec = Embedding(input_dim=input_dim + 1,
                         input_length=input_length,
                         output_dim=vec_size)(data_input)
    x = Conv1D(filters=128,
               kernel_size=[3],
               strides=1,
               padding='same',
               activation='relu')(word_vec)
    x = GlobalMaxPool1D()(x)
    x = Dense(500, activation='relu')(x)
    x = Dropout(0.1)(x)
    if output_type == 'multiple':
        # independent per-label probabilities for multi-label targets
        x = Dense(output_shape, activation='sigmoid')(x)
        model = Model(inputs=data_input, outputs=x)
        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=['acc'])
    elif output_type == 'single':
        # mutually exclusive classes for single-label targets
        x = Dense(output_shape, activation='softmax')(x)
        model = Model(inputs=data_input, outputs=x)
        model.compile(loss='categorical_crossentropy',
                      optimizer='adam',
                      metrics=['acc'])
    else:
        raise ValueError('output_type should be multiple or single')
    return model


if __name__ == '__main__':
    model = CNN(input_dim=10, input_length=10, vec_size=10, output_shape=10, output_type='multiple')
    model.summary()
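
Not part of the repo: a quick sanity check that the network's input and output shapes line up, using random data and made-up dimensions:

``` python
import numpy as np
from TextClassification.net import CNN

# made-up sizes: 100-token vocabulary, length-20 sequences, 5 classes
model = CNN(input_dim=100, input_length=20, vec_size=16, output_shape=5, output_type='single')

x = np.random.randint(0, 101, size=(32, 20))     # 32 fake index sequences
y = np.eye(5)[np.random.randint(0, 5, size=32)]  # 32 fake one-hot targets
model.fit(x, y, epochs=1, batch_size=8)          # (32, 20) in -> (32, 5) out
```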

--------------------------------------------------------------------------------
/demo_multiple.py:
--------------------------------------------------------------------------------
from TextClassification import load_data
from TextClassification import TextClassification
from sklearn.model_selection import train_test_split
import tensorflow as tf
import pickle

sess = tf.InteractiveSession()

# load the data
data_type = 'multiple'
data = load_data(data_type)
x = [i['fact'] for i in data]
y = [i['accusation'] for i in data]

# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

##### training #####

clf = TextClassification()
texts_seq, texts_labels = clf.get_preprocess(x_train, y_train, word_len=1, num_words=2000, sentence_len=50)
clf.fit(texts_seq, texts_labels, data_type, 3, 64)

# save the whole module, preprocessing and network included
with open('./%s.pkl' % data_type, 'wb') as f:
    pickle.dump(clf, f)

##### prediction #####

# load the model saved above
with open('./%s.pkl' % data_type, 'rb') as f:
    clf = pickle.load(f)
y_predict = clf.predict(x_test)
y_predict = clf.label2tag(y_predict, clf.preprocess.label_set)
score = sum([y_predict[i] == y_test[i] for i in range(len(y_predict))]) / len(y_predict)
print(score)  # 0.2136

--------------------------------------------------------------------------------
/demo_single.py:
--------------------------------------------------------------------------------
from TextClassification import load_data
from TextClassification import TextClassification
from sklearn.model_selection import train_test_split
import tensorflow as tf
import pickle
import numpy as np

sess = tf.InteractiveSession()

# load the data
data_type = 'single'
data = load_data(data_type)
x = data['evaluation']
y = [[i] for i in data['label']]

# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

##### training #####

clf = TextClassification()
texts_seq, texts_labels = clf.get_preprocess(x_train, y_train,
                                             word_len=1,
                                             num_words=2000,
                                             sentence_len=50)
clf.fit(texts_seq=texts_seq,
        texts_labels=texts_labels,
        output_type=data_type,
        epochs=10,
        batch_size=64,
        model=None)

# save the whole module, preprocessing and network included
with open('./%s.pkl' % data_type, 'wb') as f:
    pickle.dump(clf, f)

##### prediction #####

# load the model saved above
with open('./%s.pkl' % data_type, 'rb') as f:
    clf = pickle.load(f)
y_predict = clf.predict(x_test)
y_predict = [[clf.preprocess.label_set[i.argmax()]] for i in y_predict]
score = sum(y_predict == np.array(y_test)) / len(y_test)
print(score)  # 0.9288

--------------------------------------------------------------------------------
/picture/data_multiple.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Text-Classification/6ec5483623f4060f62d955795d5e2791635ca512/picture/data_multiple.png

--------------------------------------------------------------------------------
/picture/data_single.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Text-Classification/6ec5483623f4060f62d955795d5e2791635ca512/picture/data_single.png
--------------------------------------------------------------------------------