├── .gitattributes
├── .gitignore
├── README.md
├── TextClassification
│   ├── DataPreprocess.py
│   ├── TextClassification.py
│   ├── __init__.py
│   ├── data
│   │   ├── data_multiple.json
│   │   └── data_single.csv
│   ├── load_data.py
│   └── net.py
├── demo_multiple.py
├── demo_single.py
└── picture
    ├── data_multiple.png
    └── data_single.png

/.gitattributes:
--------------------------------------------------------------------------------
# Auto detect text files and perform LF normalization
* text=auto

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.idea/machine-learning.iml
.idea/misc.xml
.idea/vcs.xml
.idea/modules.xml
.idea/workspace.xml
.idea/NLP.iml
.idea/Text-Classification.iml

try.py
__pycache__/sentence_transform.cpython-35.pyc
__pycache__/__init__.cpython-35.pyc
*.pyc
data/demo_score/智能电视_电商_正负面.zip
data/demo_score/电商_智能电视_正面帖_2500条.xlsx
data/demo_score/电商_智能电视_负面帖_2500条.xlsx
data/demo_score/data.xlsx
data/demo_score/~$data.xlsx
data/demo_score/~$predict.xlsx
data/demo_score/predict.xlsx
demo/demo_score - 副本.py
demo_score - 副本.py
testdata/6000document/20180125/predict_v1.xlsx
testdata/6000document/20180125/test_v1.xlsx
testdata/6000document/20180126/predict_v1.xlsx
testdata/6000document/20180126/test_v1.xlsx
testdata/6000document/20180126/update_v1.xlsx
testdata/6000document/20180126/~$update_v1.xlsx
test/6000条原帖.20180125.xlsx
creat_data/config.py
creat_data/try - 副本.py
*.xml

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Text-Classification
[![](https://img.shields.io/badge/Python-3.6-blue.svg)](https://www.python.org/)
[![](https://img.shields.io/badge/pandas-0.21.0-brightgreen.svg)](https://pypi.python.org/pypi/pandas/0.21.0)
[![](https://img.shields.io/badge/numpy-1.13.1-brightgreen.svg)](https://pypi.python.org/pypi/numpy/1.13.1)
[![](https://img.shields.io/badge/jieba-0.39-brightgreen.svg)](https://pypi.python.org/pypi/jieba/0.39)
[![](https://img.shields.io/badge/Keras-2.2.4-brightgreen.svg)](https://pypi.python.org/pypi/Keras/2.2.4)


## Introduction
Train on text that already has labels, then classify new text.

## Updates
2019.3.25: The project started as a public-opinion analysis job at my company, and some small features were added later while taking part in competitions. At the time I just wanted to bundle a few simple machine-learning and deep-learning models together as an engineering exercise. After chatting with other users I decided there was no need for a general-purpose module (nobody was using it anyway, haha~). Having some free time recently, and in the spirit of "the simpler the better", I deleted the flashy but rarely used parameters and functions and kept only the preprocessing and the convolutional network.

## Loading the data: load_data
**Two datasets are provided: 4,000+ single-label e-commerce reviews and 15,000+ multi-label judicial-charge records. The data is for academic research only; commercial distribution is prohibited.**
* The single-label e-commerce dataset (about 4,000 reviews, .csv) comes from real e-commerce comments and has two fields, 'evaluation' and 'label', holding the user review and its positive/negative label. Read it with pandas; it loads as a DataFrame.
* The multi-label judicial-charge dataset (about 15,000 records, .json) comes from the 2018 Chinese AI and Law Challenge (CAIL2018) and has two fields, 'fact' and 'accusation', holding the statement of facts and the charges; it loads as a list.
``` python
from TextClassification.load_data import load_data

# single-label
data = load_data('single')
x = data['evaluation']
y = [[i] for i in data['label']]

# multi-label
data = load_data('multiple')
x = [i['fact'] for i in data]
y = [i['accusation'] for i in data]
```
![](https://github.com/renjunxiang/Text-Classification/blob/master/picture/data_single.png)
![](https://github.com/renjunxiang/Text-Classification/blob/master/picture/data_multiple.png)

## Text preprocessing: DataPreprocess.py
**Preprocesses the raw text: word segmentation, token-to-index encoding, and padding to a uniform length. Already wrapped inside TextClassification.py.**

``` python
preprocess = DataPreprocess()

# process the texts
texts_cut = preprocess.cut_texts(texts, word_len)
preprocess.train_tokenizer(texts_cut, num_words)
texts_seq = preprocess.text2seq(texts_cut, sentence_len)

# build the labels
preprocess.creat_label_set(labels)
labels = preprocess.creat_labels(labels)
```

## Training and prediction: TextClassification.py
**Ties together preprocessing, network training and prediction; see the two demo scripts for usage.**

**Methods:**
* fit: takes raw text and labels; pass an existing model to continue training from it, or leave model=None to train from scratch;
* predict: takes raw text and returns the predicted label probabilities;

``` python
from TextClassification import TextClassification

clf = TextClassification()
texts_seq, texts_labels = clf.get_preprocess(x_train, y_train,
                                             word_len=1,
                                             num_words=2000,
                                             sentence_len=50)
clf.fit(texts_seq=texts_seq,
        texts_labels=texts_labels,
        output_type=data_type,
        epochs=10,
        batch_size=64,
        model=None)

# save the whole module, preprocessing and network included
with open('./%s.pkl' % data_type, 'wb') as f:
    pickle.dump(clf, f)

# load the model saved above
with open('./%s.pkl' % data_type, 'rb') as f:
    clf = pickle.load(f)
y_predict = clf.predict(x_test)
y_predict = [[clf.preprocess.label_set[i.argmax()]] for i in y_predict]
score = sum(y_predict == np.array(y_test)) / len(y_test)
print(score)  # 0.9288
```
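
For the multi-label data, the sigmoid outputs are decoded with label2tag: every tag scoring above 0.5 is kept, and when none pass the threshold the single top-scoring tag is used instead (this is what demo_multiple.py does). A minimal sketch — the label set and scores below are made up for illustration:

``` python
import numpy as np
from TextClassification import TextClassification

clf = TextClassification()
labelset = np.array(['theft', 'fraud', 'assault'])
predictions = np.array([[0.9, 0.7, 0.1],   # two tags above 0.5 -> both kept
                        [0.4, 0.3, 0.2]])  # none above 0.5 -> top tag kept
print(clf.label2tag(predictions, labelset))
# [['theft', 'fraud'], ['theft']]
```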

--------------------------------------------------------------------------------
/TextClassification/DataPreprocess.py:
--------------------------------------------------------------------------------
import jieba
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

jieba.setLogLevel('WARN')


class DataPreprocess():
    def __init__(self, tokenizer=None,
                 label_set=None):
        self.tokenizer = tokenizer
        self.num_words = None
        self.label_set = label_set
        self.sentence_len = None
        self.word_len = None

    def cut_texts(self, texts=None, word_len=1):
        """
        Segment texts with jieba.
        :param texts: list of raw texts
        :param word_len: keep only words at least this long
        :return: list of token lists
        """
        if word_len > 1:
            texts_cut = [[word for word in jieba.lcut(text) if len(word) >= word_len] for text in texts]
        else:
            texts_cut = [jieba.lcut(one_text) for one_text in texts]

        self.word_len = word_len

        return texts_cut

    def train_tokenizer(self,
                        texts_cut=None,
                        num_words=2000):
        """
        Build the token-to-index dictionary.
        :param texts_cut: list of token lists
        :param num_words: keep this many tokens, ranked by frequency
        :return:
        """
        tokenizer = Tokenizer(num_words=num_words)
        tokenizer.fit_on_texts(texts=texts_cut)
        num_words = min(num_words, len(tokenizer.word_index) + 1)
        self.tokenizer = tokenizer
        self.num_words = num_words

    def text2seq(self,
                 texts_cut,
                 sentence_len=30):
        """
        Convert texts to index sequences for the network's embedding layer.
        :param texts_cut: list of token lists
        :param sentence_len: pad or truncate every sequence to this length
        :return: padded sequence array
        """
        tokenizer = self.tokenizer
        texts_seq = tokenizer.texts_to_sequences(texts=texts_cut)
        del texts_cut

        texts_pad_seq = pad_sequences(texts_seq,
                                      maxlen=sentence_len,
                                      padding='post',
                                      truncating='post')
        self.sentence_len = sentence_len
        return texts_pad_seq

    def creat_label_set(self, labels):
        '''
        Collect the set of all labels, used for one-hot encoding.
        :param labels: raw label lists
        :return:
        '''
        label_set = set()
        for i in labels:
            label_set = label_set.union(set(i))

        self.label_set = np.array(list(label_set))

    def creat_label(self, label):
        '''
        One-hot encode one sample's labels.
        :param label: raw labels of one sample
        :return: one-hot array
        eg. creat_label(label=data_valid_accusations[12])
        '''
        label_set = self.label_set
        label_zero = np.zeros(len(label_set))
        label_zero[np.in1d(label_set, label)] = 1
        return label_zero

    def creat_labels(self, labels=None):
        '''
        Apply creat_label to every sample to build a 2-D one-hot array.
        :param labels: raw label lists
        :return:
        '''
        labels_one_hot = [self.creat_label(label) for label in labels]

        return np.array(labels_one_hot)
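
A minimal sketch (not from the repo) of how the one-hot helpers above behave, using a hypothetical two-label set fixed by hand so the output is deterministic:

``` python
import numpy as np
from TextClassification.DataPreprocess import DataPreprocess

# hypothetical label set, normally built by creat_label_set
preprocess = DataPreprocess(label_set=np.array(['negative', 'positive']))
print(preprocess.creat_label(['positive']))
# [0. 1.]
print(preprocess.creat_labels([['negative'], ['positive']]))
# [[1. 0.]
#  [0. 1.]]
```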

--------------------------------------------------------------------------------
/TextClassification/TextClassification.py:
--------------------------------------------------------------------------------
from .DataPreprocess import DataPreprocess
from .net import CNN


class TextClassification():
    def __init__(self):
        self.preprocess = None
        self.model = None

    def get_preprocess(self, texts, labels, word_len=1, num_words=2000, sentence_len=30):
        # preprocess the data
        preprocess = DataPreprocess()

        # process the texts
        texts_cut = preprocess.cut_texts(texts, word_len)
        preprocess.train_tokenizer(texts_cut, num_words)
        texts_seq = preprocess.text2seq(texts_cut, sentence_len)

        # build the labels
        preprocess.creat_label_set(labels)
        labels = preprocess.creat_labels(labels)
        self.preprocess = preprocess

        return texts_seq, labels

    def fit(self, texts_seq, texts_labels, output_type, epochs, batch_size, model=None):
        # build a fresh CNN unless an existing model was passed in
        if model is None:
            preprocess = self.preprocess
            model = CNN(preprocess.num_words,
                        preprocess.sentence_len,
                        128,
                        len(preprocess.label_set),
                        output_type)
        # train the network
        model.fit(texts_seq,
                  texts_labels,
                  epochs=epochs,
                  batch_size=batch_size)
        self.model = model

    def predict(self, texts):
        preprocess = self.preprocess
        word_len = preprocess.word_len
        sentence_len = preprocess.sentence_len

        # preprocess the texts with the fitted tokenizer
        texts_cut = preprocess.cut_texts(texts, word_len)
        texts_seq = preprocess.text2seq(texts_cut, sentence_len)

        return self.model.predict(texts_seq)

    def label2toptag(self, predictions, labelset):
        # keep only the highest-scoring tag of each sample
        labels = []
        for prediction in predictions:
            label = labelset[prediction == prediction.max()]
            labels.append(label.tolist())
        return labels

    def label2half(self, predictions, labelset):
        # keep every tag scoring above 0.5
        labels = []
        for prediction in predictions:
            label = labelset[prediction > 0.5]
            labels.append(label.tolist())
        return labels

    def label2tag(self, predictions, labelset):
        # prefer the tags above 0.5; fall back to the top tag when none qualify
        labels1 = self.label2toptag(predictions, labelset)
        labels2 = self.label2half(predictions, labelset)
        labels = []
        for i in range(len(predictions)):
            if len(labels2[i]) == 0:
                labels.append(labels1[i])
            else:
                labels.append(labels2[i])
        return labels

--------------------------------------------------------------------------------
/TextClassification/__init__.py:
--------------------------------------------------------------------------------
from .DataPreprocess import DataPreprocess
from .net import CNN
from .TextClassification import TextClassification
from .load_data import load_data

--------------------------------------------------------------------------------
/TextClassification/load_data.py:
--------------------------------------------------------------------------------
import json
import os

import pandas as pd

localpath = os.path.dirname(__file__)


def load_data(data_type='single'):
    if data_type == 'single':
        data = pd.read_csv(localpath + "/data/data_single.csv", encoding='utf8')
    elif data_type == 'multiple':
        with open(localpath + '/data/data_multiple.json', mode='r', encoding='utf8') as f:
            data_raw = f.readlines()
        data = [json.loads(i) for i in data_raw]
    else:
        raise ValueError("data_type should be 'single' or 'multiple'")
    return data

--------------------------------------------------------------------------------
/TextClassification/net.py:
--------------------------------------------------------------------------------
from keras.models import Model
from keras.layers import Dense, Embedding, Input
from keras.layers import Conv1D, GlobalMaxPool1D, Dropout


def CNN(input_dim,
        input_length,
        vec_size,
        output_shape,
        output_type='multiple'):
    '''
    Create a CNN: Embedding + Conv1D + GlobalMaxPool1D + Dense.
    Change the filter count and dropout rate in the code if needed.

    :param input_dim: size of the vocabulary
    :param input_length: length of the input sequences
    :param vec_size: dimension of the dense embedding
    :param output_shape: target shape; targets should be one-hot encoded
    :param output_type: last-layer type, multiple (activation="sigmoid") or single (activation="softmax")
    :return: compiled keras model
    '''
    data_input = Input(shape=[input_length])
    word_vec = Embedding(input_dim=input_dim + 1,
                         input_length=input_length,
                         output_dim=vec_size)(data_input)
    x = Conv1D(filters=128,
               kernel_size=[3],
               strides=1,
               padding='same',
               activation='relu')(word_vec)
    x = GlobalMaxPool1D()(x)
    x = Dense(500, activation='relu')(x)
    x = Dropout(0.1)(x)
    if output_type == 'multiple':
        # independent per-label probabilities for multi-label targets
        x = Dense(output_shape, activation='sigmoid')(x)
        model = Model(inputs=data_input, outputs=x)
        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=['acc'])
    elif output_type == 'single':
        # mutually exclusive classes for single-label targets
        x = Dense(output_shape, activation='softmax')(x)
        model = Model(inputs=data_input, outputs=x)
        model.compile(loss='categorical_crossentropy',
                      optimizer='adam',
                      metrics=['acc'])
    else:
        raise ValueError('output_type should be multiple or single')
    return model


if __name__ == '__main__':
    model = CNN(input_dim=10, input_length=10, vec_size=10, output_shape=10, output_type='multiple')
    model.summary()
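
Not part of the repo: a quick sanity check that the network's input and output shapes line up, using random data and made-up dimensions:

``` python
import numpy as np
from TextClassification.net import CNN

# made-up sizes: 100-token vocabulary, length-20 sequences, 5 classes
model = CNN(input_dim=100, input_length=20, vec_size=16, output_shape=5, output_type='single')

x = np.random.randint(0, 101, size=(32, 20))     # 32 fake index sequences
y = np.eye(5)[np.random.randint(0, 5, size=32)]  # 32 fake one-hot targets
model.fit(x, y, epochs=1, batch_size=8)          # (32, 20) in -> (32, 5) out
```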

--------------------------------------------------------------------------------
/demo_multiple.py:
--------------------------------------------------------------------------------
from TextClassification import load_data
from TextClassification import TextClassification
from sklearn.model_selection import train_test_split
import tensorflow as tf
import pickle

sess = tf.InteractiveSession()

# load the data
data_type = 'multiple'
data = load_data(data_type)
x = [i['fact'] for i in data]
y = [i['accusation'] for i in data]

# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

##### training #####

clf = TextClassification()
texts_seq, texts_labels = clf.get_preprocess(x_train, y_train, word_len=1, num_words=2000, sentence_len=50)
clf.fit(texts_seq, texts_labels, data_type, 3, 64)

# save the whole module, preprocessing and network included
with open('./%s.pkl' % data_type, 'wb') as f:
    pickle.dump(clf, f)

##### prediction #####

# load the model saved above
with open('./%s.pkl' % data_type, 'rb') as f:
    clf = pickle.load(f)
y_predict = clf.predict(x_test)
y_predict = clf.label2tag(y_predict, clf.preprocess.label_set)
score = sum([y_predict[i] == y_test[i] for i in range(len(y_predict))]) / len(y_predict)
print(score)  # 0.2136

--------------------------------------------------------------------------------
/demo_single.py:
--------------------------------------------------------------------------------
from TextClassification import load_data
from TextClassification import TextClassification
from sklearn.model_selection import train_test_split
import tensorflow as tf
import pickle
import numpy as np

sess = tf.InteractiveSession()

# load the data
data_type = 'single'
data = load_data(data_type)
x = data['evaluation']
y = [[i] for i in data['label']]

# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

##### training #####

clf = TextClassification()
texts_seq, texts_labels = clf.get_preprocess(x_train, y_train,
                                             word_len=1,
                                             num_words=2000,
                                             sentence_len=50)
clf.fit(texts_seq=texts_seq,
        texts_labels=texts_labels,
        output_type=data_type,
        epochs=10,
        batch_size=64,
        model=None)

# save the whole module, preprocessing and network included
with open('./%s.pkl' % data_type, 'wb') as f:
    pickle.dump(clf, f)

##### prediction #####

# load the model saved above
with open('./%s.pkl' % data_type, 'rb') as f:
    clf = pickle.load(f)
y_predict = clf.predict(x_test)
y_predict = [[clf.preprocess.label_set[i.argmax()]] for i in y_predict]
score = sum(y_predict == np.array(y_test)) / len(y_test)
print(score)  # 0.9288

--------------------------------------------------------------------------------
/picture/data_multiple.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Text-Classification/6ec5483623f4060f62d955795d5e2791635ca512/picture/data_multiple.png

--------------------------------------------------------------------------------
/picture/data_single.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Text-Classification/6ec5483623f4060f62d955795d5e2791635ca512/picture/data_single.png
--------------------------------------------------------------------------------