├── .gitattributes
├── .gitignore
├── README.md
├── TextClassification
│   ├── DataPreprocess.py
│   ├── TextClassification.py
│   ├── __init__.py
│   ├── data
│   │   ├── data_multiple.json
│   │   └── data_single.csv
│   ├── load_data.py
│   └── net.py
├── demo_multiple.py
├── demo_single.py
└── picture
    ├── data_multiple.png
    └── data_single.png
/.gitattributes:
--------------------------------------------------------------------------------
# Auto detect text files and perform LF normalization
* text=auto
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.idea/machine-learning.iml
.idea/misc.xml
.idea/vcs.xml
.idea/modules.xml
.idea/workspace.xml
.idea/NLP.iml

try.py
__pycache__/sentence_transform.cpython-35.pyc
__pycache__/__init__.cpython-35.pyc
*.pyc
data/demo_score/智能电视_电商_正负面.zip
data/demo_score/电商_智能电视_正面帖_2500条.xlsx
data/demo_score/电商_智能电视_负面帖_2500条.xlsx
data/demo_score/data.xlsx
data/demo_score/~$predict.xlsx
demo/demo_score - 副本.py
data/demo_score/predict.xlsx
testdata/6000document/20180125/predict_v1.xlsx
testdata/6000document/20180125/test_v1.xlsx
testdata/6000document/20180126/predict_v1.xlsx
testdata/6000document/20180126/test_v1.xlsx
testdata/6000document/20180126/update_v1.xlsx
test/6000条原帖.20180125.xlsx
testdata/6000document/20180126/~$update_v1.xlsx

.idea/Text-Classification.iml
data/demo_score/~$data.xlsx
creat_data/config.py
creat_data/try - 副本.py
demo_score - 副本.py
*.xml
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Text-Classification
[Python](https://www.python.org/)
[pandas 0.21.0](https://pypi.python.org/pypi/pandas/0.21.0)
[numpy 1.13.1](https://pypi.python.org/pypi/numpy/1.13.1)
[jieba 0.39](https://pypi.python.org/pypi/jieba/0.39)
[Keras 2.2.4](https://pypi.python.org/pypi/Keras/2.2.4)

## Overview
Trains on labeled text so that new, unseen text can be classified.

## Changelog
2019.3.25: This project began as a public-opinion analysis task at my company, and a few small features were added later for competitions. The original idea was simply to bundle some basic machine-learning and deep-learning models together as an engineering exercise. After talking with other users, I decided a general-purpose module was unnecessary (nobody uses it anyway, haha~). Having some free time recently, and in the spirit of "the simpler the better", I removed the fancy but rarely useful parameters and functions, keeping only the preprocessing and the convolutional network.

## Loading the data: load_data
**Two datasets are bundled: 4,000+ single-label e-commerce reviews and 15,000+ multi-label judicial charge records. The data is for academic research only; commercial distribution is prohibited.**
* The single-label e-commerce dataset (.csv) comes from real e-commerce reviews. It has two fields, 'evaluation' and 'label', holding the review text and its positive/negative label. Reading it with pandas is recommended; it loads as a DataFrame.
* The multi-label judicial dataset (.json) comes from the 2018 Challenge of AI in Law competition (CAIL2018). It has two fields, 'fact' and 'accusation', holding the statement of facts and the charges. It loads as a list.
``` python
from TextClassification.load_data import load_data

# single label
data = load_data('single')
x = data['evaluation']
y = [[i] for i in data['label']]

# multi-label
data = load_data('multiple')
x = [i['fact'] for i in data]
y = [i['accusation'] for i in data]
```
![data_single](picture/data_single.png)
![data_multiple](picture/data_multiple.png)

## Text preprocessing: DataPreprocess.py
**Preprocesses the raw text: word segmentation, index encoding, and padding to a uniform length. Already wrapped inside TextClassification.py.**

``` python
from TextClassification import DataPreprocess

preprocess = DataPreprocess()

# process the texts
texts_cut = preprocess.cut_texts(texts, word_len)
preprocess.train_tokenizer(texts_cut, num_words)
texts_seq = preprocess.text2seq(texts_cut, sentence_len)

# build the labels
preprocess.creat_label_set(labels)
labels = preprocess.creat_labels(labels)
```
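
For example, a minimal end-to-end run (the two sample reviews and their labels below are made up for illustration):

``` python
from TextClassification import DataPreprocess

# two made-up reviews with made-up labels, for illustration only
texts = ['这个手机质量很好,我很喜欢', '物流太慢,包装也破了']
labels = [['正面'], ['负面']]

preprocess = DataPreprocess()
texts_cut = preprocess.cut_texts(texts, word_len=1)    # jieba word lists
preprocess.train_tokenizer(texts_cut, num_words=2000)  # build the vocabulary
texts_seq = preprocess.text2seq(texts_cut, sentence_len=10)
print(texts_seq.shape)  # (2, 10): padded index sequences

preprocess.creat_label_set(labels)
print(preprocess.creat_labels(labels))  # one-hot rows over the label set
```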

## Training and prediction: TextClassification.py
**Wraps preprocessing, network training and prediction; see the two demo scripts for complete examples.**

**Methods:**
* fit: trains on the preprocessed sequences and labels; pass an existing model to continue training it, or model=None to build and train a new network;
* predict: takes raw texts and returns the predicted label probabilities;

``` python
import pickle
import numpy as np
from TextClassification import TextClassification

clf = TextClassification()
texts_seq, texts_labels = clf.get_preprocess(x_train, y_train,
                                             word_len=1,
                                             num_words=2000,
                                             sentence_len=50)
clf.fit(texts_seq=texts_seq,
        texts_labels=texts_labels,
        output_type=data_type,
        epochs=10,
        batch_size=64,
        model=None)

# save the whole module, preprocessing and network included
with open('./%s.pkl' % data_type, 'wb') as f:
    pickle.dump(clf, f)

# load the model saved above
with open('./%s.pkl' % data_type, 'rb') as f:
    clf = pickle.load(f)
y_predict = clf.predict(x_test)
y_predict = [[clf.preprocess.label_set[i.argmax()]] for i in y_predict]
score = sum(y_predict == np.array(y_test)) / len(y_test)
print(score)  # 0.9288
```
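
For multi-label data, decode the probabilities with label2tag instead of argmax, as demo_multiple.py does: every label scoring above 0.5 is kept, falling back to the single best label when none pass the threshold.

``` python
y_predict = clf.predict(x_test)
y_tags = clf.label2tag(y_predict, clf.preprocess.label_set)
```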
--------------------------------------------------------------------------------
/TextClassification/DataPreprocess.py:
--------------------------------------------------------------------------------
import jieba
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

jieba.setLogLevel('WARN')


class DataPreprocess():
    def __init__(self, tokenizer=None,
                 label_set=None):
        self.tokenizer = tokenizer
        self.num_words = None
        self.label_set = label_set
        self.sentence_len = None
        self.word_len = None

    def cut_texts(self, texts=None, word_len=1):
        """
        Segment the texts into words.
        :param texts: list of texts
        :param word_len: minimum word length to keep
        :return: list of word lists
        """
        if word_len > 1:
            texts_cut = [[word for word in jieba.lcut(text) if len(word) >= word_len] for text in texts]
        else:
            texts_cut = [jieba.lcut(one_text) for one_text in texts]

        self.word_len = word_len

        return texts_cut

    def train_tokenizer(self,
                        texts_cut=None,
                        num_words=2000):
        """
        Build the word-to-index dictionary.
        :param texts_cut: list of segmented texts
        :param num_words: number of words to keep, ranked by frequency
        :return:
        """
        tokenizer = Tokenizer(num_words=num_words)
        tokenizer.fit_on_texts(texts=texts_cut)
        num_words = min(num_words, len(tokenizer.word_index) + 1)
        self.tokenizer = tokenizer
        self.num_words = num_words

    def text2seq(self,
                 texts_cut,
                 sentence_len=30):
        """
        Convert texts to index sequences for the network's embedding layer.
        :param texts_cut: list of segmented texts
        :param sentence_len: sequence length to pad/truncate to
        :return: sequence list
        """
        tokenizer = self.tokenizer
        texts_seq = tokenizer.texts_to_sequences(texts=texts_cut)
        del texts_cut  # free the segmented texts early

        texts_pad_seq = pad_sequences(texts_seq,
                                      maxlen=sentence_len,
                                      padding='post',
                                      truncating='post')
        self.sentence_len = sentence_len
        return texts_pad_seq

    def creat_label_set(self, labels):
        '''
        Collect the set of all labels, used for one-hot encoding.
        :param labels: raw label lists
        :return:
        '''
        label_set = set()
        for i in labels:
            label_set = label_set.union(set(i))

        self.label_set = np.array(list(label_set))

    def creat_label(self, label):
        '''
        One-hot encode the labels of a single sample.
        :param label: raw labels of one sample
        :return: one-hot array over the label set
        '''
        label_set = self.label_set
        label_zero = np.zeros(len(label_set))
        label_zero[np.in1d(label_set, label)] = 1
        return label_zero

    def creat_labels(self, labels=None):
        '''
        Apply creat_label to every sample to build the one-hot 2D array.
        :param labels: raw label lists
        :return: 2D one-hot array
        '''
        labels_one_hot = [self.creat_label(label) for label in labels]

        return np.array(labels_one_hot)
--------------------------------------------------------------------------------
/TextClassification/TextClassification.py:
--------------------------------------------------------------------------------
from .DataPreprocess import DataPreprocess
from .net import CNN


class TextClassification():
    def __init__(self):
        self.preprocess = None
        self.model = None

    def get_preprocess(self, texts, labels, word_len=1, num_words=2000, sentence_len=30):
        # preprocess the data
        preprocess = DataPreprocess()

        # process the texts
        texts_cut = preprocess.cut_texts(texts, word_len)
        preprocess.train_tokenizer(texts_cut, num_words)
        texts_seq = preprocess.text2seq(texts_cut, sentence_len)

        # build the labels
        preprocess.creat_label_set(labels)
        labels = preprocess.creat_labels(labels)
        self.preprocess = preprocess

        return texts_seq, labels

    def fit(self, texts_seq, texts_labels, output_type, epochs, batch_size, model=None):
        if model is None:
            preprocess = self.preprocess
            model = CNN(preprocess.num_words,
                        preprocess.sentence_len,
                        128,
                        len(preprocess.label_set),
                        output_type)
        # train the network
        model.fit(texts_seq,
                  texts_labels,
                  epochs=epochs,
                  batch_size=batch_size)
        self.model = model

    def predict(self, texts):
        preprocess = self.preprocess
        word_len = preprocess.word_len
        sentence_len = preprocess.sentence_len

        # process the texts
        texts_cut = preprocess.cut_texts(texts, word_len)
        texts_seq = preprocess.text2seq(texts_cut, sentence_len)

        return self.model.predict(texts_seq)

    def label2toptag(self, predictions, labelset):
        # keep only the highest-probability label for each sample
        labels = []
        for prediction in predictions:
            label = labelset[prediction == prediction.max()]
            labels.append(label.tolist())
        return labels

    def label2half(self, predictions, labelset):
        # keep every label whose probability exceeds 0.5
        labels = []
        for prediction in predictions:
            label = labelset[prediction > 0.5]
            labels.append(label.tolist())
        return labels

    def label2tag(self, predictions, labelset):
        # labels above 0.5 if any, otherwise fall back to the top label
        labels1 = self.label2toptag(predictions, labelset)
        labels2 = self.label2half(predictions, labelset)
        labels = []
        for i in range(len(predictions)):
            if len(labels2[i]) == 0:
                labels.append(labels1[i])
            else:
                labels.append(labels2[i])
        return labels
--------------------------------------------------------------------------------
/TextClassification/__init__.py:
--------------------------------------------------------------------------------
from .DataPreprocess import DataPreprocess
from .net import CNN
from .TextClassification import TextClassification
from .load_data import load_data
--------------------------------------------------------------------------------
/TextClassification/load_data.py:
--------------------------------------------------------------------------------
import json
import pandas as pd
import os

localpath = os.path.dirname(__file__)


def load_data(data_type='single'):
    if data_type == 'single':
        # e-commerce reviews: a DataFrame with 'evaluation' and 'label' columns
        data = pd.read_csv(localpath + "/data/data_single.csv", encoding='utf8')
    elif data_type == 'multiple':
        # judicial charges: one JSON object per line, loaded as a list of dicts
        with open(localpath + '/data/data_multiple.json', mode='r', encoding='utf8') as f:
            data_raw = f.readlines()
        data = [json.loads(i) for i in data_raw]
    else:
        raise ValueError("data_type should be 'single' or 'multiple'")
    return data
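

if __name__ == '__main__':
    # Quick sanity check of both loaders (a sketch added for illustration;
    # it assumes the bundled data files are present).
    print(load_data('single').head())              # DataFrame: 'evaluation', 'label'
    print(load_data('multiple')[0]['accusation'])  # first record's charge list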
--------------------------------------------------------------------------------
/TextClassification/net.py:
--------------------------------------------------------------------------------
from keras.models import Model
from keras.layers import Dense, Embedding, Input
from keras.layers import Conv1D, GlobalMaxPool1D, Dropout


def CNN(input_dim,
        input_length,
        vec_size,
        output_shape,
        output_type='multiple'):
    '''
    Create the CNN: Embedding + Conv1D + GlobalMaxPool1D + Dense.
    Filter counts and the dropout rate can be changed in the code.

    :param input_dim: size of the vocabulary
    :param input_length: length of the input sequences
    :param vec_size: dimension of the dense embedding
    :param output_shape: number of output labels; targets should be one-hot encoded
    :param output_type: last-layer type, 'multiple' (activation="sigmoid") or 'single' (activation="softmax")
    :return: keras model
    '''
    data_input = Input(shape=[input_length])
    word_vec = Embedding(input_dim=input_dim + 1,
                         input_length=input_length,
                         output_dim=vec_size)(data_input)
    x = Conv1D(filters=128,
               kernel_size=[3],
               strides=1,
               padding='same',
               activation='relu')(word_vec)
    x = GlobalMaxPool1D()(x)
    x = Dense(500, activation='relu')(x)
    x = Dropout(0.1)(x)
    if output_type == 'multiple':
        # multi-label: independent sigmoid per label, binary cross-entropy
        x = Dense(output_shape, activation='sigmoid')(x)
        model = Model(inputs=data_input, outputs=x)
        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=['acc'])
    elif output_type == 'single':
        # single-label: softmax over labels, categorical cross-entropy
        x = Dense(output_shape, activation='softmax')(x)
        model = Model(inputs=data_input, outputs=x)
        model.compile(loss='categorical_crossentropy',
                      optimizer='adam',
                      metrics=['acc'])
    else:
        raise ValueError('output_type should be multiple or single')
    return model


if __name__ == '__main__':
    model = CNN(input_dim=10, input_length=10, vec_size=10, output_shape=10, output_type='multiple')
    model.summary()
--------------------------------------------------------------------------------
/demo_multiple.py:
--------------------------------------------------------------------------------
from TextClassification import load_data
from sklearn.model_selection import train_test_split
import tensorflow as tf
import pickle

sess = tf.InteractiveSession()

# load the data
data_type = 'multiple'
data = load_data(data_type)
x = [i['fact'] for i in data]
y = [i['accusation'] for i in data]

# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

##### training #####

from TextClassification import TextClassification

clf = TextClassification()
texts_seq, texts_labels = clf.get_preprocess(x_train, y_train, word_len=1, num_words=2000, sentence_len=50)
clf.fit(texts_seq, texts_labels, data_type, 3, 64)

# save the whole module, preprocessing and network included
with open('./%s.pkl' % data_type, 'wb') as f:
    pickle.dump(clf, f)

##### prediction #####

# load the model saved above
with open('./%s.pkl' % data_type, 'rb') as f:
    clf = pickle.load(f)
y_predict = clf.predict(x_test)
y_predict = clf.label2tag(y_predict, clf.preprocess.label_set)
# exact-match accuracy: the predicted label list must equal the true list
score = sum([y_predict[i] == y_test[i] for i in range(len(y_predict))]) / len(y_predict)
print(score)  # 0.2136
--------------------------------------------------------------------------------
/demo_single.py:
--------------------------------------------------------------------------------
from TextClassification import load_data
from sklearn.model_selection import train_test_split
import tensorflow as tf
import pickle
import numpy as np

sess = tf.InteractiveSession()

# load the data
data_type = 'single'
data = load_data(data_type)
x = data['evaluation']
y = [[i] for i in data['label']]

# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

##### training #####
from TextClassification import TextClassification

clf = TextClassification()
texts_seq, texts_labels = clf.get_preprocess(x_train, y_train,
                                             word_len=1,
                                             num_words=2000,
                                             sentence_len=50)
clf.fit(texts_seq=texts_seq,
        texts_labels=texts_labels,
        output_type=data_type,
        epochs=10,
        batch_size=64,
        model=None)

# save the whole module, preprocessing and network included
with open('./%s.pkl' % data_type, 'wb') as f:
    pickle.dump(clf, f)

##### prediction #####

# load the model saved above
with open('./%s.pkl' % data_type, 'rb') as f:
    clf = pickle.load(f)
y_predict = clf.predict(x_test)
# single label: take the argmax of each probability vector
y_predict = [[clf.preprocess.label_set[i.argmax()]] for i in y_predict]
score = sum(y_predict == np.array(y_test)) / len(y_test)
print(score)  # 0.9288
--------------------------------------------------------------------------------
/picture/data_multiple.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Text-Classification/6ec5483623f4060f62d955795d5e2791635ca512/picture/data_multiple.png
--------------------------------------------------------------------------------
/picture/data_single.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/renjunxiang/Text-Classification/6ec5483623f4060f62d955795d5e2791635ca512/picture/data_single.png
--------------------------------------------------------------------------------