├── README.md
├── imgs
│   ├── 01.png
│   ├── 02.png
│   ├── HELLONLP.png
│   └── README.md
├── sentiment_analysis_albert
│   ├── README.md
│   ├── __init__.py
│   ├── albert_small_zh_google
│   │   └── README.md
│   ├── classifier_utils.py
│   ├── data
│   │   ├── README.md
│   │   ├── sa_test.csv
│   │   └── sa_train.csv
│   ├── hyperparameters.py
│   ├── image
│   │   └── model.png
│   ├── lamb_optimizer.py
│   ├── logdir
│   │   └── model_01
│   │       ├── events.out.tfevents.1592553634.DESKTOP-QC1A83I
│   │       └── events.out.tfevents.1592553671.DESKTOP-QC1A83I
│   ├── model
│   │   ├── model_load
│   │   │   └── README.md
│   │   └── model_save
│   │       └── README.md
│   ├── modeling.py
│   ├── modules.py
│   ├── networks.py
│   ├── optimization.py
│   ├── predict.py
│   ├── requirements.txt
│   ├── tokenization.py
│   ├── train.py
│   └── utils.py
├── sentiment_analysis_albert_emoji
│   ├── README.md
│   ├── __init__.py
│   ├── albert_base_zh
│   │   └── README.md
│   ├── albert_small_zh_google
│   │   └── README.md
│   ├── classifier_utils.py
│   ├── data
│   │   ├── README.md
│   │   ├── sa_test.csv
│   │   └── sa_train.csv
│   ├── hyperparameters.py
│   ├── lamb_optimizer.py
│   ├── model
│   │   ├── model_load
│   │   │   └── README.md
│   │   └── model_save
│   │       └── README.md
│   ├── modeling.py
│   ├── modules.py
│   ├── networks.py
│   ├── optimization.py
│   ├── predict.py
│   ├── requirements.txt
│   ├── tokenization.py
│   ├── train.py
│   └── utils.py
├── sentiment_analysis_bayes
│   ├── README.md
│   ├── bayes.py
│   ├── data
│   │   ├── test.zip
│   │   ├── test_feature.zip
│   │   ├── test_label.zip
│   │   ├── train.zip
│   │   ├── train_feature.zip
│   │   └── train_label.zip
│   ├── dict
│   │   ├── stopwords.txt
│   │   └── vocabulary_pearson_40000.txt
│   ├── hyperparameters.py
│   ├── load.py
│   ├── model
│   │   ├── class.txt
│   │   ├── p0.txt
│   │   └── p1.txt
│   ├── predict.py
│   ├── prepare.py
│   ├── train.py
│   └── utils.py
└── sentiment_analysis_dict
    ├── README.md
    ├── dict
    │   ├── insufficiently.txt
    │   ├── inverse.txt
    │   ├── ish.txt
    │   ├── jieba_sentiment.txt
    │   ├── more.txt
    │   ├── most.txt
    │   ├── negative.txt
    │   ├── not.txt
    │   ├── over.txt
    │   ├── ponctuation_sentiment.txt
    │   ├── positive.txt
    │   └── very.txt
    ├── hyperparameters.py
    ├── networks.py
    ├── preidict.py
    └── utils.py
/README.md:
--------------------------------------------------------------------------------
1 | # Sentiment Analysis: 情感分析
2 |
3 | [Python 3.7.6](https://www.python.org/downloads/release/python-376/)
4 |
5 |
6 |
7 |
8 |
9 |
10 | ## I. Introduction
11 | ### 1. Text classification
12 | Text classification is the most fundamental task in natural language processing (NLP); in a sense, almost every NLP task either is a classification task or involves the notion of classification. Sentiment analysis is a major branch of text classification with a very wide range of applications.
13 | ### 2. Sentiment analysis
14 | We group Chinese-text sentiment analysis into three broad types: the first uses sentiment lexicons and sentence-structure rules; the second uses traditional machine learning, e.g. Bayes or SVM; the third uses deep learning, e.g. LSTM, CNN, LSTM+CNN, or BERT+CNN.
15 | Of these, the first requires neither manual annotation nor training, while the second and third both require large amounts of manually labeled data for supervised model training.
16 |
17 |
18 |
19 | ## II. Algorithms
20 |
21 | **Four implementations**
22 | ```
23 | ├── sentiment-analysis
24 | └── sentiment_analysis_dict
25 | └── sentiment_analysis_bayes
26 | └── sentiment_analysis_albert
27 | └── sentiment_analysis_albert_emoji
28 | ```
29 |
30 | ### 1. sentiment_analysis_dict
31 | A dictionary (sentiment lexicon) based method; a minimal sketch of the idea follows.
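To make the idea concrete, here is a self-contained sketch of lexicon-based scoring. It is an illustration only, not the code in `sentiment_analysis_dict`; the tiny in-line word sets stand in for resources such as `dict/positive.txt`, `dict/negative.txt`, `dict/not.txt` and `dict/very.txt`, whose exact formats may differ.

```python
# Minimal lexicon-based scoring sketch (illustration only).
POSITIVE = {'好看', '流畅', '满意'}       # stand-in for dict/positive.txt
NEGATIVE = {'卡顿', '失望', '垃圾'}       # stand-in for dict/negative.txt
NEGATION = {'不', '没', '不是'}           # stand-in for dict/not.txt
DEGREE   = {'很': 1.5, '非常': 2.0}       # stand-in for dict/very.txt

def score(words):
    """Score a segmented sentence: > 0 positive, < 0 negative, ~0 neutral."""
    total = 0.0
    for i, word in enumerate(words):
        polarity = 1.0 if word in POSITIVE else -1.0 if word in NEGATIVE else 0.0
        if polarity == 0.0:
            continue
        weight = 1.0
        for prev in words[max(0, i - 2):i]:   # look back at most two tokens
            if prev in NEGATION:
                weight *= -1.0                # negation flips the polarity
            elif prev in DEGREE:
                weight *= DEGREE[prev]        # degree adverbs scale it
        total += weight * polarity
    return total

print(score(['紫色', '真的', '很', '好看']))   # 1.5  -> positive
print(score(['一点', '也', '不', '流畅']))     # -1.0 -> negative
```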
32 |
33 | ### 2. sentiment_analysis_bayes
34 | A traditional machine-learning method based on **Bayes**; a generic naive-Bayes illustration follows.
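The sketch below shows the general idea with scikit-learn; it is a generic illustration, not this repository's own implementation, which computes and stores its statistics itself (see `model/class.txt`, `model/p0.txt`, `model/p1.txt`).

```python
# Generic naive Bayes sentiment sketch (illustration only; the examples are
# taken from the sa_train/sa_test data in this repository).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts  = ['超划算的', '很喜欢这个牌子的啊', '简直就是卡死', '非常失望的一次购物']
train_labels = [1, 1, -1, -1]          # 1 = positive, -1 = negative

# Character n-grams keep the sketch free of a separate word segmenter.
model = make_pipeline(CountVectorizer(analyzer='char', ngram_range=(1, 2)),
                      MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(['紫色真的很好看', '客服售后态度极差']))
```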
35 |
36 | ### 3. sentiment_analysis_albert
37 | A deep-learning method that uses the **ALBERT** language model with a **TextCNN** downstream network.
38 |
39 | ### 4. sentiment_analysis_albert_emoji
40 | A deep-learning method that uses the **ALBERT** language model with a **TextCNN** downstream network.
41 | It additionally introduces **unknown tokens** (emoji are one example) and learns their semantic vectors during fine-tuning, so that the sentiment carried by tokens outside the pre-trained vocabulary can be recognized; a sketch of one way to do this follows.
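One common way to do this (a sketch under the assumption that `vocab_chinese.txt` reserves `[unusedN]` slots; the actual handling in `sentiment_analysis_albert_emoji` may differ) is to give each emoji its own vocabulary entry, so that a dedicated embedding row is updated during fine-tuning:

```python
# Map each emoji onto a reserved [unusedN] vocabulary slot so the tokenizer
# assigns it a dedicated id whose embedding is learned during fine-tuning.
# Paths and the emoji list are illustrative.
emojis = ['😂', '😡', '👍']
vocab_in  = 'sentiment_analysis_albert_emoji/albert_small_zh_google/vocab_chinese.txt'
vocab_out = 'sentiment_analysis_albert_emoji/albert_small_zh_google/vocab_with_emoji.txt'

with open(vocab_in, encoding='utf-8') as f:
    vocab = [line.rstrip('\n') for line in f]

unused_slots = [i for i, token in enumerate(vocab) if token.startswith('[unused')]
for slot, emoji in zip(unused_slots, emojis):
    vocab[slot] = emoji                      # the emoji now has its own token id

with open(vocab_out, 'w', encoding='utf-8') as f:
    f.write('\n'.join(vocab) + '\n')
```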
42 |
43 |
44 |
45 | ## References
46 | [Lexicon-based sentiment analysis of Chinese text (with code)](https://zhuanlan.zhihu.com/p/142011031)
47 | [Text classification with ALBERT+TextCNN for Chinese sentiment analysis (with code)](https://zhuanlan.zhihu.com/p/149491055)
48 | [Chinese sentiment analysis with emoji](https://zhuanlan.zhihu.com/p/338806367)
49 |
--------------------------------------------------------------------------------
/imgs/01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/imgs/01.png
--------------------------------------------------------------------------------
/imgs/02.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/imgs/02.png
--------------------------------------------------------------------------------
/imgs/HELLONLP.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/imgs/HELLONLP.png
--------------------------------------------------------------------------------
/imgs/README.md:
--------------------------------------------------------------------------------
1 | # imgs
2 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert/README.md:
--------------------------------------------------------------------------------
1 | # Introduction
2 | 1. This project is trained and tested on TensorFlow 1.15.0.
3 | 2. It performs Chinese sentiment analysis as a multi-class text classification task with three labels: 1, 0 and -1, standing for positive, neutral and negative sentiment respectively.
4 | 3. Feel free to get in touch: www.hellonlp.com
5 | 4. Baidu Cloud download for albert_small_zh_google:
6 | Link: https://pan.baidu.com/s/1RKzGJTazlZ7y12YRbAWvyA
7 | Extraction code: wuxw
8 |
9 | # Usage
10 | 1. Prepare the data
11 | The data format follows sentiment_analysis_albert/data/sa_test.csv (the snippet after this list shows how a sentence becomes model inputs).
12 | 2. Set the parameters
13 | Edit the values in hyperparameters.py directly.
14 | 3. Training
15 | python train.py
16 | 4. Inference
17 | python predict.py
18 |
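As a quick sanity check of the data pipeline, a single sentence can be converted into ALBERT input features with the helper defined in `classifier_utils.py` (a minimal sketch; `predict.py` remains the actual inference entry point):

```python
from sentiment_analysis_albert.classifier_utils import get_feature_test
from sentiment_analysis_albert.hyperparameters import Hyperparamters as hp

input_ids, input_mask, segment_ids, label_id = get_feature_test('紫色真的很好看')
assert len(input_ids) == hp.sequence_length   # padded/truncated to 60 by default
```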
19 | # Code walkthrough on Zhihu
20 | https://zhuanlan.zhihu.com/p/149491055
21 |
22 |
23 |
24 |
25 |
26 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert/__init__.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Team Authors.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert/albert_small_zh_google/README.md:
--------------------------------------------------------------------------------
1 | # Language model
2 |
3 | ## ALBERT small Chinese
4 | ```
5 | albert_small_zh_google/albert_config.json
6 | albert_small_zh_google/albert_model.ckpt.data-00000-of-00001
7 | albert_small_zh_google/albert_model.ckpt.index
8 | albert_small_zh_google/albert_model.ckpt.meta
9 | albert_small_zh_google/checkpoint
10 | albert_small_zh_google/vocab_chinese.txt
11 | ```
12 |
13 | ## Download
14 | Link: https://pan.baidu.com/s/1rfnQewbUXUl5c4jvXEQFFg?pwd=igey
15 | Extraction code: igey
16 |
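`hyperparameters.py` builds its paths from `name = 'albert_small_zh_google'`, so keep the folder name unchanged (or update `name`) after extracting the download. A small illustrative check, not part of the project, that the pretrained files are in place:

```python
# Illustrative check only: verify the pretrained ALBERT files sit where
# hyperparameters.py expects them (hp.bert_path, hp.vocab_file, hp.init_checkpoint).
import os
from sentiment_analysis_albert.hyperparameters import Hyperparamters as hp

for name in ('albert_config.json', 'vocab_chinese.txt', 'albert_model.ckpt.index'):
    print(name, os.path.exists(os.path.join(hp.bert_path, name)))
```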
--------------------------------------------------------------------------------
/sentiment_analysis_albert/classifier_utils.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon Nov 12 14:23:12 2018
4 |
5 | @author: cm
6 | """
7 |
8 |
9 | import os
10 | import csv
11 | import random
12 | import collections
13 | import tensorflow_hub as hub
14 | import tensorflow.compat.v1 as tf
15 | from tensorflow.contrib import tpu as contrib_tpu
16 | from tensorflow.contrib import data as contrib_data
17 | from tensorflow.contrib import metrics as contrib_metrics
18 |
19 | from sentiment_analysis_albert import modeling
20 | from sentiment_analysis_albert import optimization
21 | from sentiment_analysis_albert import tokenization
22 | from sentiment_analysis_albert.utils import load_csv
23 | from sentiment_analysis_albert.hyperparameters import Hyperparamters as hp
24 |
25 |
26 | def index2label(index):
27 | return hp.dict_label[str(index)]
28 |
29 |
30 | def read_csv(input_file):
31 | """Reads a comma-separated value (CSV) file."""
32 | df = load_csv(input_file,header=0)
33 | jobcontent = df['content'].tolist()
34 | jlabel = df['label'].tolist()
35 | lines = [[str(jlabel[i]),str(jobcontent[i])] for i in range(len(jobcontent))]
36 | print('Read csv finished!(1)')
37 | lines2 = [ [list(hp.dict_label.keys())[list(hp.dict_label.values()).index( l[0])], l[1]] for l in lines if type(l[1])==str]
38 | random.shuffle(lines2)
39 | return lines2
40 |
41 |
42 | class InputExample(object):
43 | """A single training/test example for simple sequence classification."""
44 |
45 | def __init__(self, guid, text_a, text_b=None, label=None):
46 | """Constructs an InputExample.
47 |
48 | Args:
49 | guid: Unique id for the example.
50 | text_a: string. The untokenized text of the first sequence. For single
51 | sequence tasks, only this sequence must be specified.
52 | text_b: (Optional) string. The untokenized text of the second sequence.
53 | Only must be specified for sequence pair tasks.
54 | label: (Optional) string. The label of the example. This should be
55 | specified for train and dev examples, but not for test examples.
56 | """
57 | self.guid = guid
58 | self.text_a = text_a
59 | self.text_b = text_b
60 | self.label = label
61 |
62 |
63 | class PaddingInputExample(object):
64 | """Fake example so the num input examples is a multiple of the batch size.
65 |
66 | When running eval/predict on the TPU, we need to pad the number of examples
67 | to be a multiple of the batch size, because the TPU requires a fixed batch
68 | size. The alternative is to drop the last batch, which is bad because it means
69 | the entire output data won't be generated.
70 |
71 | We use this class instead of `None` because treating `None` as padding
72 | batches could cause silent errors.
73 | """
74 |
75 |
76 | class InputFeatures(object):
77 | """A single set of features of data."""
78 |
79 | def __init__(self,
80 | input_ids,
81 | input_mask,
82 | segment_ids,
83 | label_id,
84 | guid=None,
85 | example_id=None,
86 | is_real_example=True):
87 | self.input_ids = input_ids
88 | self.input_mask = input_mask
89 | self.segment_ids = segment_ids
90 | self.label_id = label_id
91 | self.example_id = example_id
92 | self.guid = guid
93 | self.is_real_example = is_real_example
94 |
95 |
96 | class DataProcessor(object):
97 | """Base class for data converters for sequence classification data sets."""
98 |
99 | def __init__(self, use_spm, do_lower_case):
100 | super(DataProcessor, self).__init__()
101 | self.use_spm = use_spm
102 | self.do_lower_case = do_lower_case
103 |
104 | def get_train_examples(self, data_dir):
105 | """Gets a collection of `InputExample`s for the train set."""
106 | raise NotImplementedError()
107 |
108 | def get_dev_examples(self, data_dir):
109 | """Gets a collection of `InputExample`s for the dev set."""
110 | raise NotImplementedError()
111 |
112 | def get_test_examples(self, data_dir):
113 | """Gets a collection of `InputExample`s for prediction."""
114 | raise NotImplementedError()
115 |
116 | def get_labels(self):
117 | """Gets the list of labels for this data set."""
118 | raise NotImplementedError()
119 |
120 | @classmethod
121 | def _read_tsv(cls, input_file, quotechar=None):
122 | """Reads a tab separated value file."""
123 | with tf.gfile.Open(input_file, "r") as f:
124 | reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
125 | lines = []
126 | for line in reader:
127 | lines.append(line)
128 | return lines
129 |
130 | @classmethod
131 |   def _read_csv(cls,input_file):  # project-specific CSV reader
132 | """Reads a comma-separated value (CSV) file."""
133 | df = load_csv(input_file,header=0)
134 | jobcontent = df['content'].tolist()
135 | jlabel = df['label'].tolist()
136 | lines = [[str(jlabel[i]),str(jobcontent[i])] for i in range(len(jobcontent))]
137 | lines2 = [ [list(hp.dict_label.keys())[list(hp.dict_label.values()).index( l[0])], l[1]] for l in lines if type(l[1])==str]
138 | random.shuffle(lines2)
139 | return lines2
140 |
141 | class ClassifyProcessor(DataProcessor):
142 | """Processor for the MRPC data set (GLUE version)."""
143 |
144 | def __init__(self):
145 | self.labels = set()
146 |
147 | def get_train_examples(self, data_dir):
148 | """See base class."""
149 | print('*'*30)
150 | return self._create_examples(
151 | self._read_csv(os.path.join(data_dir, hp.train_data)), "train")
152 |
153 | def get_dev_examples(self, data_dir):
154 | """See base class."""
155 | return self._create_examples(
156 | self._read_csv(os.path.join(data_dir, hp.test_data)), "dev")
157 |
158 | def get_test_examples(self, data_dir):
159 | """See base class."""
160 | return self._create_examples(
161 | self._read_csv(os.path.join(data_dir, hp.test_data)), "test")
162 |
163 | def get_labels(self):
164 | """See base class."""
165 | return ['0','1','2']
166 |
167 | def _create_examples(self, lines, set_type):
168 | """Creates examples for the training and dev sets."""
169 | examples = []
170 | for (i, line) in enumerate(lines):
171 | guid = "%s-%s" % (set_type, i)
172 | text_a = tokenization.convert_to_unicode(line[1])
173 | label = tokenization.convert_to_unicode(line[0])
174 | self.labels.add(label)
175 | # print(self.labels)
176 | examples.append(
177 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
178 | # print(examples)
179 | return examples
180 |
181 |
182 | def convert_single_example(ex_index, example, label_list, max_seq_length,
183 | tokenizer, task_name):
184 | """Converts a single `InputExample` into a single `InputFeatures`."""
185 |
186 | if isinstance(example, PaddingInputExample):
187 | return InputFeatures(
188 | input_ids=[0] * max_seq_length,
189 | input_mask=[0] * max_seq_length,
190 | segment_ids=[0] * max_seq_length,
191 | label_id=0,
192 | is_real_example=False)
193 |
194 | if task_name != "sts-b":
195 | label_map = {}
196 | for (i, label) in enumerate(label_list):
197 | label_map[label] = i
198 |
199 | tokens_a = tokenizer.tokenize(example.text_a)
200 | tokens_b = None
201 | if example.text_b:
202 | tokens_b = tokenizer.tokenize(example.text_b)
203 |
204 | if tokens_b:
205 | # Modifies `tokens_a` and `tokens_b` in place so that the total
206 | # length is less than the specified length.
207 | # Account for [CLS], [SEP], [SEP] with "- 3"
208 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
209 | else:
210 | # Account for [CLS] and [SEP] with "- 2"
211 | if len(tokens_a) > max_seq_length - 2:
212 | tokens_a = tokens_a[0:(max_seq_length - 2)]
213 |
214 | # The convention in ALBERT is:
215 | # (a) For sequence pairs:
216 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
217 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
218 | # (b) For single sequences:
219 | # tokens: [CLS] the dog is hairy . [SEP]
220 | # type_ids: 0 0 0 0 0 0 0
221 | #
222 | # Where "type_ids" are used to indicate whether this is the first
223 | # sequence or the second sequence. The embedding vectors for `type=0` and
224 | # `type=1` were learned during pre-training and are added to the
225 | # embedding vector (and position vector). This is not *strictly* necessary
226 | # since the [SEP] token unambiguously separates the sequences, but it makes
227 | # it easier for the model to learn the concept of sequences.
228 | #
229 | # For classification tasks, the first vector (corresponding to [CLS]) is
230 | # used as the "sentence vector". Note that this only makes sense because
231 | # the entire model is fine-tuned.
232 | tokens = []
233 | segment_ids = []
234 | tokens.append("[CLS]")
235 | segment_ids.append(0)
236 | for token in tokens_a:
237 | tokens.append(token)
238 | segment_ids.append(0)
239 | tokens.append("[SEP]")
240 | segment_ids.append(0)
241 |
242 | if tokens_b:
243 | for token in tokens_b:
244 | tokens.append(token)
245 | segment_ids.append(1)
246 | tokens.append("[SEP]")
247 | segment_ids.append(1)
248 |
249 | input_ids = tokenizer.convert_tokens_to_ids(tokens)
250 |
251 | # The mask has 1 for real tokens and 0 for padding tokens. Only real
252 | # tokens are attended to.
253 | input_mask = [1] * len(input_ids)
254 |
255 | # Zero-pad up to the sequence length.
256 | while len(input_ids) < max_seq_length:
257 | input_ids.append(0)
258 | input_mask.append(0)
259 | segment_ids.append(0)
260 |
261 | assert len(input_ids) == max_seq_length
262 | assert len(input_mask) == max_seq_length
263 | assert len(segment_ids) == max_seq_length
264 |
265 | if task_name != "sts-b":
266 | label_id = label_map[example.label]
267 | else:
268 | label_id = example.label
269 |
270 | feature = InputFeatures(
271 | input_ids=input_ids,
272 | input_mask=input_mask,
273 | segment_ids=segment_ids,
274 | label_id=label_id,
275 | is_real_example=True)
276 | return feature
277 |
278 |
279 | def file_based_convert_examples_to_features(
280 | examples, label_list, max_seq_length, tokenizer, output_file, task_name):
281 | """Convert a set of `InputExample`s to a TFRecord file."""
282 |
283 | writer = tf.python_io.TFRecordWriter(output_file)
284 |
285 | for (ex_index, example) in enumerate(examples):
286 | if ex_index % 10000 == 0:
287 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
288 |
289 | feature = convert_single_example(ex_index, example, label_list,
290 | max_seq_length, tokenizer, task_name)
291 |
292 | def create_int_feature(values):
293 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
294 | return f
295 |
296 | def create_float_feature(values):
297 | f = tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))
298 | return f
299 |
300 | features = collections.OrderedDict()
301 | features["input_ids"] = create_int_feature(feature.input_ids)
302 | features["input_mask"] = create_int_feature(feature.input_mask)
303 | features["segment_ids"] = create_int_feature(feature.segment_ids)
304 | features["label_ids"] = create_float_feature([feature.label_id])\
305 | if task_name == "sts-b" else create_int_feature([feature.label_id])
306 | features["is_real_example"] = create_int_feature(
307 | [int(feature.is_real_example)])
308 |
309 | tf_example = tf.train.Example(features=tf.train.Features(feature=features))
310 | writer.write(tf_example.SerializeToString())
311 | writer.close()
312 |
313 |
314 | def file_based_input_fn_builder(input_file, seq_length, is_training,
315 | drop_remainder, task_name, use_tpu, bsz,
316 | multiple=1):
317 | """Creates an `input_fn` closure to be passed to TPUEstimator."""
318 | labeltype = tf.float32 if task_name == "sts-b" else tf.int64
319 |
320 | name_to_features = {
321 | "input_ids": tf.FixedLenFeature([seq_length * multiple], tf.int64),
322 | "input_mask": tf.FixedLenFeature([seq_length * multiple], tf.int64),
323 | "segment_ids": tf.FixedLenFeature([seq_length * multiple], tf.int64),
324 | "label_ids": tf.FixedLenFeature([], labeltype),
325 | "is_real_example": tf.FixedLenFeature([], tf.int64),
326 | }
327 |
328 | def _decode_record(record, name_to_features):
329 | """Decodes a record to a TensorFlow example."""
330 | example = tf.parse_single_example(record, name_to_features)
331 |
332 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
333 | # So cast all int64 to int32.
334 | for name in list(example.keys()):
335 | t = example[name]
336 | if t.dtype == tf.int64:
337 | t = tf.to_int32(t)
338 | example[name] = t
339 |
340 | return example
341 |
342 | def input_fn(params):
343 | """The actual input function."""
344 | if use_tpu:
345 | batch_size = params["batch_size"]
346 | else:
347 | batch_size = bsz
348 |
349 | # For training, we want a lot of parallel reading and shuffling.
350 | # For eval, we want no shuffling and parallel reading doesn't matter.
351 | d = tf.data.TFRecordDataset(input_file)
352 | if is_training:
353 | d = d.repeat()
354 | d = d.shuffle(buffer_size=100)
355 |
356 | d = d.apply(
357 | contrib_data.map_and_batch(
358 | lambda record: _decode_record(record, name_to_features),
359 | batch_size=batch_size,
360 | drop_remainder=drop_remainder))
361 |
362 | return d
363 |
364 | return input_fn
365 |
366 |
367 | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
368 | """Truncates a sequence pair in place to the maximum length."""
369 |
370 | # This is a simple heuristic which will always truncate the longer sequence
371 | # one token at a time. This makes more sense than truncating an equal percent
372 | # of tokens from each, since if one sequence is very short then each token
373 | # that's truncated likely contains more information than a longer sequence.
374 | while True:
375 | total_length = len(tokens_a) + len(tokens_b)
376 | if total_length <= max_length:
377 | break
378 | if len(tokens_a) > len(tokens_b):
379 | tokens_a.pop()
380 | else:
381 | tokens_b.pop()
382 |
383 |
384 | def _create_model_from_hub(hub_module, is_training, input_ids, input_mask,
385 | segment_ids):
386 | """Creates an ALBERT model from TF-Hub."""
387 | tags = set()
388 | if is_training:
389 | tags.add("train")
390 | albert_module = hub.Module(hub_module, tags=tags, trainable=True)
391 | albert_inputs = dict(
392 | input_ids=input_ids,
393 | input_mask=input_mask,
394 | segment_ids=segment_ids)
395 | albert_outputs = albert_module(
396 | inputs=albert_inputs,
397 | signature="tokens",
398 | as_dict=True)
399 | output_layer = albert_outputs["pooled_output"]
400 | return output_layer
401 |
402 |
403 | def _create_model_from_scratch(albert_config, is_training, input_ids,
404 | input_mask, segment_ids, use_one_hot_embeddings):
405 | """Creates an ALBERT model from scratch (as opposed to hub)."""
406 | model = modeling.AlbertModel(
407 | config=albert_config,
408 | is_training=is_training,
409 | input_ids=input_ids,
410 | input_mask=input_mask,
411 | token_type_ids=segment_ids,
412 | use_one_hot_embeddings=use_one_hot_embeddings)
413 | output_layer = model.get_pooled_output()
414 | return output_layer
415 |
416 |
417 | def create_model(albert_config, is_training, input_ids, input_mask, segment_ids,
418 | labels, num_labels, use_one_hot_embeddings, task_name,
419 | hub_module):
420 | """Creates a classification model."""
421 | if hub_module:
422 | tf.logging.info("creating model from hub_module: %s", hub_module)
423 | output_layer = _create_model_from_hub(hub_module, is_training, input_ids,
424 | input_mask, segment_ids)
425 | else:
426 | tf.logging.info("creating model from albert_config")
427 | output_layer = _create_model_from_scratch(albert_config, is_training,
428 | input_ids, input_mask,
429 | segment_ids,
430 | use_one_hot_embeddings)
431 |
432 | hidden_size = output_layer.shape[-1].value
433 |
434 | output_weights = tf.get_variable(
435 | "output_weights", [num_labels, hidden_size],
436 | initializer=tf.truncated_normal_initializer(stddev=0.02))
437 |
438 | output_bias = tf.get_variable(
439 | "output_bias", [num_labels], initializer=tf.zeros_initializer())
440 |
441 | with tf.variable_scope("loss"):
442 | if is_training:
443 | # I.e., 0.1 dropout
444 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
445 |
446 | logits = tf.matmul(output_layer, output_weights, transpose_b=True)
447 | logits = tf.nn.bias_add(logits, output_bias)
448 | if task_name != "sts-b":
449 | probabilities = tf.nn.softmax(logits, axis=-1)
450 | predictions = tf.argmax(probabilities, axis=-1, output_type=tf.int32)
451 | log_probs = tf.nn.log_softmax(logits, axis=-1)
452 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
453 |
454 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
455 | else:
456 | probabilities = logits
457 | logits = tf.squeeze(logits, [-1])
458 | predictions = logits
459 | per_example_loss = tf.square(logits - labels)
460 | loss = tf.reduce_mean(per_example_loss)
461 |
462 | return (loss, per_example_loss, probabilities, logits, predictions)
463 |
464 |
465 | def model_fn_builder(albert_config, num_labels, init_checkpoint, learning_rate,
466 | num_train_steps, num_warmup_steps, use_tpu,
467 | use_one_hot_embeddings, task_name, hub_module=None,
468 | optimizer="adamw"):
469 | """Returns `model_fn` closure for TPUEstimator."""
470 |
471 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument
472 | """The `model_fn` for TPUEstimator."""
473 |
474 | tf.logging.info("*** Features ***")
475 | for name in sorted(features.keys()):
476 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
477 |
478 | input_ids = features["input_ids"]
479 | input_mask = features["input_mask"]
480 | segment_ids = features["segment_ids"]
481 | label_ids = features["label_ids"]
482 | is_real_example = None
483 | if "is_real_example" in features:
484 | is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
485 | else:
486 | is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32)
487 |
488 | is_training = (mode == tf.estimator.ModeKeys.TRAIN)
489 |
490 | (total_loss, per_example_loss, probabilities, logits, predictions) = \
491 | create_model(albert_config, is_training, input_ids, input_mask,
492 | segment_ids, label_ids, num_labels,
493 | use_one_hot_embeddings, task_name, hub_module)
494 |
495 | tvars = tf.trainable_variables()
496 | initialized_variable_names = {}
497 | scaffold_fn = None
498 | if init_checkpoint:
499 | (assignment_map, initialized_variable_names
500 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
501 | if use_tpu:
502 |
503 | def tpu_scaffold():
504 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
505 | return tf.train.Scaffold()
506 |
507 | scaffold_fn = tpu_scaffold
508 | else:
509 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
510 |
511 | tf.logging.info("**** Trainable Variables ****")
512 | for var in tvars:
513 | init_string = ""
514 | if var.name in initialized_variable_names:
515 | init_string = ", *INIT_FROM_CKPT*"
516 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
517 | init_string)
518 |
519 | output_spec = None
520 | if mode == tf.estimator.ModeKeys.TRAIN:
521 |
522 | train_op = optimization.create_optimizer(
523 | total_loss, learning_rate, num_train_steps, num_warmup_steps,
524 | use_tpu, optimizer)
525 |
526 | output_spec = contrib_tpu.TPUEstimatorSpec(
527 | mode=mode,
528 | loss=total_loss,
529 | train_op=train_op,
530 | scaffold_fn=scaffold_fn)
531 | elif mode == tf.estimator.ModeKeys.EVAL:
532 | if task_name not in ["sts-b", "cola"]:
533 | def metric_fn(per_example_loss, label_ids, logits, is_real_example):
534 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
535 | accuracy = tf.metrics.accuracy(
536 | labels=label_ids, predictions=predictions,
537 | weights=is_real_example)
538 | loss = tf.metrics.mean(
539 | values=per_example_loss, weights=is_real_example)
540 | return {
541 | "eval_accuracy": accuracy,
542 | "eval_loss": loss,
543 | }
544 | elif task_name == "sts-b":
545 | def metric_fn(per_example_loss, label_ids, logits, is_real_example):
546 | """Compute Pearson correlations for STS-B."""
547 | # Display labels and predictions
548 | concat1 = contrib_metrics.streaming_concat(logits)
549 | concat2 = contrib_metrics.streaming_concat(label_ids)
550 |
551 | # Compute Pearson correlation
552 | pearson = contrib_metrics.streaming_pearson_correlation(
553 | logits, label_ids, weights=is_real_example)
554 |
555 | # Compute MSE
556 | # mse = tf.metrics.mean(per_example_loss)
557 | mse = tf.metrics.mean_squared_error(
558 | label_ids, logits, weights=is_real_example)
559 |
560 | loss = tf.metrics.mean(
561 | values=per_example_loss,
562 | weights=is_real_example)
563 |
564 | return {"pred": concat1, "label_ids": concat2, "pearson": pearson,
565 | "MSE": mse, "eval_loss": loss,}
566 | elif task_name == "cola":
567 | def metric_fn(per_example_loss, label_ids, logits, is_real_example):
568 | """Compute Matthew's correlations for STS-B."""
569 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
570 | # https://en.wikipedia.org/wiki/Matthews_correlation_coefficient
571 | tp, tp_op = tf.metrics.true_positives(
572 | predictions, label_ids, weights=is_real_example)
573 | tn, tn_op = tf.metrics.true_negatives(
574 | predictions, label_ids, weights=is_real_example)
575 | fp, fp_op = tf.metrics.false_positives(
576 | predictions, label_ids, weights=is_real_example)
577 | fn, fn_op = tf.metrics.false_negatives(
578 | predictions, label_ids, weights=is_real_example)
579 |
580 | # Compute Matthew's correlation
581 | mcc = tf.div_no_nan(
582 | tp * tn - fp * fn,
583 | tf.pow((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn), 0.5))
584 |
585 | # Compute accuracy
586 | accuracy = tf.metrics.accuracy(
587 | labels=label_ids, predictions=predictions,
588 | weights=is_real_example)
589 |
590 | loss = tf.metrics.mean(
591 | values=per_example_loss,
592 | weights=is_real_example)
593 |
594 | return {"matthew_corr": (mcc, tf.group(tp_op, tn_op, fp_op, fn_op)),
595 | "eval_accuracy": accuracy, "eval_loss": loss,}
596 |
597 | eval_metrics = (metric_fn,
598 | [per_example_loss, label_ids, logits, is_real_example])
599 | output_spec = contrib_tpu.TPUEstimatorSpec(
600 | mode=mode,
601 | loss=total_loss,
602 | eval_metrics=eval_metrics,
603 | scaffold_fn=scaffold_fn)
604 | else:
605 | output_spec = contrib_tpu.TPUEstimatorSpec(
606 | mode=mode,
607 | predictions={
608 | "probabilities": probabilities,
609 | "predictions": predictions
610 | },
611 | scaffold_fn=scaffold_fn)
612 | return output_spec
613 |
614 | return model_fn
615 |
616 |
617 | # This function is not used by this file but is still used by the Colab and
618 | # people who depend on it.
619 | def input_fn_builder(features, seq_length, is_training, drop_remainder):
620 | """Creates an `input_fn` closure to be passed to TPUEstimator."""
621 |
622 | all_input_ids = []
623 | all_input_mask = []
624 | all_segment_ids = []
625 | all_label_ids = []
626 |
627 | for feature in features:
628 | all_input_ids.append(feature.input_ids)
629 | all_input_mask.append(feature.input_mask)
630 | all_segment_ids.append(feature.segment_ids)
631 | all_label_ids.append(feature.label_id)
632 |
633 | def input_fn(params):
634 | """The actual input function."""
635 | batch_size = params["batch_size"]
636 |
637 | num_examples = len(features)
638 |
639 | # This is for demo purposes and does NOT scale to large data sets. We do
640 | # not use Dataset.from_generator() because that uses tf.py_func which is
641 | # not TPU compatible. The right way to load data is with TFRecordReader.
642 | d = tf.data.Dataset.from_tensor_slices({
643 | "input_ids":
644 | tf.constant(
645 | all_input_ids, shape=[num_examples, seq_length],
646 | dtype=tf.int32),
647 | "input_mask":
648 | tf.constant(
649 | all_input_mask,
650 | shape=[num_examples, seq_length],
651 | dtype=tf.int32),
652 | "segment_ids":
653 | tf.constant(
654 | all_segment_ids,
655 | shape=[num_examples, seq_length],
656 | dtype=tf.int32),
657 | "label_ids":
658 | tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),
659 | })
660 |
661 | if is_training:
662 | d = d.repeat()
663 | d = d.shuffle(buffer_size=100)
664 |
665 | d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder)
666 | return d
667 |
668 | return input_fn
669 |
670 |
671 | # This function is not used by this file but is still used by the Colab and
672 | # people who depend on it.
673 | def convert_examples_to_features(examples, label_list, max_seq_length,
674 | tokenizer, task_name):
675 | """Convert a set of `InputExample`s to a list of `InputFeatures`."""
676 |
677 | features = []
678 | print('Length of examples:',len(examples))
679 | for (ex_index, example) in enumerate(examples):
680 | if ex_index % 10000 == 0:
681 | #tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
682 | print("Writing example %d of %d" % (ex_index, len(examples)))
683 | feature = convert_single_example(ex_index, example, label_list,
684 | max_seq_length, tokenizer, task_name)
685 |
686 | features.append(feature)
687 | return features
688 |
689 |
690 | # Load parameters
691 | max_seq_length = hp.sequence_length
692 | do_lower_case = hp.do_lower_case
693 | vocab_file = hp.vocab_file
694 | tokenizer = tokenization.FullTokenizer.from_scratch(vocab_file=vocab_file,
695 | do_lower_case=do_lower_case,
696 | spm_model_file=None)
697 | processor = ClassifyProcessor()
698 | label_list = processor.get_labels()
699 |
700 |
701 | def get_features():
702 | # Load train data
703 | train_examples = processor.get_train_examples(hp.data_dir)
704 | # Get onehot feature
705 | features = convert_examples_to_features( train_examples, label_list, max_seq_length, tokenizer,task_name='classify')
706 | input_ids = [f.input_ids for f in features]
707 | input_masks = [f.input_mask for f in features]
708 | segment_ids = [f.segment_ids for f in features]
709 | label_ids = [f.label_id for f in features]
710 | print('Get features finished!')
711 | return input_ids,input_masks,segment_ids,label_ids
712 |
713 | def get_features_test():
714 | # Load test data
715 | train_examples = processor.get_test_examples(hp.data_dir)
716 | # Get onehot feature
717 | features = convert_examples_to_features( train_examples, label_list, max_seq_length, tokenizer,task_name='classify_test')
718 | input_ids = [f.input_ids for f in features]
719 | input_masks = [f.input_mask for f in features]
720 | segment_ids = [f.segment_ids for f in features]
721 | label_ids = [f.label_id for f in features]
722 | print('Get features(test) finished!')
723 | return input_ids,input_masks,segment_ids,label_ids
724 |
725 |
726 | def create_example(line,set_type):
727 | """Creates examples for the training and dev sets."""
728 | guid = "%s-%s" % (set_type, 1)
729 | text_a = tokenization.convert_to_unicode(line[1])
730 | label = tokenization.convert_to_unicode(line[0])
731 | example = InputExample(guid=guid, text_a=text_a, text_b=None, label=label)
732 | return example
733 |
734 |
735 | def get_feature_test(sentence):
736 | example = create_example(['0',sentence],'test')
737 | feature = convert_single_example(0, example, label_list,max_seq_length, tokenizer,task_name='classify')
738 | return feature.input_ids,feature.input_mask,feature.segment_ids,feature.label_id
739 |
740 |
741 | if __name__ == '__main__':
742 | # Get feature
743 | sentence = '天天向上'
744 | feature = get_feature_test(sentence)
745 | print('feature.input_ids',feature[0])
746 | print('feature.input_mask',feature[1])
747 | print('feature.segment_ids',feature[2])
748 | print('feature.label_id',feature[3])
749 |
750 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert/data/README.md:
--------------------------------------------------------------------------------
1 | # Dataset
2 | - data/sa_train.csv
3 | - data/sa_test.csv
4 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert/data/sa_test.csv:
--------------------------------------------------------------------------------
1 | content,label
2 | 新配色喜欢,1
3 | 一星也不想给,-1
4 | 我就想说上个苹果X我媳妇儿用了差不多两年,0
5 | 大部分人还是被iso的流畅度吸引,1
6 | 一共买了7台,0
7 | 全金属身、超薄机身,1
8 | 还是值得了,1
9 | 紫色真的很好看,1
10 | 下次介绍朋友买,1
11 | 四个档位的水雾都拍了一张图,0
12 | 外观我觉得还可以,1
13 | 而且快递竟然打电话说下午不送了,-1
14 | 比不多一个晚上,0
15 | 这说明原创上电量使用是吹了牛+的,-1
16 | 放客厅非常合适,1
17 | 可惜没那么多钱了,0
18 | 因为没有防抖,-1
19 | 一个办用,0
20 | 电陶炉不挑锅具真是太方便啦,1
21 | 安装师傅也很尽心尽力,1
22 | 但是就是等得太久了,-1
23 | 网络速度快了很多,1
24 | 非标快哦,1
25 | 目前体验了下,0
26 | 不带音频,0
27 | 让自己想办法,0
28 | 但是时间长了就好了,0
29 | 客服人员服务态度也很好,1
30 | 买来给老父亲的,0
31 | 简直就是卡死,-1
32 | 最烦的就是这个充电头了,-1
33 | 女生也能单手操作,1
34 | 总是乱来,-1
35 | 国际品牌,0
36 | 虽然四个滤芯儿是送的,0
37 | 商品跟原装的一样灵活,1
38 | 安装师傅态度还不错.第二天打了客服电话后,1
39 | 这次给女儿新房又买一款,1
40 | 家里基本上都是小米的产品,1
41 | 两个人用完全没问题,1
42 | 超级超级好,1
43 | 在4米范围内小声说话都能收到音,1
44 | 水质过滤后改善了好多,1
45 | 都是屏幕和外框接口有缝隙,-1
46 | 开始想买康佳用的是LG萍屏,0
47 | 说月底盘点月初发,0
48 | 70寸还带主机音响,1
49 | 主要功能用于看时间跟测各种健康指标,1
50 | 布料也比较结实,1
51 | 但是整体体验很好,1
52 | 可以洗好多碗具,1
53 | 看着就非常的爽,1
54 | 摁一下自动匹配,0
55 | 才想起来没有评论,0
56 | 手感较好,1
57 | 电量也只用了一格,1
58 | 用过两个了,0
59 | 你们自己慢慢品,0
60 | 整体的效果不错,1
61 | 就像地摊20块钱以下的东西,-1
62 | 水容器大,1
63 | 讲真的以后真的不想再买苹果了,-1
64 | 外形外观:非常的好看,1
65 | 屏幕音效:质还是不错的,1
66 | 然后才发现原来的手机是移动定制机,0
67 | 就直接放在保安室,0
68 | 外观在能接受的范围,1
69 | 电池续航能力也不错,1
70 | 噪音比以前的小太多,1
71 | 价格贵点,-1
72 | 老公也说好,1
73 | 再清澈一些就好了,0
74 | 然后快递师傅还挺好的,1
75 | 一路给我打电话送到家的,1
76 | 适合小设计师们,1
77 | 用了空气泡沫包装,0
78 | 送给对象很喜欢,1
79 | 这款Lenovo,0
80 | 只当给自己起心理安慰吧,1
81 | 这两天热的时候还能降降温,1
82 | 我家阳台小尺寸刚刚好,1
83 | 送个老公的生日礼物,0
84 | 虽说不是JD自己的物流(日日顺快递公司),0
85 | 屏幕音效:l很好的,1
86 | 垃圾的东西,-1
87 | 唯一的缺点应该是拍照和想象中的差点,-1
88 | 商家送的备用PP棉芯很好,1
89 | 望继续做好售后服务,0
90 | 还好客服售后耐心与我沟通消除误会,1
91 | 荣耀品质做工越来越好了,1
92 | 我买的是86寸的电视,0
93 | 国产机越来越好了,1
94 | 使用起来相当流畅,1
95 | 京东买的有保障,1
96 | 智能反应超级迟钝,-1
97 | 用起来键盘还不错,1
98 | 居然还是坏的,-1
99 | 甚至可以用评价高达模型的角度来评价他,0
100 | 给远程安装,0
101 | 同时存电话号码时拼音输不全,-1
--------------------------------------------------------------------------------
/sentiment_analysis_albert/data/sa_train.csv:
--------------------------------------------------------------------------------
1 | content,label
2 | 超划算的,1
3 | iPhone的性价比之王,1
4 | 一套不同国家电源转换插头,-1
5 | 由于安装师傅上门安装时,0
6 | 55英寸的这个创维,0
7 | 我更喜欢那个的感觉,1
8 | 拿到手机第二天莫名其妙保护膜就裂了,-1
9 | 全部试了一遍都不行,-1
10 | 送的装备也很齐全,1
11 | 原来买的是内存32G的我在用好用,1
12 | 因为修剪器充电口处有一凸起,-1
13 | 提升幸福感必备,1
14 | 9号送人,0
15 | 紫色骚气,1
16 | 搭配手持吸尘器,0
17 | 包装的很精细,1
18 | 买了好几个华为手机了,1
19 | 当时还一下子买了3台,0
20 | 很喜欢这个牌子的啊,1
21 | 运行速度:除了反应慢,-1
22 | 关键是买的别的加湿器都是水干自动断电,-1
23 | 我只想问你们一个电脑利润就这么高,-1
24 | 用用还可以暂时没有发现问题,1
25 | 在京东购物总是让我高兴,1
26 | 这个门卡使用非常灵敏、方便,1
27 | 包装充分完好,1
28 | 华为的路由器好,1
29 | 希望以后也可以的屏幕跟音效都不错的,1
30 | 无论是正面还是侧面,0
31 | 高大上的东西,1
32 | 我的收据没帮开过来,-1
33 | 外形外观:和x一样,0
34 | 商家能积极处理客户碰到的问题,1
35 | 确实不会买苹果的手机,-1
36 | 目前使用一切正常,1
37 | 唯一觉得不足的就是app,-1
38 | 其他特色:现在我京东的速度很快,1
39 | 直接不回,-1
40 | 两个不足:,-1
41 | 很好支持国货,1
42 | 从内到外都很棒,1
43 | 的质量很好,1
44 | 待机时间:能从早用到晚,1
45 | 待机时间:续航是个鸡肋,-1
46 | 看看没动手,0
47 | 说不能直接放在地板上,0
48 | 后续体验再来追评,0
49 | 而且我还是联通卡,0
50 | 我们生活使用日常杠杠滴,1
51 | 充电慢费电快,-1
52 | 自从知道评论之后京豆可以抵现金了,0
53 | 运行速度:效果很不错,1
54 | 中午送到马上装好,0
55 | 然后Windows10和以前的系统有点不同,0
56 | 到的也快,1
57 | 感觉没有我的5T好用,-1
58 | 相信京东没问题,1
59 | 晚上睡觉开起好睡多了,1
60 | 华为是首选,1
61 | 还好可以,1
62 | 适合单身租房的,1
63 | 但是只给5月24以后买的保价,0
64 | 昨晚定,0
65 | 电池比较耐用、不再担心一天一次冲,1
66 | 已经离不开戴森了,1
67 | 物流速度很快电脑到了没有任何瑕疵,1
68 | 感觉音质很棒,1
69 | 待机时间没使用的情况下4天左右可以的,0
70 | 拍照效果:基本不用,0
71 | 这次活动价,0
72 | 东西用着很好,1
73 | 好不好用还没试,0
74 | 后悔换苹果了,-1
75 | 看起来挺不错,1
76 | 以后还找你购买,1
77 | 总算是没有让人失望,1
78 | 待机时间:全是很久,1
79 | 和v20犹豫了一番,0
80 | 扫拖一起干,1
81 | 这次买净水器滤芯也是很坚决,0
82 | 买的也都是256G,0
83 | 反复装卡重启都不行,-1
84 | 我觉得有点厚,-1
85 | 结果京东物流速度很快,1
86 | 公司常年采购,0
87 | 到货速度稍慢些,-1
88 | 宝宝们都很喜欢,1
89 | 外观看上去赞,1
90 | 这个电水壶确实好用,1
91 | 物流又给力,1
92 | 在门上有档次解锁识别速度,1
93 | 商家非常诚信,1
94 | 半个月过去了,0
95 | 容量大小:3口之家正合适,1
96 | 而且京东快递小姐姐服务还好,1
97 | 平时折叠放着特别的小,0
98 | 屏幕音效:屏幕a屏,0
99 | 由于是双十一买的,0
100 | 也优惠不了几块钱,-1
101 | 打400电话也不给安装,-1
102 | 电脑性价比可以,1
103 | 送货的师傅服务超好,1
104 | 拍照效果:拍照效果就是下面那个特别真实清晰,1
105 | 这次装修先预埋,0
106 | 这款颜色漂亮,1
107 | 我严重怀疑他们发的是翻新机,-1
108 | 一个人用大小刚好,1
109 | 二档稍微有点声音,0
110 | 比如刷,0
111 | 手机一直忘了来评价,0
112 | 好看的不得了,1
113 | 而且贵的壶和便宜的区别不大,-1
114 | 用了很长时间才来评论的,0
115 | 但是你要是戴上眼镜就不能识别了,-1
116 | 尺度也够用,1
117 | 好在很清晰,1
118 | 节省了地方,1
119 | 妈妈换手机很方便,1
120 | 运行速度:运行速度比其它手机更优秀,1
121 | 师傅送货也快,1
122 | 就拍了,0
123 | 自己买配件升级拖地,0
124 | 另外充电器有点大不太方便携带,-1
125 | 指示灯都是准确完好,1
126 | 外形外观:高端大气配置较高,1
127 | 非常失望的一次购物,-1
128 | 东西真心一般,0
129 | 还赠送了手机卡,1
130 | 忘记拍照就不上图了,0
131 | 事很多,0
132 | 比L580画面清晰,1
133 | 音质很好画质也很清晰,1
134 | 极具个性,1
135 | 作为入伙礼物买的,0
136 | 再到指纹锁的使用,0
137 | 售前售后都好,1
138 | 热情服务完,1
139 | 产品吸尘很好,1
140 | 这个价格买的很值得,1
141 | 这次买的电器很满意,1
142 | 和我同事华为的比快多了,1
143 | 以后买手机还来这买,1
144 | 中度使用也能一天,1
145 | 外形外观:手机,0
146 | 最最重要的是拿到手已经贴了原装手机屏幕膜,1
147 | 金属后盖,0
148 | 855puls稳得一批,1
149 | 办公用挺好,1
150 | 买来运动计步用的,0
151 | 扫起来很干净,1
152 | 帮朋友公司购买,0
153 | 却是没有水雾,-1
154 | 这款ⅵvo手机配骁龙855芯片,0
155 | 这个试了,0
156 | 主要客服也非常好哦,1
157 | 没想到千元机也这么强大,1
158 | 老婆很喜欢大小正合适,1
159 | 肖正友,0
160 | 发票到了,0
161 | 商家才有这样的底气,0
162 | 运行速度:麒麟980处理器运行速度很快,1
163 | 拍照质量真的超赞,1
164 | 噪音大小:还好,1
165 | 听着没啥感觉唉,0
166 | 来了就直接点亮了,0
167 | 相机算是对得起价格了,1
168 | 大小正好合适,1
169 | 沾了点水,0
170 | 就是插座这总插拔会不会坏,0
171 | 1.照相,0
172 | 卖家检测没问题,1
173 | 还一百多,0
174 | 诱光灯的效果相当不错,1
175 | 开关门声音太大了,-1
176 | 需要找人安装,0
177 | 颜值俱佳,1
178 | 装的时候就开了两三个小时还凉的挺快,1
179 | 所以硬盘都得多分点,0
180 | 值得回购哟,1
181 | 智能锁之初体验,0
182 | w这个配件还可以,1
183 | 然后还是拆开看了下,0
184 | 今年夏天特别的热,0
185 | 希望质量一如既往的好,1
186 | 购买之前对比了很多款式和牌子的电视,0
187 | 小孩子爱看动画片,1
188 | 其他的话有待日后确认,0
189 | 送给长辈用的,0
190 | 不过找遍了京东,0
191 | 好的卖家卖的宝贝,1
192 | 准备洗了,0
193 | 电视50寸以为没多大,0
194 | 600多的加湿器,0
195 | 搞得我水漏了一地,-1
196 | 一会要这样一会要那样,-1
197 | 五六个了,0
198 | 不知道是几手的,0
199 | 至少做工很不错,1
200 | 得亏有它,1
201 | 和电影里一样,0
202 | 新的刚到手就坏了你敢信,-1
203 | 基本每次充电不到半小时就能满,1
204 | 提示无网络,-1
205 | 东西巳收到,0
206 | 运行速度:暂时用起来很快,1
207 | 这两天主要试机,0
208 | 但是在京东买了这个给父亲用,0
209 | 但是图像处理能力很好,1
210 | 到今天一个月了,0
211 | 不要太挑剔,0
212 | 没其他毛病,1
213 | 解压压缩包就很热,-1
214 | 售后告诉我夏天要开6档,0
215 | 不过就是出雾好小,-1
216 | 最重要的是价格优惠,1
217 | 好说歹说收下了,0
218 | 充电二十多分钟就能充满,1
219 | 客服售后态度极差,-1
220 | 用了一天了没有一点水遗留在桌子上,1
221 | 没有蒸菜层的,-1
222 | 小不足,-1
223 | 耳机总是自己断开,-1
224 | 应该不影响使用,1
225 | 手机颜色跟图片有点不同,-1
226 | 自己找,0
227 | 值得购买大品牌,1
228 | 如果性价比可以再高一点的话就好啦,0
229 | 用十年是没问题的,1
230 | 按网上教程试了半天u盘里的都不识别,-1
231 | 不再断线了、超爽,1
232 | 估计我得抱着它睡觉,0
233 | 这次基础版入手,0
234 | 送的屏幕膜,1
235 | 在反复开关机之后才激活成功,-1
236 | 因为比实体店会便宜实惠好多,1
237 | 还没用就先恢复出厂设置了,-1
238 | 体验下,0
239 | 红薯粉色小鳄鱼竿支架子鼓浪屿啊啊啊啊五环之歌名,0
240 | 但是客服连最基本的东西都不懂,-1
241 | 雾气刚开始很小,-1
242 | 其他特色:就是原装充电器充电很慢,-1
243 | 华为10plus,0
244 | 今天安上了,0
245 | 下次再试试看,0
246 | 加水量并不多,1
247 | 说是可以无理由退换,1
248 | 总结:不推荐此型号,-1
249 | 不知道怎么样用段时间在看看,0
250 | 未使用的情况下,0
251 | 还好是练体育的,0
252 | 但是运行速度还行,1
253 | 问了又说去那边问,-1
254 | 也买了一台,0
255 | 收到宝贝特意用了一段时间过来评价,0
256 | 安装的师傅也很尽心,1
257 | 想不想买也就自己看的办吧,0
258 | 安装起来也很简单,1
259 | 还送整机10年保修,1
260 | 其他特色:拍照效果好,1
261 | 放两条就洗不干净了,-1
262 | 也不厚,0
263 | 对用户使用很友好,1
264 | 物流服务都超好,1
265 | 就是最简单的一种,1
266 | 比想象中的要小的多,-1
267 | 没想到这么大声音,-1
268 | 电视语音识别度很高在厨房都可以听清楚你要说的话,1
269 | 但是还是稍微重了一些,-1
270 | 和商家描述得一致,1
271 | 手机外观轻薄款音效非常好,1
272 | 厂家直接给我换一个,1
273 | 放心了不少,1
274 | 这个还没换呢,0
275 | 装上后那水立刻变得清澈透明,1
276 | 质量问题,-1
277 | 可以无线打印,1
278 | 比较热情,1
279 | 这种电器之类的,0
280 | 安装师傅的服务态度非常好,1
281 | 质量也满意,1
282 | 按照他们的方法,0
283 | 今天最后一局还剩百分之11的电我又开了一局,0
284 | 很好正品,1
285 | 亲测待机量30+,0
286 | 但是想起这个屏幕,0
287 | 沉浸式音乐感受&rdquo,1
288 | 京东大家电价格便宜,1
289 | 然后色彩还原度也比较好,1
290 | 1-2秒就能识别,1
291 | 明明是想买路由器的,0
292 | 能带大耳机,0
293 | 商品没啥,0
294 | 出差、旅行携带方便,1
295 | 声明快递员很给力,1
296 | 现在的快递都是下楼自己拿的,0
297 | 还没有我九块九三张的好,-1
298 | 特意把儿子煲好的IE60拿了试了下,0
299 | 快递很差,-1
300 | 为健康观影保驾护航,0
301 | 外形外观:白色的加点彩虹一样的颜色,0
--------------------------------------------------------------------------------
/sentiment_analysis_albert/hyperparameters.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon Nov 12 14:23:12 2018
4 |
5 | @author: cm
6 | """
7 |
8 |
9 | import os
10 | import sys
11 | pwd = os.path.dirname(os.path.abspath(__file__))
12 | sys.path.append(pwd)
13 |
14 |
15 | class Hyperparamters:
16 | # Train parameters
17 | num_train_epochs = 5
18 | print_step = 10
19 | batch_size = 64
20 | batch_size_eval = 128
21 | summary_step = 10
22 | num_saved_per_epoch = 3
23 | max_to_keep = 100
24 |
25 | # Model paths
26 | logdir = 'logdir/model_01'
27 | file_save_model = 'model/model_save'
28 | file_load_model = 'model/model_load'
29 |
30 | # Train data and test data
31 | train_data = "sa_train.csv"
32 | test_data = "sa_test.csv"
33 |
34 | # Optimization parameters
35 | warmup_proportion = 0.1
36 | use_tpu = None
37 | do_lower_case = True
38 | learning_rate = 5e-5
39 |
40 | # TextCNN parameters
41 | num_filters = 128
42 | filter_sizes = [2,3,4,5,6,7]
43 | embedding_size = 384
44 | keep_prob = 0.5
45 |
46 | # Sequence and Label
47 | sequence_length = 60
48 | num_labels = 3
49 |     dict_label = {   # class index -> original label in the CSV
50 |         '0': '-1',   # index 0 -> negative
51 |         '1': '0',    # index 1 -> neutral
52 |         '2': '1'}    # index 2 -> positive
53 |
54 | # ALBERT parameters
55 | name = 'albert_small_zh_google'
56 | bert_path = os.path.join(pwd,name)
57 | data_dir = os.path.join(pwd,'data')
58 | vocab_file = os.path.join(pwd,name,'vocab_chinese.txt')
59 | init_checkpoint = os.path.join(pwd,name,'albert_model.ckpt')
60 | saved_model_path = os.path.join(pwd,'model')
61 |
62 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert/image/model.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/sentiment_analysis_albert/image/model.png
--------------------------------------------------------------------------------
/sentiment_analysis_albert/lamb_optimizer.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Team Authors.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | # Lint as: python2, python3
16 | """Functions and classes related to optimization (weight updates)."""
17 |
18 | from __future__ import absolute_import
19 | from __future__ import division
20 | from __future__ import print_function
21 |
22 | import re
23 | import six
24 | import tensorflow.compat.v1 as tf
25 |
26 | # pylint: disable=g-direct-tensorflow-import
27 | from tensorflow.python.ops import array_ops
28 | from tensorflow.python.ops import linalg_ops
29 | from tensorflow.python.ops import math_ops
30 | # pylint: enable=g-direct-tensorflow-import
31 |
32 |
33 | class LAMBOptimizer(tf.train.Optimizer):
34 | """LAMB (Layer-wise Adaptive Moments optimizer for Batch training)."""
35 | # A new optimizer that includes correct L2 weight decay, adaptive
36 | # element-wise updating, and layer-wise justification. The LAMB optimizer
37 | # was proposed by Yang You, Jing Li, Jonathan Hseu, Xiaodan Song,
38 | # James Demmel, and Cho-Jui Hsieh in a paper titled as Reducing BERT
39 | # Pre-Training Time from 3 Days to 76 Minutes (arxiv.org/abs/1904.00962)
40 |
41 | def __init__(self,
42 | learning_rate,
43 | weight_decay_rate=0.0,
44 | beta_1=0.9,
45 | beta_2=0.999,
46 | epsilon=1e-6,
47 | exclude_from_weight_decay=None,
48 | exclude_from_layer_adaptation=None,
49 | name="LAMBOptimizer"):
50 | """Constructs a LAMBOptimizer."""
51 | super(LAMBOptimizer, self).__init__(False, name)
52 |
53 | self.learning_rate = learning_rate
54 | self.weight_decay_rate = weight_decay_rate
55 | self.beta_1 = beta_1
56 | self.beta_2 = beta_2
57 | self.epsilon = epsilon
58 | self.exclude_from_weight_decay = exclude_from_weight_decay
59 | # exclude_from_layer_adaptation is set to exclude_from_weight_decay if the
60 | # arg is None.
61 | # TODO(jingli): validate if exclude_from_layer_adaptation is necessary.
62 | if exclude_from_layer_adaptation:
63 | self.exclude_from_layer_adaptation = exclude_from_layer_adaptation
64 | else:
65 | self.exclude_from_layer_adaptation = exclude_from_weight_decay
66 |
67 | def apply_gradients(self, grads_and_vars, global_step=None, name=None):
68 | """See base class."""
69 | assignments = []
70 | for (grad, param) in grads_and_vars:
71 | if grad is None or param is None:
72 | continue
73 |
74 | param_name = self._get_variable_name(param.name)
75 |
76 | m = tf.get_variable(
77 | name=six.ensure_str(param_name) + "/adam_m",
78 | shape=param.shape.as_list(),
79 | dtype=tf.float32,
80 | trainable=False,
81 | initializer=tf.zeros_initializer())
82 | v = tf.get_variable(
83 | name=six.ensure_str(param_name) + "/adam_v",
84 | shape=param.shape.as_list(),
85 | dtype=tf.float32,
86 | trainable=False,
87 | initializer=tf.zeros_initializer())
88 |
89 | # Standard Adam update.
90 | next_m = (
91 | tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
92 | next_v = (
93 | tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
94 | tf.square(grad)))
95 |
96 | update = next_m / (tf.sqrt(next_v) + self.epsilon)
97 |
98 | # Just adding the square of the weights to the loss function is *not*
99 | # the correct way of using L2 regularization/weight decay with Adam,
100 | # since that will interact with the m and v parameters in strange ways.
101 | #
102 |       # Instead we want to decay the weights in a manner that doesn't interact
103 | # with the m/v parameters. This is equivalent to adding the square
104 | # of the weights to the loss with plain (non-momentum) SGD.
105 | if self._do_use_weight_decay(param_name):
106 | update += self.weight_decay_rate * param
107 |
108 | ratio = 1.0
109 | if self._do_layer_adaptation(param_name):
110 | w_norm = linalg_ops.norm(param, ord=2)
111 | g_norm = linalg_ops.norm(update, ord=2)
112 | ratio = array_ops.where(math_ops.greater(w_norm, 0), array_ops.where(
113 | math_ops.greater(g_norm, 0), (w_norm / g_norm), 1.0), 1.0)
114 |
115 | update_with_lr = ratio * self.learning_rate * update
116 |
117 | next_param = param - update_with_lr
118 |
119 | assignments.extend(
120 | [param.assign(next_param),
121 | m.assign(next_m),
122 | v.assign(next_v)])
123 | return tf.group(*assignments, name=name)
124 |
125 | def _do_use_weight_decay(self, param_name):
126 | """Whether to use L2 weight decay for `param_name`."""
127 | if not self.weight_decay_rate:
128 | return False
129 | if self.exclude_from_weight_decay:
130 | for r in self.exclude_from_weight_decay:
131 | if re.search(r, param_name) is not None:
132 | return False
133 | return True
134 |
135 | def _do_layer_adaptation(self, param_name):
136 | """Whether to do layer-wise learning rate adaptation for `param_name`."""
137 | if self.exclude_from_layer_adaptation:
138 | for r in self.exclude_from_layer_adaptation:
139 | if re.search(r, param_name) is not None:
140 | return False
141 | return True
142 |
143 | def _get_variable_name(self, param_name):
144 | """Get the variable name from the tensor name."""
145 | m = re.match("^(.*):\\d+$", six.ensure_str(param_name))
146 | if m is not None:
147 | param_name = m.group(1)
148 | return param_name
149 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert/logdir/model_01/events.out.tfevents.1592553634.DESKTOP-QC1A83I:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/sentiment_analysis_albert/logdir/model_01/events.out.tfevents.1592553634.DESKTOP-QC1A83I
--------------------------------------------------------------------------------
/sentiment_analysis_albert/logdir/model_01/events.out.tfevents.1592553671.DESKTOP-QC1A83I:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/sentiment_analysis_albert/logdir/model_01/events.out.tfevents.1592553671.DESKTOP-QC1A83I
--------------------------------------------------------------------------------
/sentiment_analysis_albert/model/model_load/README.md:
--------------------------------------------------------------------------------
1 | # Model used for inference
2 | - model_load/checkpoint
3 | - model_load/model_1_0.ckpt.data-00000-of-00001
4 | - model_load/model_1_0.ckpt.index
5 | - model_load/model_1_0.ckpt.meta
6 |
7 | **Contents of checkpoint**
8 | ```
9 | model_checkpoint_path: "model_1_0.ckpt"
10 | ```
11 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert/model/model_save/README.md:
--------------------------------------------------------------------------------
1 | # Models saved during training
2 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert/modules.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu May 30 21:01:45 2019
4 |
5 | @author: cm
6 | """
7 |
8 |
9 | import tensorflow as tf
10 | from tensorflow.contrib.rnn import DropoutWrapper
11 | from sentiment_analysis_albert.hyperparameters import Hyperparamters as hp
12 |
13 |
14 |
15 | def cell_textcnn(inputs,is_training):
16 | # Add a dimension in final shape
17 | inputs_expand = tf.expand_dims(inputs, -1)
18 | # Create a convolution + maxpool layer for each filter size
19 | pooled_outputs = []
20 | with tf.name_scope("TextCNN"):
21 | for i, filter_size in enumerate(hp.filter_sizes):
22 | with tf.name_scope("conv-maxpool-%s" % filter_size):
23 | # Convolution Layer
24 | filter_shape = [filter_size, hp.embedding_size, 1, hp.num_filters]
25 | W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1),dtype=tf.float32, name="W")
26 | b = tf.Variable(tf.constant(0.1, shape=[hp.num_filters]),dtype=tf.float32, name="b")
27 | conv = tf.nn.conv2d(
28 | inputs_expand,
29 | W,
30 | strides=[1, 1, 1, 1],
31 | padding="VALID",
32 | name="conv")
33 | # Apply nonlinearity
34 | h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
35 | # Maxpooling over the outputs
36 | pooled = tf.nn.max_pool(
37 | h,
38 | ksize=[1, hp.sequence_length - filter_size + 1, 1, 1],
39 | strides=[1, 1, 1, 1],
40 | padding='VALID',
41 | name="pool")
42 | pooled_outputs.append(pooled)
43 | # Combine all the pooled features
44 | num_filters_total = hp.num_filters * len(hp.filter_sizes)
45 | h_pool = tf.concat(pooled_outputs, 3)
46 | h_pool_flat = tf.reshape(h_pool, [-1, num_filters_total])
47 | # Dropout
48 | h_pool_flat_dropout = tf.nn.dropout(h_pool_flat, keep_prob=hp.keep_prob if is_training else 1)
49 | return h_pool_flat_dropout
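# Illustrative check, not part of the original file: with the default
# hyperparameters the concatenated pooled feature has width
# num_filters * len(filter_sizes) = 128 * 6 = 768, whatever the batch size.
if __name__ == '__main__':
    # TF 1.x placeholders, as used throughout this project.
    x = tf.placeholder(tf.float32, [None, hp.sequence_length, hp.embedding_size])
    y = cell_textcnn(x, is_training=False)
    print(y.shape)  # (?, 768)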
50 |
51 |
52 |
53 |
54 |
55 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert/networks.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu May 30 20:44:42 2019
4 |
5 | @author: cm
6 | """
7 |
8 | import os
9 | import tensorflow as tf
10 | from sentiment_analysis_albert import modeling,optimization
11 | from sentiment_analysis_albert.classifier_utils import ClassifyProcessor
12 | from sentiment_analysis_albert.modules import cell_textcnn
13 | from sentiment_analysis_albert.hyperparameters import Hyperparamters as hp
14 | from sentiment_analysis_albert.utils import time_now_string
15 |
16 |
17 | num_labels = hp.num_labels
18 | processor = ClassifyProcessor()
19 | bert_config_file = os.path.join(hp.bert_path,'albert_config.json')
20 | bert_config = modeling.AlbertConfig.from_json_file(bert_config_file)
21 |
22 |
23 | def count_model_params():
24 | """
25 |     Count the number of trainable parameters
26 | """
27 | total_parameters = 0
28 | for variable in tf.trainable_variables():
29 | shape = variable.get_shape()
30 | variable_parameters = 1
31 | for dim in shape:
32 | variable_parameters *= dim.value
33 | total_parameters += variable_parameters
34 | print(' + Number of params: %.2fM' % (total_parameters / 1e6))
35 |
36 |
37 | class NetworkAlbert(object):
38 | def __init__(self,is_training):
39 | self.is_training = is_training
40 | self.input_ids = tf.placeholder(tf.int32, shape=[None, hp.sequence_length], name='input_ids')
41 | self.input_masks = tf.placeholder(tf.int32, shape=[None, hp.sequence_length], name='input_masks')
42 | self.segment_ids = tf.placeholder(tf.int32, shape=[None, hp.sequence_length], name='segment_ids')
43 | self.label_ids = tf.placeholder(tf.int32, shape=[None], name='label_ids')
44 | # Load BERT Pre-training LM
45 | self.model = modeling.AlbertModel(
46 | config=bert_config,
47 | is_training=self.is_training,
48 | input_ids=self.input_ids,
49 | input_mask=self.input_masks,
50 | token_type_ids=self.segment_ids,
51 | use_one_hot_embeddings=False)
52 |
53 |         # Get the 3-D feature tensor of shape (batch_size, sequence_length, hidden_size)
54 | output_layer_init = self.model.get_sequence_output()
55 | # Cell textcnn
56 | output_layer = cell_textcnn(output_layer_init,self.is_training)
57 | # Hidden size
58 | hidden_size = output_layer.shape[-1].value
59 | # Dense
60 | with tf.name_scope("Full-connection"):
61 | output_weights = tf.get_variable(
62 | "output_weights", [num_labels, hidden_size],
63 | initializer=tf.truncated_normal_initializer(stddev=0.02))
64 |
65 | output_bias = tf.get_variable(
66 | "output_bias", [num_labels], initializer=tf.zeros_initializer())
67 | # Logit
68 | logits = tf.matmul(output_layer, output_weights, transpose_b=True)
69 | self.logits = tf.nn.bias_add(logits, output_bias)
70 | self.probabilities = tf.nn.softmax(self.logits, axis=-1)
71 | # Prediction
72 | with tf.variable_scope("Prediction"):
73 | self.preds = tf.argmax(self.logits, axis=-1, output_type=tf.int32)
74 | # Summary for tensorboard
75 | with tf.variable_scope("Loss"):
76 | if self.is_training:
77 | self.accuracy = tf.reduce_mean(tf.to_float(tf.equal(self.preds, self.label_ids)))
78 | tf.summary.scalar('Accuracy', self.accuracy)
79 |
80 |         # Check whether a previously saved checkpoint exists
81 | ckpt = tf.train.get_checkpoint_state(hp.saved_model_path)
82 | checkpoint_suffix = ".index"
83 | if ckpt and tf.gfile.Exists(ckpt.model_checkpoint_path + checkpoint_suffix):
84 | print('='*10,'Restoring model from checkpoint!','='*10)
85 | print("%s - Restoring model from checkpoint ~%s" % (time_now_string(),
86 | ckpt.model_checkpoint_path))
87 | else:
88 | # Initialize from the pre-trained ALBERT LM
89 | print('='*10,'Loading the pre-trained ALBERT model for the first time!','='*10)
90 | tvars = tf.trainable_variables()
91 | if hp.init_checkpoint:
92 | (assignment_map, initialized_variable_names) = \
93 | modeling.get_assignment_map_from_checkpoint(tvars,
94 | hp.init_checkpoint)
95 | tf.train.init_from_checkpoint(hp.init_checkpoint, assignment_map)
96 |
97 | # Optimization
98 | if self.is_training:
99 | # Global_step
100 | self.global_step = tf.Variable(0, name='global_step', trainable=False)
101 | # Loss
102 | log_probs = tf.nn.log_softmax(self.logits, axis=-1)
103 | one_hot_labels = tf.one_hot(self.label_ids, depth=num_labels, dtype=tf.float32)
104 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
105 | self.loss = tf.reduce_mean(per_example_loss)
106 | # Optimizer
107 | train_examples = processor.get_train_examples(hp.data_dir)
108 | num_train_steps = int(
109 | len(train_examples) / hp.batch_size * hp.num_train_epochs)
110 | num_warmup_steps = int(num_train_steps * hp.warmup_proportion)
111 | self.optimizer = optimization.create_optimizer(self.loss,
112 | hp.learning_rate,
113 | num_train_steps,
114 | num_warmup_steps,
115 | hp.use_tpu,
116 | Global_step=self.global_step,
117 | )
118 | # Summary for tensorboard
119 | tf.summary.scalar('loss', self.loss)
120 | self.merged = tf.summary.merge_all()
121 |
122 | # Compute the parameters
123 | count_model_params()
124 | vs = tf.trainable_variables()
125 | for l in vs:
126 | print(l)
127 |
128 |
129 |
130 |
131 | if __name__ == '__main__':
132 | # Load model
133 | albert = NetworkAlbert(is_training=True)
134 |
135 |
136 |
--------------------------------------------------------------------------------
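
The Loss block of NetworkAlbert above is plain softmax cross-entropy over one-hot labels. The NumPy sketch below reproduces the same per-example loss for a toy batch; it is for reference only and does not replace the TensorFlow graph code.

```
import numpy as np

def softmax_cross_entropy(logits, label_ids, num_labels=3):
    # log-softmax, stabilised by subtracting the row-wise max
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    one_hot = np.eye(num_labels)[label_ids]                 # (batch, num_labels)
    per_example_loss = -(one_hot * log_probs).sum(axis=-1)  # same formula as in networks.py
    return per_example_loss.mean()

logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = np.array([0, 2])
print(softmax_cross_entropy(logits, labels))                # small loss: both rows predict the true class
```
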
/sentiment_analysis_albert/optimization.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Team Authors.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | # Lint as: python2, python3
16 | """Functions and classes related to optimization (weight updates)."""
17 |
18 | from __future__ import absolute_import
19 | from __future__ import division
20 | from __future__ import print_function
21 | import re
22 | import lamb_optimizer
23 | import six
24 | from six.moves import zip
25 | import tensorflow.compat.v1 as tf
26 | from tensorflow.contrib import tpu as contrib_tpu
27 |
28 |
29 | def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu,Global_step,
30 | optimizer="adamw", poly_power=1.0, start_warmup_step=0):
31 |
32 | """Creates an optimizer training op."""
33 | if Global_step:
34 | global_step = Global_step
35 | else:
36 | global_step = tf.train.get_or_create_global_step()
37 |
38 | learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)
39 |
40 | # Implements linear decay of the learning rate.
41 | learning_rate = tf.train.polynomial_decay(
42 | learning_rate,
43 | global_step,
44 | num_train_steps,
45 | end_learning_rate=0.0,
46 | power=poly_power,
47 | cycle=False)
48 |
49 | # Implements linear warmup. I.e., if global_step - start_warmup_step <
50 | # num_warmup_steps, the learning rate will be
51 | # `(global_step - start_warmup_step)/num_warmup_steps * init_lr`.
52 | if num_warmup_steps:
53 | tf.logging.info("++++++ warmup starts at step " + str(start_warmup_step)
54 | + ", for " + str(num_warmup_steps) + " steps ++++++")
55 | global_steps_int = tf.cast(global_step, tf.int32)
56 | start_warm_int = tf.constant(start_warmup_step, dtype=tf.int32)
57 | global_steps_int = global_steps_int - start_warm_int
58 | warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)
59 |
60 | global_steps_float = tf.cast(global_steps_int, tf.float32)
61 | warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)
62 |
63 | warmup_percent_done = global_steps_float / warmup_steps_float
64 | warmup_learning_rate = init_lr * warmup_percent_done
65 |
66 | is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
67 | learning_rate = (
68 | (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)
69 |
70 | # It is OK that you use this optimizer for finetuning, since this
71 | # is how the model was trained (note that the Adam m/v variables are NOT
72 | # loaded from init_checkpoint.)
73 | # It is OK to use AdamW in the finetuning even the model is trained by LAMB.
74 | # As reported in the BERT public GitHub, the learning rate for SQuAD 1.1 finetuning
75 | # is 3e-5, 4e-5 or 5e-5. For LAMB, users can use 3e-4, 4e-4, or 5e-4 for a
76 | # batch size of 64 in the finetune.
77 | if optimizer == "adamw":
78 | tf.logging.info("using adamw")
79 | optimizer = AdamWeightDecayOptimizer(
80 | learning_rate=learning_rate,
81 | weight_decay_rate=0.01,
82 | beta_1=0.9,
83 | beta_2=0.999,
84 | epsilon=1e-6,
85 | exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
86 | elif optimizer == "lamb":
87 | tf.logging.info("using lamb")
88 | optimizer = lamb_optimizer.LAMBOptimizer(
89 | learning_rate=learning_rate,
90 | weight_decay_rate=0.01,
91 | beta_1=0.9,
92 | beta_2=0.999,
93 | epsilon=1e-6,
94 | exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
95 | else:
96 | raise ValueError("Not supported optimizer: ", optimizer)
97 |
98 | if use_tpu:
99 | optimizer = contrib_tpu.CrossShardOptimizer(optimizer)
100 |
101 | tvars = tf.trainable_variables()
102 | grads = tf.gradients(loss, tvars)
103 |
104 | # This is how the model was pre-trained.
105 | (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
106 |
107 | train_op = optimizer.apply_gradients(
108 | list(zip(grads, tvars)), global_step=global_step)
109 |
110 | # Normally the global step update is done inside of `apply_gradients`.
111 | # However, neither `AdamWeightDecayOptimizer` nor `LAMBOptimizer` does this.
112 | # But if you use a different optimizer, you should probably take this line
113 | # out.
114 | new_global_step = global_step + 1
115 | train_op = tf.group(train_op, [global_step.assign(new_global_step)])
116 | return train_op
117 |
118 |
119 | class AdamWeightDecayOptimizer(tf.train.Optimizer):
120 | """A basic Adam optimizer that includes "correct" L2 weight decay."""
121 |
122 | def __init__(self,
123 | learning_rate,
124 | weight_decay_rate=0.0,
125 | beta_1=0.9,
126 | beta_2=0.999,
127 | epsilon=1e-6,
128 | exclude_from_weight_decay=None,
129 | name="AdamWeightDecayOptimizer"):
130 | """Constructs a AdamWeightDecayOptimizer."""
131 | super(AdamWeightDecayOptimizer, self).__init__(False, name)
132 |
133 | self.learning_rate = learning_rate
134 | self.weight_decay_rate = weight_decay_rate
135 | self.beta_1 = beta_1
136 | self.beta_2 = beta_2
137 | self.epsilon = epsilon
138 | self.exclude_from_weight_decay = exclude_from_weight_decay
139 |
140 | def apply_gradients(self, grads_and_vars, global_step=None, name=None):
141 | """See base class."""
142 | assignments = []
143 | for (grad, param) in grads_and_vars:
144 | if grad is None or param is None:
145 | continue
146 |
147 | param_name = self._get_variable_name(param.name)
148 |
149 | m = tf.get_variable(
150 | name=six.ensure_str(param_name) + "/adam_m",
151 | shape=param.shape.as_list(),
152 | dtype=tf.float32,
153 | trainable=False,
154 | initializer=tf.zeros_initializer())
155 | v = tf.get_variable(
156 | name=six.ensure_str(param_name) + "/adam_v",
157 | shape=param.shape.as_list(),
158 | dtype=tf.float32,
159 | trainable=False,
160 | initializer=tf.zeros_initializer())
161 |
162 | # Standard Adam update.
163 | next_m = (
164 | tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
165 | next_v = (
166 | tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
167 | tf.square(grad)))
168 |
169 | update = next_m / (tf.sqrt(next_v) + self.epsilon)
170 |
171 | # Just adding the square of the weights to the loss function is *not*
172 | # the correct way of using L2 regularization/weight decay with Adam,
173 | # since that will interact with the m and v parameters in strange ways.
174 | #
175 | # Instead we want to decay the weights in a manner that doesn't interact
176 | # with the m/v parameters. This is equivalent to adding the square
177 | # of the weights to the loss with plain (non-momentum) SGD.
178 | if self._do_use_weight_decay(param_name):
179 | update += self.weight_decay_rate * param
180 |
181 | update_with_lr = self.learning_rate * update
182 |
183 | next_param = param - update_with_lr
184 |
185 | assignments.extend(
186 | [param.assign(next_param),
187 | m.assign(next_m),
188 | v.assign(next_v)])
189 | return tf.group(*assignments, name=name)
190 |
191 | def _do_use_weight_decay(self, param_name):
192 | """Whether to use L2 weight decay for `param_name`."""
193 | if not self.weight_decay_rate:
194 | return False
195 | if self.exclude_from_weight_decay:
196 | for r in self.exclude_from_weight_decay:
197 | if re.search(r, param_name) is not None:
198 | return False
199 | return True
200 |
201 | def _get_variable_name(self, param_name):
202 | """Get the variable name from the tensor name."""
203 | m = re.match("^(.*):\\d+$", six.ensure_str(param_name))
204 | if m is not None:
205 | param_name = m.group(1)
206 | return param_name
207 |
--------------------------------------------------------------------------------
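
create_optimizer above builds a schedule of polynomial decay (linear when poly_power=1.0) combined with linear warmup. The pure-Python sketch below evaluates that schedule at a few steps so the shape of the curve is easy to see; the numbers passed in are illustrative, not the project defaults.

```
def learning_rate_at(step, init_lr, num_train_steps, num_warmup_steps,
                     poly_power=1.0, start_warmup_step=0):
    # Polynomial decay towards 0.0 over num_train_steps (linear when poly_power=1.0).
    decay_progress = min(step, num_train_steps) / float(num_train_steps)
    lr = init_lr * (1.0 - decay_progress) ** poly_power
    # Linear warmup: (step - start_warmup_step) / num_warmup_steps * init_lr.
    if num_warmup_steps and (step - start_warmup_step) < num_warmup_steps:
        lr = init_lr * (step - start_warmup_step) / float(num_warmup_steps)
    return lr

for s in (0, 50, 100, 500, 1000):
    print(s, learning_rate_at(s, init_lr=5e-5, num_train_steps=1000, num_warmup_steps=100))
```
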
/sentiment_analysis_albert/predict.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu May 30 17:12:37 2019
4 |
5 | @author: cm
6 | """
7 |
8 |
9 | import os
10 | pwd = os.path.dirname(os.path.abspath(__file__))
11 | #os.environ["CUDA_VISIBLE_DEVICES"] = '-1'
12 | import sys
13 | sys.path.append(os.path.dirname(os.path.dirname(__file__)))
14 | import tensorflow as tf
15 | from sentiment_analysis_albert.networks import NetworkAlbert
16 | from sentiment_analysis_albert.classifier_utils import get_feature_test
17 | from sentiment_analysis_albert.hyperparameters import Hyperparamters as hp
18 |
19 |
20 |
21 | class ModelAlbertTextCNN(object):
22 | """
23 | Load NetworkAlbert TextCNN model
24 | """
25 | def __init__(self):
26 | self.albert, self.sess = self.load_model()
27 | @staticmethod
28 | def load_model():
29 | with tf.Graph().as_default():
30 | sess = tf.Session()
31 | out_dir = os.path.join(pwd, "model")
32 | with sess.as_default():
33 | albert = NetworkAlbert(is_training=False)
34 | saver = tf.train.Saver()
35 | sess.run(tf.global_variables_initializer())
36 | checkpoint_dir = os.path.abspath(os.path.join(out_dir,hp.file_load_model))
37 | print (checkpoint_dir)
38 | ckpt = tf.train.get_checkpoint_state(checkpoint_dir)
39 | saver.restore(sess, ckpt.model_checkpoint_path)
40 | return albert,sess
41 |
42 | MODEL = ModelAlbertTextCNN()
43 | print('Load model finished!')
44 |
45 |
46 |
47 | def sa(sentence):
48 | """
49 | Prediction of the sentence's sentiment.
50 | """
51 | feature = get_feature_test(sentence)
52 | fd = {MODEL.albert.input_ids: [feature[0]],
53 | MODEL.albert.input_masks: [feature[1]],
54 | MODEL.albert.segment_ids:[feature[2]],
55 | }
56 | output = MODEL.sess.run(MODEL.albert.preds, feed_dict=fd)
57 | return output[0]-1
58 |
59 |
60 |
61 |
62 | if __name__ == '__main__':
63 | ##
64 | import time
65 | start = time.time()
66 | sent = '我喜欢这个地方'
67 | print(sa(sent))
68 | end = time.time()
69 | print(end-start)
70 |
71 |
72 |
73 |
74 |
75 |
76 |
77 |
--------------------------------------------------------------------------------
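
sa() above returns `output[0] - 1`, shifting the network's class ids {0, 1, 2} back to the 1 / 0 / -1 labelling used in the data files. A tiny sketch of that mapping follows; the LABELS dict and predict_label helper are illustrative and not part of the repository.

```
LABELS = {-1: 'negative', 0: 'neutral', 1: 'positive'}

def predict_label(class_id):
    """Map a raw class id in {0, 1, 2} to the data label in {-1, 0, 1}."""
    return class_id - 1

for cid in (0, 1, 2):
    print(cid, '->', predict_label(cid), LABELS[predict_label(cid)])
```
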
/sentiment_analysis_albert/requirements.txt:
--------------------------------------------------------------------------------
1 | tensorflow==1.15.0
2 | sentencepiece
3 | pandas
--------------------------------------------------------------------------------
/sentiment_analysis_albert/tokenization.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Team Authors.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | # Lint as: python2, python3
16 | # coding=utf-8
17 | """Tokenization classes."""
18 |
19 | from __future__ import absolute_import
20 | from __future__ import division
21 | from __future__ import print_function
22 |
23 | import collections
24 | import re
25 | import unicodedata
26 | import six
27 | from six.moves import range
28 | import tensorflow.compat.v1 as tf
29 | #import tensorflow_hub as hub
30 | import sentencepiece as spm
31 |
32 | SPIECE_UNDERLINE = u"▁".encode("utf-8")
33 |
34 |
35 | def validate_case_matches_checkpoint(do_lower_case, init_checkpoint):
36 | """Checks whether the casing config is consistent with the checkpoint name."""
37 |
38 | # The casing has to be passed in by the user and there is no explicit check
39 | # as to whether it matches the checkpoint. The casing information probably
40 | # should have been stored in the bert_config.json file, but it's not, so
41 | # we have to heuristically detect it to validate.
42 |
43 | if not init_checkpoint:
44 | return
45 |
46 | m = re.match("^.*?([A-Za-z0-9_-]+)/bert_model.ckpt",
47 | six.ensure_str(init_checkpoint))
48 | if m is None:
49 | return
50 |
51 | model_name = m.group(1)
52 |
53 | lower_models = [
54 | "uncased_L-24_H-1024_A-16", "uncased_L-12_H-768_A-12",
55 | "multilingual_L-12_H-768_A-12", "chinese_L-12_H-768_A-12"
56 | ]
57 |
58 | cased_models = [
59 | "cased_L-12_H-768_A-12", "cased_L-24_H-1024_A-16",
60 | "multi_cased_L-12_H-768_A-12"
61 | ]
62 |
63 | is_bad_config = False
64 | if model_name in lower_models and not do_lower_case:
65 | is_bad_config = True
66 | actual_flag = "False"
67 | case_name = "lowercased"
68 | opposite_flag = "True"
69 |
70 | if model_name in cased_models and do_lower_case:
71 | is_bad_config = True
72 | actual_flag = "True"
73 | case_name = "cased"
74 | opposite_flag = "False"
75 |
76 | if is_bad_config:
77 | raise ValueError(
78 | "You passed in `--do_lower_case=%s` with `--init_checkpoint=%s`. "
79 | "However, `%s` seems to be a %s model, so you "
80 | "should pass in `--do_lower_case=%s` so that the fine-tuning matches "
81 | "how the model was pre-training. If this error is wrong, please "
82 | "just comment out this check." % (actual_flag, init_checkpoint,
83 | model_name, case_name, opposite_flag))
84 |
85 |
86 | def preprocess_text(inputs, remove_space=True, lower=False):
87 | """preprocess data by removing extra space and normalize data."""
88 | outputs = inputs
89 | if remove_space:
90 | outputs = " ".join(inputs.strip().split())
91 |
92 | if six.PY2 and isinstance(outputs, str):
93 | try:
94 | outputs = six.ensure_text(outputs, "utf-8")
95 | except UnicodeDecodeError:
96 | outputs = six.ensure_text(outputs, "latin-1")
97 |
98 | outputs = unicodedata.normalize("NFKD", outputs)
99 | outputs = "".join([c for c in outputs if not unicodedata.combining(c)])
100 | if lower:
101 | outputs = outputs.lower()
102 |
103 | return outputs
104 |
105 |
106 | def encode_pieces(sp_model, text, return_unicode=True, sample=False):
107 | """turn sentences into word pieces."""
108 |
109 | if six.PY2 and isinstance(text, six.text_type):
110 | text = six.ensure_binary(text, "utf-8")
111 |
112 | if not sample:
113 | pieces = sp_model.EncodeAsPieces(text)
114 | else:
115 | pieces = sp_model.SampleEncodeAsPieces(text, 64, 0.1)
116 | new_pieces = []
117 | for piece in pieces:
118 | piece = printable_text(piece)
119 | if len(piece) > 1 and piece[-1] == "," and piece[-2].isdigit():
120 | cur_pieces = sp_model.EncodeAsPieces(
121 | six.ensure_binary(piece[:-1]).replace(SPIECE_UNDERLINE, b""))
122 | if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:
123 | if len(cur_pieces[0]) == 1:
124 | cur_pieces = cur_pieces[1:]
125 | else:
126 | cur_pieces[0] = cur_pieces[0][1:]
127 | cur_pieces.append(piece[-1])
128 | new_pieces.extend(cur_pieces)
129 | else:
130 | new_pieces.append(piece)
131 |
132 | # note(zhiliny): convert back to unicode for py2
133 | if six.PY2 and return_unicode:
134 | ret_pieces = []
135 | for piece in new_pieces:
136 | if isinstance(piece, str):
137 | piece = six.ensure_text(piece, "utf-8")
138 | ret_pieces.append(piece)
139 | new_pieces = ret_pieces
140 |
141 | return new_pieces
142 |
143 |
144 | def encode_ids(sp_model, text, sample=False):
145 | pieces = encode_pieces(sp_model, text, return_unicode=False, sample=sample)
146 | ids = [sp_model.PieceToId(piece) for piece in pieces]
147 | return ids
148 |
149 |
150 | def convert_to_unicode(text):
151 | """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
152 | if six.PY3:
153 | if isinstance(text, str):
154 | return text
155 | elif isinstance(text, bytes):
156 | return six.ensure_text(text, "utf-8", "ignore")
157 | else:
158 | raise ValueError("Unsupported string type: %s" % (type(text)))
159 | elif six.PY2:
160 | if isinstance(text, str):
161 | return six.ensure_text(text, "utf-8", "ignore")
162 | elif isinstance(text, six.text_type):
163 | return text
164 | else:
165 | raise ValueError("Unsupported string type: %s" % (type(text)))
166 | else:
167 | raise ValueError("Not running on Python2 or Python 3?")
168 |
169 |
170 | def printable_text(text):
171 | """Returns text encoded in a way suitable for print or `tf.logging`."""
172 |
173 | # These functions want `str` for both Python2 and Python3, but in one case
174 | # it's a Unicode string and in the other it's a byte string.
175 | if six.PY3:
176 | if isinstance(text, str):
177 | return text
178 | elif isinstance(text, bytes):
179 | return six.ensure_text(text, "utf-8", "ignore")
180 | else:
181 | raise ValueError("Unsupported string type: %s" % (type(text)))
182 | elif six.PY2:
183 | if isinstance(text, str):
184 | return text
185 | elif isinstance(text, six.text_type):
186 | return six.ensure_binary(text, "utf-8")
187 | else:
188 | raise ValueError("Unsupported string type: %s" % (type(text)))
189 | else:
190 | raise ValueError("Not running on Python2 or Python 3?")
191 |
192 |
193 | def load_vocab(vocab_file):
194 | """Loads a vocabulary file into a dictionary."""
195 | vocab = collections.OrderedDict()
196 | with tf.gfile.GFile(vocab_file, "r") as reader:
197 | while True:
198 | token = convert_to_unicode(reader.readline())
199 | if not token:
200 | break
201 | token = token.strip()#.split()[0]
202 | if token not in vocab:
203 | vocab[token] = len(vocab)
204 | return vocab
205 |
206 |
207 | def convert_by_vocab(vocab, items):
208 | """Converts a sequence of [tokens|ids] using the vocab."""
209 | output = []
210 | for item in items:
211 | output.append(vocab[item])
212 | return output
213 |
214 |
215 | def convert_tokens_to_ids(vocab, tokens):
216 | return convert_by_vocab(vocab, tokens)
217 |
218 |
219 | def convert_ids_to_tokens(inv_vocab, ids):
220 | return convert_by_vocab(inv_vocab, ids)
221 |
222 |
223 | def whitespace_tokenize(text):
224 | """Runs basic whitespace cleaning and splitting on a piece of text."""
225 | text = text.strip()
226 | if not text:
227 | return []
228 | tokens = text.split()
229 | return tokens
230 |
231 |
232 | class FullTokenizer(object):
233 | """Runs end-to-end tokenziation."""
234 |
235 | def __init__(self, vocab_file, do_lower_case=True, spm_model_file=None):
236 | self.vocab = None
237 | self.sp_model = None
238 | if spm_model_file:
239 | self.sp_model = spm.SentencePieceProcessor()
240 | tf.logging.info("loading sentence piece model")
241 | self.sp_model.Load(spm_model_file)
242 | # Note(mingdachen): For the purpose of a consistent API, we are
243 | # generating a vocabulary for the sentence piece tokenizer.
244 | self.vocab = {self.sp_model.IdToPiece(i): i for i
245 | in range(self.sp_model.GetPieceSize())}
246 | else:
247 | self.vocab = load_vocab(vocab_file)
248 | self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
249 | self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
250 | self.inv_vocab = {v: k for k, v in self.vocab.items()}
251 |
252 | @classmethod
253 | def from_scratch(cls, vocab_file, do_lower_case, spm_model_file):
254 | return FullTokenizer(vocab_file, do_lower_case, spm_model_file)
255 |
256 | # @classmethod
257 | # def from_hub_module(cls, hub_module, spm_model_file):
258 | # """Get the vocab file and casing info from the Hub module."""
259 | # with tf.Graph().as_default():
260 | # albert_module = hub.Module(hub_module)
261 | # tokenization_info = albert_module(signature="tokenization_info",
262 | # as_dict=True)
263 | # with tf.Session() as sess:
264 | # vocab_file, do_lower_case = sess.run(
265 | # [tokenization_info["vocab_file"],
266 | # tokenization_info["do_lower_case"]])
267 | # return FullTokenizer(
268 | # vocab_file=vocab_file, do_lower_case=do_lower_case,
269 | # spm_model_file=spm_model_file)
270 |
271 | def tokenize(self, text):
272 | if self.sp_model:
273 | split_tokens = encode_pieces(self.sp_model, text, return_unicode=False)
274 | else:
275 | split_tokens = []
276 | for token in self.basic_tokenizer.tokenize(text):
277 | for sub_token in self.wordpiece_tokenizer.tokenize(token):
278 | split_tokens.append(sub_token)
279 |
280 | return split_tokens
281 |
282 | def convert_tokens_to_ids(self, tokens):
283 | if self.sp_model:
284 | tf.logging.info("using sentence piece tokenzier.")
285 | return [self.sp_model.PieceToId(
286 | printable_text(token)) for token in tokens]
287 | else:
288 | return convert_by_vocab(self.vocab, tokens)
289 |
290 | def convert_ids_to_tokens(self, ids):
291 | if self.sp_model:
292 | tf.logging.info("using sentence piece tokenzier.")
293 | return [self.sp_model.IdToPiece(id_) for id_ in ids]
294 | else:
295 | return convert_by_vocab(self.inv_vocab, ids)
296 |
297 |
298 | class BasicTokenizer(object):
299 | """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
300 |
301 | def __init__(self, do_lower_case=True):
302 | """Constructs a BasicTokenizer.
303 |
304 | Args:
305 | do_lower_case: Whether to lower case the input.
306 | """
307 | self.do_lower_case = do_lower_case
308 |
309 | def tokenize(self, text):
310 | """Tokenizes a piece of text."""
311 | text = convert_to_unicode(text)
312 | text = self._clean_text(text)
313 |
314 | # This was added on November 1st, 2018 for the multilingual and Chinese
315 | # models. This is also applied to the English models now, but it doesn't
316 | # matter since the English models were not trained on any Chinese data
317 | # and generally don't have any Chinese data in them (there are Chinese
318 | # characters in the vocabulary because Wikipedia does have some Chinese
319 | # words in the English Wikipedia.).
320 | text = self._tokenize_chinese_chars(text)
321 |
322 | orig_tokens = whitespace_tokenize(text)
323 | split_tokens = []
324 | for token in orig_tokens:
325 | if self.do_lower_case:
326 | token = token.lower()
327 | token = self._run_strip_accents(token)
328 | split_tokens.extend(self._run_split_on_punc(token))
329 |
330 | output_tokens = whitespace_tokenize(" ".join(split_tokens))
331 | return output_tokens
332 |
333 | def _run_strip_accents(self, text):
334 | """Strips accents from a piece of text."""
335 | text = unicodedata.normalize("NFD", text)
336 | output = []
337 | for char in text:
338 | cat = unicodedata.category(char)
339 | if cat == "Mn":
340 | continue
341 | output.append(char)
342 | return "".join(output)
343 |
344 | def _run_split_on_punc(self, text):
345 | """Splits punctuation on a piece of text."""
346 | chars = list(text)
347 | i = 0
348 | start_new_word = True
349 | output = []
350 | while i < len(chars):
351 | char = chars[i]
352 | if _is_punctuation(char):
353 | output.append([char])
354 | start_new_word = True
355 | else:
356 | if start_new_word:
357 | output.append([])
358 | start_new_word = False
359 | output[-1].append(char)
360 | i += 1
361 |
362 | return ["".join(x) for x in output]
363 |
364 | def _tokenize_chinese_chars(self, text):
365 | """Adds whitespace around any CJK character."""
366 | output = []
367 | for char in text:
368 | cp = ord(char)
369 | if self._is_chinese_char(cp):
370 | output.append(" ")
371 | output.append(char)
372 | output.append(" ")
373 | else:
374 | output.append(char)
375 | return "".join(output)
376 |
377 | def _is_chinese_char(self, cp):
378 | """Checks whether CP is the codepoint of a CJK character."""
379 | # This defines a "chinese character" as anything in the CJK Unicode block:
380 | # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
381 | #
382 | # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
383 | # despite its name. The modern Korean Hangul alphabet is a different block,
384 | # as is Japanese Hiragana and Katakana. Those alphabets are used to write
385 | # space-separated words, so they are not treated specially and handled
386 | # like all of the other languages.
387 | if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
388 | (cp >= 0x3400 and cp <= 0x4DBF) or #
389 | (cp >= 0x20000 and cp <= 0x2A6DF) or #
390 | (cp >= 0x2A700 and cp <= 0x2B73F) or #
391 | (cp >= 0x2B740 and cp <= 0x2B81F) or #
392 | (cp >= 0x2B820 and cp <= 0x2CEAF) or
393 | (cp >= 0xF900 and cp <= 0xFAFF) or #
394 | (cp >= 0x2F800 and cp <= 0x2FA1F)): #
395 | return True
396 |
397 | return False
398 |
399 | def _clean_text(self, text):
400 | """Performs invalid character removal and whitespace cleanup on text."""
401 | output = []
402 | for char in text:
403 | cp = ord(char)
404 | if cp == 0 or cp == 0xfffd or _is_control(char):
405 | continue
406 | if _is_whitespace(char):
407 | output.append(" ")
408 | else:
409 | output.append(char)
410 | return "".join(output)
411 |
412 |
413 | class WordpieceTokenizer(object):
414 | """Runs WordPiece tokenziation."""
415 |
416 | def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
417 | self.vocab = vocab
418 | self.unk_token = unk_token
419 | self.max_input_chars_per_word = max_input_chars_per_word
420 |
421 | def tokenize(self, text):
422 | """Tokenizes a piece of text into its word pieces.
423 |
424 | This uses a greedy longest-match-first algorithm to perform tokenization
425 | using the given vocabulary.
426 |
427 | For example:
428 | input = "unaffable"
429 | output = ["un", "##aff", "##able"]
430 |
431 | Args:
432 | text: A single token or whitespace separated tokens. This should have
433 | already been passed through `BasicTokenizer`.
434 |
435 | Returns:
436 | A list of wordpiece tokens.
437 | """
438 |
439 | text = convert_to_unicode(text)
440 |
441 | output_tokens = []
442 | for token in whitespace_tokenize(text):
443 | chars = list(token)
444 | if len(chars) > self.max_input_chars_per_word:
445 | output_tokens.append(self.unk_token)
446 | continue
447 |
448 | is_bad = False
449 | start = 0
450 | sub_tokens = []
451 | while start < len(chars):
452 | end = len(chars)
453 | cur_substr = None
454 | while start < end:
455 | substr = "".join(chars[start:end])
456 | if start > 0:
457 | substr = "##" + six.ensure_str(substr)
458 | if substr in self.vocab:
459 | cur_substr = substr
460 | break
461 | end -= 1
462 | if cur_substr is None:
463 | is_bad = True
464 | break
465 | sub_tokens.append(cur_substr)
466 | start = end
467 |
468 | if is_bad:
469 | output_tokens.append(self.unk_token)
470 | else:
471 | output_tokens.extend(sub_tokens)
472 | return output_tokens
473 |
474 |
475 | def _is_whitespace(char):
476 | """Checks whether `chars` is a whitespace character."""
477 | # \t, \n, and \r are technically control characters but we treat them
478 | # as whitespace since they are generally considered as such.
479 | if char == " " or char == "\t" or char == "\n" or char == "\r":
480 | return True
481 | cat = unicodedata.category(char)
482 | if cat == "Zs":
483 | return True
484 | return False
485 |
486 |
487 | def _is_control(char):
488 | """Checks whether `chars` is a control character."""
489 | # These are technically control characters but we count them as whitespace
490 | # characters.
491 | if char == "\t" or char == "\n" or char == "\r":
492 | return False
493 | cat = unicodedata.category(char)
494 | if cat in ("Cc", "Cf"):
495 | return True
496 | return False
497 |
498 |
499 | def _is_punctuation(char):
500 | """Checks whether `chars` is a punctuation character."""
501 | cp = ord(char)
502 | # We treat all non-letter/number ASCII as punctuation.
503 | # Characters such as "^", "$", and "`" are not in the Unicode
504 | # Punctuation class but we treat them as punctuation anyways, for
505 | # consistency.
506 | if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
507 | (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
508 | return True
509 | cat = unicodedata.category(char)
510 | if cat.startswith("P"):
511 | return True
512 | return False
513 |
--------------------------------------------------------------------------------
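
A minimal usage sketch of the FullTokenizer defined above. It assumes the repository root is on PYTHONPATH and that the downloaded ALBERT checkpoint provides a word-piece vocabulary file; the path vocab_chinese.txt below follows the albert_base_zh file listing and may differ in your checkout.

```
from sentiment_analysis_albert import tokenization

tokenizer = tokenization.FullTokenizer(
    vocab_file='albert_small_zh_google/vocab_chinese.txt',   # assumed local path
    do_lower_case=True)

tokens = tokenizer.tokenize('我喜欢这个地方')        # BasicTokenizer splits every CJK character
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))       # plain vocabulary lookup, one id per token
```
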
/sentiment_analysis_albert/train.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu May 30 21:42:07 2019
4 |
5 | @author: cm
6 | """
7 |
8 |
9 | import os
10 | #os.environ["CUDA_VISIBLE_DEVICES"] = '-1'
11 | import numpy as np
12 | import tensorflow as tf
13 | from sentiment_analysis_albert.classifier_utils import get_features,get_features_test
14 | from sentiment_analysis_albert.networks import NetworkAlbert
15 | from sentiment_analysis_albert.hyperparameters import Hyperparamters as hp
16 | from sentiment_analysis_albert.utils import shuffle_one,select,time_now_string
17 |
18 |
19 | # Load Model
20 | pwd = os.path.dirname(os.path.abspath(__file__))
21 | MODEL = NetworkAlbert(is_training=True )
22 |
23 | # Get data features
24 | input_ids,input_masks,segment_ids,label_ids = get_features()
25 | input_ids_test,input_masks_test,segment_ids_test,label_ids_test = get_features_test()
26 | num_train_samples = len(input_ids)
27 | arr = np.arange(num_train_samples)
28 | num_batchs = int((num_train_samples - 1)/hp.batch_size) + 1
29 | print('number of batches:',num_batchs)
30 | ids_test = np.arange(len(input_ids_test))
31 |
32 | # Set up the graph
33 | saver = tf.train.Saver(max_to_keep=hp.max_to_keep)
34 | sess = tf.Session()
35 | sess.run(tf.global_variables_initializer())
36 |
37 | # Load model saved before
38 | MODEL_SAVE_PATH = os.path.join(pwd, hp.file_save_model)
39 | ckpt = tf.train.get_checkpoint_state(MODEL_SAVE_PATH)
40 | if ckpt and ckpt.model_checkpoint_path:
41 | saver.restore(sess, ckpt.model_checkpoint_path)
42 | print('Restored model!')
43 |
44 |
45 | with sess.as_default():
46 | # Tensorboard writer
47 | writer = tf.summary.FileWriter(hp.logdir, sess.graph)
48 | for i in range(hp.num_train_epochs):
49 | indexs = shuffle_one(arr)
50 | for j in range(num_batchs-1):
51 | i1 = indexs[j * hp.batch_size:min((j + 1) * hp.batch_size, num_train_samples)]
52 | # Get features
53 | input_id_ = select(input_ids,i1)
54 | input_mask_ = select(input_masks,i1)
55 | segment_id_ = select(segment_ids,i1)
56 | label_id_ = select(label_ids,i1)
57 | # Feed dict
58 | fd = {MODEL.input_ids: input_id_,
59 | MODEL.input_masks: input_mask_,
60 | MODEL.segment_ids:segment_id_,
61 | MODEL.label_ids:label_id_}
62 | # Optimizer
63 | sess.run(MODEL.optimizer, feed_dict = fd)
64 | # Tensorboard
65 | if j%hp.summary_step==0:
66 | summary, global_step = sess.run([MODEL.merged, MODEL.global_step], feed_dict = fd)
67 | writer.add_summary(summary, global_step)
68 | # Save Model
69 | if j%(num_batchs//hp.num_saved_per_epoch)==0:
70 | if not os.path.exists(os.path.join(pwd, hp.file_save_model)):
71 | os.makedirs(os.path.join(pwd, hp.file_save_model))
72 | saver.save(sess, os.path.join(pwd, hp.file_save_model, 'model_%s_%s.ckpt'%(str(i),str(j))))
73 | # Log
74 | if j % hp.print_step == 0:
75 | # Loss of Train data
76 | fd = {MODEL.input_ids: input_id_,
77 | MODEL.input_masks: input_mask_ ,
78 | MODEL.segment_ids:segment_id_,
79 | MODEL.label_ids:label_id_}
80 | loss = sess.run(MODEL.loss, feed_dict = fd)
81 | print('Time:%s, Epoch:%s, Batch number:%s/%s, Loss:%s'%(time_now_string(),str(i),str(j),str(num_batchs),str(loss)))
82 | # Loss of Test data
83 | indexs_test = shuffle_one(ids_test)[:hp.batch_size_eval]
84 | input_id_test = select(input_ids_test,indexs_test)
85 | input_mask_test = select(input_masks_test,indexs_test)
86 | segment_id_test = select(segment_ids_test,indexs_test)
87 | label_id_test = select(label_ids_test,indexs_test)
88 | fd_test = {MODEL.input_ids:input_id_test,
89 | MODEL.input_masks:input_mask_test ,
90 | MODEL.segment_ids:segment_id_test,
91 | MODEL.label_ids:label_id_test}
92 | loss = sess.run(MODEL.loss, feed_dict = fd_test)
93 | print('Time:%s, Epoch:%s, Batch number:%s/%s, Loss(test):%s'%(time_now_string(),str(i),str(j),str(num_batchs),str(loss)))
94 | print('Optimization finished')
95 |
96 |
97 |
98 |
--------------------------------------------------------------------------------
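
train.py above computes the number of batches with ceiling-style integer arithmetic and then slices a shuffled index array; note that the inner loop runs over range(num_batchs-1), so the final (possibly partial) batch of each epoch is skipped. A standalone sketch of that arithmetic with toy sizes, illustrative only:

```
import numpy as np

num_train_samples = 10
batch_size = 4
num_batchs = int((num_train_samples - 1) / batch_size) + 1    # ceil(10 / 4) = 3
print('number of batches:', num_batchs)

indexs = np.random.permutation(num_train_samples)             # plays the role of shuffle_one(arr)
for j in range(num_batchs - 1):                               # last batch skipped, as in train.py
    i1 = indexs[j * batch_size:min((j + 1) * batch_size, num_train_samples)]
    print('batch', j, 'indices', i1.tolist())
```
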
/sentiment_analysis_albert/utils.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu May 29 20:40:40 2019
4 |
5 | @author: cm
6 | """
7 |
8 | import time
9 | import pandas as pd
10 | import numpy as np
11 |
12 |
13 |
14 | def time_now_string():
15 | return time.strftime("%Y-%m-%d %H:%M:%S",time.localtime( time.time() ))
16 |
17 |
18 | def load_csv(file,header=0,encoding="utf-8"):
19 | return pd.read_csv(file,
20 | encoding=encoding,
21 | header=header,
22 | error_bad_lines=False)
23 |
24 |
25 | def save_csv(dataframe,file,header=True,index=None,encoding="utf-8"):
26 | return dataframe.to_csv(file,
27 | mode='w+',
28 | header=header,
29 | index=index,
30 | encoding=encoding)
31 |
32 | def save_excel(dataframe,file,header=True,sheetname='Sheet1'):
33 | return dataframe.to_excel(file,
34 | header=header,
35 | sheet_name=sheetname)
36 |
37 | def load_excel(file,header=0,sheetname=None):
38 | dfs = pd.read_excel(file,
39 | header=header,
40 | sheet_name=sheetname)
41 | sheet_names = list(dfs.keys())
42 | print('Name of first sheet:',sheet_names[0])
43 | df = dfs[sheet_names[0]]
44 | print('Load excel data finished!')
45 | return df.fillna("")
46 |
47 | def load_txt(file):
48 | with open(file,encoding='utf-8',errors='ignore') as fp:
49 | lines = fp.readlines()
50 | lines = [l.strip() for l in lines]
51 | return lines
52 |
53 |
54 | def save_txt(file,lines):
55 | lines = [l+'\n' for l in lines]
56 | with open(file,'w+',encoding='utf-8') as fp:  # use 'a+' instead to append
57 | fp.writelines(lines)
58 |
59 |
60 | def select(data,ids):
61 | return [data[i] for i in ids]
62 |
63 | def shuffle_one(a1):
64 | ran = np.arange(len(a1))
65 | np.random.shuffle(ran)
66 | a1_ = [a1[l] for l in ran]
67 | return a1_
68 |
69 |
70 | def shuffle_two(a1,a2):
71 | """
72 | Shuffle a1 and a2 with the same random permutation.
73 | """
74 | ran = np.arange(len(a1))
75 | np.random.shuffle(ran)
76 | a1_ = [a1[l] for l in ran]
77 | a2_ = [a2[l] for l in ran]
78 | return a1_, a2_
79 |
80 | def load_vocabulary(file_vocabulary_label):
81 | """
82 | Load vocabulary to dict
83 | """
84 | vocabulary = load_txt(file_vocabulary_label)
85 | dic = {}
86 | for i,l in enumerate(vocabulary):
87 | dic[str(i)] = str(l)
88 | return dic
89 |
90 |
91 | if __name__ == "__main__":
92 | print(time_now_string())
93 |
94 |
95 |
96 |
97 |
98 |
99 |
100 |
101 |
102 |
103 |
104 |
--------------------------------------------------------------------------------
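
A tiny usage sketch for the helpers above: shuffle_two applies one random permutation to two aligned lists, and select picks elements by index. Illustrative only; it assumes the repository root is importable.

```
from sentiment_analysis_albert.utils import select, shuffle_two

texts  = ['好评', '差评', '一般']
labels = [1, -1, 0]

texts_, labels_ = shuffle_two(texts, labels)   # the same permutation is applied to both lists
print(list(zip(texts_, labels_)))              # pairs stay aligned, order is random

print(select(texts, [0, 2]))                   # ['好评', '一般']
```
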
/sentiment_analysis_albert_emoji/README.md:
--------------------------------------------------------------------------------
1 | # Introduction
2 | 1. Training and testing for this project were done on TensorFlow 1.15.0.
3 | 2. The task is Chinese text sentiment analysis as multi-class classification with 3 labels: 1, 0 and -1, meaning positive, neutral and negative sentiment respectively.
4 | 3. Baidu Cloud download for albert_small_zh_google:
5 | Link: https://pan.baidu.com/s/1RKzGJTazlZ7y12YRbAWvyA
6 | Extraction code: wuxw
7 |
8 | # Usage
9 | 1. Prepare the data
10 | Data files: data/sa_train.csv (training) and data/sa_test.csv (test); the expected column layout is sketched after this file.
11 | 2. Set the hyperparameters
12 | Edit the values in hyperparameters.py directly.
13 | 3. Train
14 | python train.py
15 | 4. Predict: python predict.py
16 |
17 | # Code walkthrough on Zhihu
18 | https://zhuanlan.zhihu.com/p/338806367
19 |
--------------------------------------------------------------------------------
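
A minimal sketch of the training-data layout the emoji variant expects: a CSV with `content` and `label` columns and labels in {1, 0, -1}, which is what read_csv() in classifier_utils.py consumes. The rows below are made-up examples; writing the file is left as a comment so no real data is overwritten.

```
import pandas as pd

df = pd.DataFrame({
    'content': ['我喜欢这个地方', '一般般吧', '太失望了'],   # toy sentences only
    'label':   [1, 0, -1],
})
print(df)
# df.to_csv('data/sa_train.csv', index=False, encoding='utf-8')   # expected file layout
```
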
/sentiment_analysis_albert_emoji/__init__.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Team Authors.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert_emoji/albert_base_zh/README.md:
--------------------------------------------------------------------------------
1 | # Language Model
2 |
3 | ## ALBERT Base Chinese
4 | - albert_base_zh/albert_config.json
5 | - albert_base_zh/albert_model.ckpt.data-00000-of-00001
6 | - albert_base_zh/albert_model.ckpt.index
7 | - albert_base_zh/albert_model.ckpt.meta
8 | - albert_base_zh/checkpoint
9 | - albert_base_zh/vocab_chinese.txt
10 | - albert_base_zh/vocab_emoji.txt
11 |
12 | ## Download
13 | Link: https://pan.baidu.com/s/1BuXZyj1VmlvX60agv0cE5A?pwd=b9aq
14 | Extraction code: b9aq
15 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert_emoji/albert_small_zh_google/README.md:
--------------------------------------------------------------------------------
1 | # Language Model: ALBERT Base Chinese
2 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert_emoji/classifier_utils.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon Nov 12 14:23:12 2020
4 |
5 | @author: cm
6 | """
7 |
8 |
9 |
10 | import os
11 | import collections
12 | import tensorflow_hub as hub
13 | import tensorflow.compat.v1 as tf
14 | from tensorflow.contrib import tpu as contrib_tpu
15 | from tensorflow.contrib import data as contrib_data
16 | from tensorflow.contrib import metrics as contrib_metrics
17 |
18 | from sentiment_analysis_albert_emoji import modeling
19 | from sentiment_analysis_albert_emoji import optimization
20 | from sentiment_analysis_albert_emoji import tokenization
21 | from sentiment_analysis_albert_emoji.utils import load_csv
22 | from sentiment_analysis_albert_emoji.utils import get_emoji
23 | from sentiment_analysis_albert_emoji.hyperparameters import Hyperparamters as hp
24 |
25 |
26 |
27 | def index2label(index):
28 | return hp.dict_label[str(index)]
29 |
30 |
31 | def emoji2id(string_emoji):
32 | """
33 | Convert an emoji string into a fixed-length list of ids (length hp.sequence_length_emoji), padding with 0 or truncating.
34 | """
35 | length = len(string_emoji)
36 | if length <= hp.sequence_length_emoji:
37 | return [hp.vocab_emoji_char2id.get(l,0) for l in string_emoji]+[0]*(hp.sequence_length_emoji-length)
38 | elif length > hp.sequence_length_emoji:
39 | return [hp.vocab_emoji_char2id.get(l,0) for l in string_emoji[:hp.sequence_length_emoji]]
40 |
41 |
42 | def read_csv(input_file):
43 | """Reads a tab separated value file."""
44 | df = load_csv(input_file)
45 | jobcontent = df['content'].tolist()
46 | jlabel = df['label'].tolist()
47 | lines = [[str(jlabel[i]),str(jobcontent[i])] for i in range(len(jobcontent))]
48 | print('Read csv finished!(1)')
49 | lines2 = [ [list(hp.dict_label.keys())[list(hp.dict_label.values()).index( l[0])], l[1]] for l in lines if type(l[1])==str]
50 | return lines2
51 |
52 |
53 |
54 | class InputExample(object):
55 | """A single training/test example for simple sequence classification."""
56 |
57 | def __init__(self, guid, text_a, text_b=None, label=None):
58 | """Constructs a InputExample.
59 |
60 | Args:
61 | guid: Unique id for the example.
62 | text_a: string. The untokenized text of the first sequence. For single
63 | sequence tasks, only this sequence must be specified.
64 | text_b: (Optional) string. The untokenized text of the second sequence.
65 | Only must be specified for sequence pair tasks.
66 | label: (Optional) string. The label of the example. This should be
67 | specified for train and dev examples, but not for test examples.
68 | """
69 | self.guid = guid
70 | self.text_a = text_a
71 | self.text_b = text_b
72 | self.label = label
73 |
74 |
75 | class PaddingInputExample(object):
76 | """Fake example so the num input examples is a multiple of the batch size.
77 |
78 | When running eval/predict on the TPU, we need to pad the number of examples
79 | to be a multiple of the batch size, because the TPU requires a fixed batch
80 | size. The alternative is to drop the last batch, which is bad because it means
81 | the entire output data won't be generated.
82 |
83 | We use this class instead of `None` because treating `None` as padding
84 | batches could cause silent errors.
85 | """
86 |
87 |
88 | class InputFeatures(object):
89 | """A single set of features of data."""
90 |
91 | def __init__(self,
92 | input_ids,
93 | input_mask,
94 | segment_ids,
95 | label_id,
96 | guid=None,
97 | example_id=None,
98 | is_real_example=True):
99 | self.input_ids = input_ids
100 | self.input_mask = input_mask
101 | self.segment_ids = segment_ids
102 | self.label_id = label_id
103 | self.example_id = example_id
104 | self.guid = guid
105 | self.is_real_example = is_real_example
106 |
107 |
108 | class DataProcessor(object):
109 | """Base class for data converters for sequence classification data sets."""
110 |
111 | def __init__(self, use_spm, do_lower_case):
112 | super(DataProcessor, self).__init__()
113 | self.use_spm = use_spm
114 | self.do_lower_case = do_lower_case
115 |
116 | def get_train_examples(self, data_dir):
117 | """Gets a collection of `InputExample`s for the train set."""
118 | raise NotImplementedError()
119 |
120 | def get_dev_examples(self, data_dir):
121 | """Gets a collection of `InputExample`s for the dev set."""
122 | raise NotImplementedError()
123 |
124 | def get_test_examples(self, data_dir):
125 | """Gets a collection of `InputExample`s for prediction."""
126 | raise NotImplementedError()
127 |
128 | def get_labels(self):
129 | """Gets the list of labels for this data set."""
130 | raise NotImplementedError()
131 |
132 | @classmethod
133 | def _read_csv(cls,input_file):
134 | """Reads a tab separated value file."""
135 | df = load_csv(input_file)
136 | jobcontent = df['content'].tolist()
137 | jlabel = df['label'].tolist()
138 | lines = [[str(jlabel[i]),str(jobcontent[i])] for i in range(len(jobcontent))]
139 | print('Length of data:',len(lines))
140 | lines2 = [ [list(hp.dict_label.keys())[list(hp.dict_label.values()).index( l[0])], l[1]] for l in lines if type(l[1])==str]
141 | return lines2
142 |
143 |
144 | class ClassifyProcessor(DataProcessor):
145 | """Processor for the MRPC data set (GLUE version)."""
146 |
147 | def __init__(self):
148 | self.labels = set()
149 |
150 | def get_train_examples(self, data_dir):
151 | """See base class."""
152 | return self._create_examples(
153 | self._read_csv(os.path.join(data_dir, hp.train_data)), "train")
154 |
155 | def get_dev_examples(self, data_dir):
156 | """See base class."""
157 | return self._create_examples(
158 | self._read_csv(os.path.join(data_dir, hp.test_data)), "dev")
159 |
160 | def get_test_examples(self, data_dir):
161 | """See base class."""
162 | return self._create_examples(
163 | self._read_csv(os.path.join(data_dir, hp.test_data)), "test")
164 |
165 | def get_labels(self):
166 | """See base class."""
167 | return ['0','1','2']
168 |
169 | def _create_examples(self, lines, set_type):
170 | """Creates examples for the training and dev sets."""
171 | examples = []
172 | for (i, line) in enumerate(lines):
173 | guid = "%s-%s" % (set_type, i)
174 | text_a = tokenization.convert_to_unicode(line[1])
175 | label = tokenization.convert_to_unicode(line[0])
176 | self.labels.add(label)
177 | examples.append(
178 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
179 | return examples
180 |
181 |
182 | def convert_single_example(ex_index, example, label_list, max_seq_length,
183 | tokenizer, task_name):
184 | """Converts a single `InputExample` into a single `InputFeatures`."""
185 |
186 | if isinstance(example, PaddingInputExample):
187 | return InputFeatures(
188 | input_ids=[0] * max_seq_length,
189 | input_mask=[0] * max_seq_length,
190 | segment_ids=[0] * max_seq_length,
191 | label_id=0,
192 | is_real_example=False)
193 |
194 | if task_name != "sts-b":
195 | label_map = {}
196 | for (i, label) in enumerate(label_list):
197 | label_map[label] = i
198 |
199 | tokens_a = tokenizer.tokenize(example.text_a)
200 | tokens_b = None
201 | if example.text_b:
202 | tokens_b = tokenizer.tokenize(example.text_b)
203 |
204 | if tokens_b:
205 | # Modifies `tokens_a` and `tokens_b` in place so that the total
206 | # length is less than the specified length.
207 | # Account for [CLS], [SEP], [SEP] with "- 3"
208 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
209 | else:
210 | # Account for [CLS] and [SEP] with "- 2"
211 | if len(tokens_a) > max_seq_length - 2:
212 | tokens_a = tokens_a[0:(max_seq_length - 2)]
213 |
214 | # The convention in ALBERT is:
215 | # (a) For sequence pairs:
216 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
217 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
218 | # (b) For single sequences:
219 | # tokens: [CLS] the dog is hairy . [SEP]
220 | # type_ids: 0 0 0 0 0 0 0
221 | #
222 | # Where "type_ids" are used to indicate whether this is the first
223 | # sequence or the second sequence. The embedding vectors for `type=0` and
224 | # `type=1` were learned during pre-training and are added to the
225 | # embedding vector (and position vector). This is not *strictly* necessary
226 | # since the [SEP] token unambiguously separates the sequences, but it makes
227 | # it easier for the model to learn the concept of sequences.
228 | #
229 | # For classification tasks, the first vector (corresponding to [CLS]) is
230 | # used as the "sentence vector". Note that this only makes sense because
231 | # the entire model is fine-tuned.
232 | tokens = []
233 | segment_ids = []
234 | tokens.append("[CLS]")
235 | segment_ids.append(0)
236 | for token in tokens_a:
237 | tokens.append(token)
238 | segment_ids.append(0)
239 | tokens.append("[SEP]")
240 | segment_ids.append(0)
241 |
242 | if tokens_b:
243 | for token in tokens_b:
244 | tokens.append(token)
245 | segment_ids.append(1)
246 | tokens.append("[SEP]")
247 | segment_ids.append(1)
248 |
249 | input_ids = tokenizer.convert_tokens_to_ids(tokens)
250 |
251 | # The mask has 1 for real tokens and 0 for padding tokens. Only real
252 | # tokens are attended to.
253 | input_mask = [1] * len(input_ids)
254 |
255 | # Zero-pad up to the sequence length.
256 | while len(input_ids) < max_seq_length:
257 | input_ids.append(0)
258 | input_mask.append(0)
259 | segment_ids.append(0)
260 |
261 | assert len(input_ids) == max_seq_length
262 | assert len(input_mask) == max_seq_length
263 | assert len(segment_ids) == max_seq_length
264 |
265 | if task_name != "sts-b":
266 | label_id = label_map[example.label]
267 | else:
268 | label_id = example.label
269 |
270 | # if ex_index < 5:
271 | # tf.logging.info("*** Example ***")
272 | # tf.logging.info("guid: %s" % (example.guid))
273 | # tf.logging.info("tokens: %s" % " ".join(
274 | # [tokenization.printable_text(x) for x in tokens]))
275 | # tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
276 | # tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
277 | # tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
278 | # tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
279 |
280 | feature = InputFeatures(
281 | input_ids=input_ids,
282 | input_mask=input_mask,
283 | segment_ids=segment_ids,
284 | label_id=label_id,
285 | is_real_example=True)
286 | return feature
287 |
288 |
289 | def file_based_convert_examples_to_features(
290 | examples, label_list, max_seq_length, tokenizer, output_file, task_name):
291 | """Convert a set of `InputExample`s to a TFRecord file."""
292 |
293 | writer = tf.python_io.TFRecordWriter(output_file)
294 |
295 | for (ex_index, example) in enumerate(examples):
296 | if ex_index % 10000 == 0:
297 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
298 |
299 | feature = convert_single_example(ex_index, example, label_list,
300 | max_seq_length, tokenizer, task_name)
301 |
302 | def create_int_feature(values):
303 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
304 | return f
305 |
306 | def create_float_feature(values):
307 | f = tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))
308 | return f
309 |
310 | features = collections.OrderedDict()
311 | features["input_ids"] = create_int_feature(feature.input_ids)
312 | features["input_mask"] = create_int_feature(feature.input_mask)
313 | features["segment_ids"] = create_int_feature(feature.segment_ids)
314 | features["label_ids"] = create_float_feature([feature.label_id])\
315 | if task_name == "sts-b" else create_int_feature([feature.label_id])
316 | features["is_real_example"] = create_int_feature(
317 | [int(feature.is_real_example)])
318 |
319 | tf_example = tf.train.Example(features=tf.train.Features(feature=features))
320 | writer.write(tf_example.SerializeToString())
321 | writer.close()
322 |
323 |
324 | def file_based_input_fn_builder(input_file, seq_length, is_training,
325 | drop_remainder, task_name, use_tpu, bsz,
326 | multiple=1):
327 | """Creates an `input_fn` closure to be passed to TPUEstimator."""
328 | labeltype = tf.float32 if task_name == "sts-b" else tf.int64
329 |
330 | name_to_features = {
331 | "input_ids": tf.FixedLenFeature([seq_length * multiple], tf.int64),
332 | "input_mask": tf.FixedLenFeature([seq_length * multiple], tf.int64),
333 | "segment_ids": tf.FixedLenFeature([seq_length * multiple], tf.int64),
334 | "label_ids": tf.FixedLenFeature([], labeltype),
335 | "is_real_example": tf.FixedLenFeature([], tf.int64),
336 | }
337 |
338 | def _decode_record(record, name_to_features):
339 | """Decodes a record to a TensorFlow example."""
340 | example = tf.parse_single_example(record, name_to_features)
341 |
342 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
343 | # So cast all int64 to int32.
344 | for name in list(example.keys()):
345 | t = example[name]
346 | if t.dtype == tf.int64:
347 | t = tf.to_int32(t)
348 | example[name] = t
349 |
350 | return example
351 |
352 | def input_fn(params):
353 | """The actual input function."""
354 | if use_tpu:
355 | batch_size = params["batch_size"]
356 | else:
357 | batch_size = bsz
358 |
359 | # For training, we want a lot of parallel reading and shuffling.
360 | # For eval, we want no shuffling and parallel reading doesn't matter.
361 | d = tf.data.TFRecordDataset(input_file)
362 | if is_training:
363 | d = d.repeat()
364 | d = d.shuffle(buffer_size=100)
365 |
366 | d = d.apply(
367 | contrib_data.map_and_batch(
368 | lambda record: _decode_record(record, name_to_features),
369 | batch_size=batch_size,
370 | drop_remainder=drop_remainder))
371 |
372 | return d
373 |
374 | return input_fn
375 |
376 |
377 | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
378 | """Truncates a sequence pair in place to the maximum length."""
379 |
380 | # This is a simple heuristic which will always truncate the longer sequence
381 | # one token at a time. This makes more sense than truncating an equal percent
382 | # of tokens from each, since if one sequence is very short then each token
383 | # that's truncated likely contains more information than a longer sequence.
384 | while True:
385 | total_length = len(tokens_a) + len(tokens_b)
386 | if total_length <= max_length:
387 | break
388 | if len(tokens_a) > len(tokens_b):
389 | tokens_a.pop()
390 | else:
391 | tokens_b.pop()
392 |
393 |
394 | def _create_model_from_hub(hub_module, is_training, input_ids, input_mask,
395 | segment_ids):
396 | """Creates an ALBERT model from TF-Hub."""
397 | tags = set()
398 | if is_training:
399 | tags.add("train")
400 | albert_module = hub.Module(hub_module, tags=tags, trainable=True)
401 | albert_inputs = dict(
402 | input_ids=input_ids,
403 | input_mask=input_mask,
404 | segment_ids=segment_ids)
405 | albert_outputs = albert_module(
406 | inputs=albert_inputs,
407 | signature="tokens",
408 | as_dict=True)
409 | output_layer = albert_outputs["pooled_output"]
410 | return output_layer
411 |
412 |
413 | def _create_model_from_scratch(albert_config, is_training, input_ids,
414 | input_mask, segment_ids, use_one_hot_embeddings):
415 | """Creates an ALBERT model from scratch (as opposed to hub)."""
416 | model = modeling.AlbertModel(
417 | config=albert_config,
418 | is_training=is_training,
419 | input_ids=input_ids,
420 | input_mask=input_mask,
421 | token_type_ids=segment_ids,
422 | use_one_hot_embeddings=use_one_hot_embeddings)
423 | output_layer = model.get_pooled_output()
424 | return output_layer
425 |
426 |
427 | def create_model(albert_config, is_training, input_ids, input_mask, segment_ids,
428 | labels, num_labels, use_one_hot_embeddings, task_name,
429 | hub_module):
430 | """Creates a classification model."""
431 | if hub_module:
432 | tf.logging.info("creating model from hub_module: %s", hub_module)
433 | output_layer = _create_model_from_hub(hub_module, is_training, input_ids,
434 | input_mask, segment_ids)
435 | else:
436 | tf.logging.info("creating model from albert_config")
437 | output_layer = _create_model_from_scratch(albert_config, is_training,
438 | input_ids, input_mask,
439 | segment_ids,
440 | use_one_hot_embeddings)
441 |
442 | hidden_size = output_layer.shape[-1].value
443 |
444 | output_weights = tf.get_variable(
445 | "output_weights", [num_labels, hidden_size],
446 | initializer=tf.truncated_normal_initializer(stddev=0.02))
447 |
448 | output_bias = tf.get_variable(
449 | "output_bias", [num_labels], initializer=tf.zeros_initializer())
450 |
451 | with tf.variable_scope("loss"):
452 | if is_training:
453 | # I.e., 0.1 dropout
454 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
455 |
456 | logits = tf.matmul(output_layer, output_weights, transpose_b=True)
457 | logits = tf.nn.bias_add(logits, output_bias)
458 | if task_name != "sts-b":
459 | probabilities = tf.nn.softmax(logits, axis=-1)
460 | predictions = tf.argmax(probabilities, axis=-1, output_type=tf.int32)
461 | log_probs = tf.nn.log_softmax(logits, axis=-1)
462 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
463 |
464 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
465 | else:
466 | probabilities = logits
467 | logits = tf.squeeze(logits, [-1])
468 | predictions = logits
469 | per_example_loss = tf.square(logits - labels)
470 | loss = tf.reduce_mean(per_example_loss)
471 |
472 | return (loss, per_example_loss, probabilities, logits, predictions)
473 |
474 |
475 | def model_fn_builder(albert_config, num_labels, init_checkpoint, learning_rate,
476 | num_train_steps, num_warmup_steps, use_tpu,
477 | use_one_hot_embeddings, task_name, hub_module=None,
478 | optimizer="adamw"):
479 | """Returns `model_fn` closure for TPUEstimator."""
480 |
481 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument
482 | """The `model_fn` for TPUEstimator."""
483 |
484 | tf.logging.info("*** Features ***")
485 | for name in sorted(features.keys()):
486 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
487 |
488 | input_ids = features["input_ids"]
489 | input_mask = features["input_mask"]
490 | segment_ids = features["segment_ids"]
491 | label_ids = features["label_ids"]
492 | is_real_example = None
493 | if "is_real_example" in features:
494 | is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
495 | else:
496 | is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32)
497 |
498 | is_training = (mode == tf.estimator.ModeKeys.TRAIN)
499 |
500 | (total_loss, per_example_loss, probabilities, logits, predictions) = \
501 | create_model(albert_config, is_training, input_ids, input_mask,
502 | segment_ids, label_ids, num_labels,
503 | use_one_hot_embeddings, task_name, hub_module)
504 |
505 | tvars = tf.trainable_variables()
506 | initialized_variable_names = {}
507 | scaffold_fn = None
508 | if init_checkpoint:
509 | (assignment_map, initialized_variable_names
510 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
511 | if use_tpu:
512 |
513 | def tpu_scaffold():
514 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
515 | return tf.train.Scaffold()
516 |
517 | scaffold_fn = tpu_scaffold
518 | else:
519 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
520 |
521 | tf.logging.info("**** Trainable Variables ****")
522 | for var in tvars:
523 | init_string = ""
524 | if var.name in initialized_variable_names:
525 | init_string = ", *INIT_FROM_CKPT*"
526 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
527 | init_string)
528 |
529 | output_spec = None
530 | if mode == tf.estimator.ModeKeys.TRAIN:
531 |
532 | train_op = optimization.create_optimizer(
533 | total_loss, learning_rate, num_train_steps, num_warmup_steps,
534 | use_tpu, optimizer)
535 |
536 | output_spec = contrib_tpu.TPUEstimatorSpec(
537 | mode=mode,
538 | loss=total_loss,
539 | train_op=train_op,
540 | scaffold_fn=scaffold_fn)
541 | elif mode == tf.estimator.ModeKeys.EVAL:
542 | if task_name not in ["sts-b", "cola"]:
543 | def metric_fn(per_example_loss, label_ids, logits, is_real_example):
544 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
545 | accuracy = tf.metrics.accuracy(
546 | labels=label_ids, predictions=predictions,
547 | weights=is_real_example)
548 | loss = tf.metrics.mean(
549 | values=per_example_loss, weights=is_real_example)
550 | return {
551 | "eval_accuracy": accuracy,
552 | "eval_loss": loss,
553 | }
554 | elif task_name == "sts-b":
555 | def metric_fn(per_example_loss, label_ids, logits, is_real_example):
556 | """Compute Pearson correlations for STS-B."""
557 | # Display labels and predictions
558 | concat1 = contrib_metrics.streaming_concat(logits)
559 | concat2 = contrib_metrics.streaming_concat(label_ids)
560 |
561 | # Compute Pearson correlation
562 | pearson = contrib_metrics.streaming_pearson_correlation(
563 | logits, label_ids, weights=is_real_example)
564 |
565 | # Compute MSE
566 | # mse = tf.metrics.mean(per_example_loss)
567 | mse = tf.metrics.mean_squared_error(
568 | label_ids, logits, weights=is_real_example)
569 |
570 | loss = tf.metrics.mean(
571 | values=per_example_loss,
572 | weights=is_real_example)
573 |
574 | return {"pred": concat1, "label_ids": concat2, "pearson": pearson,
575 | "MSE": mse, "eval_loss": loss,}
576 | elif task_name == "cola":
577 | def metric_fn(per_example_loss, label_ids, logits, is_real_example):
578 |         """Compute Matthew's correlation for CoLA."""
579 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
580 | # https://en.wikipedia.org/wiki/Matthews_correlation_coefficient
581 | tp, tp_op = tf.metrics.true_positives(
582 | predictions, label_ids, weights=is_real_example)
583 | tn, tn_op = tf.metrics.true_negatives(
584 | predictions, label_ids, weights=is_real_example)
585 | fp, fp_op = tf.metrics.false_positives(
586 | predictions, label_ids, weights=is_real_example)
587 | fn, fn_op = tf.metrics.false_negatives(
588 | predictions, label_ids, weights=is_real_example)
589 |
590 | # Compute Matthew's correlation
591 | mcc = tf.div_no_nan(
592 | tp * tn - fp * fn,
593 | tf.pow((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn), 0.5))
594 |
595 | # Compute accuracy
596 | accuracy = tf.metrics.accuracy(
597 | labels=label_ids, predictions=predictions,
598 | weights=is_real_example)
599 |
600 | loss = tf.metrics.mean(
601 | values=per_example_loss,
602 | weights=is_real_example)
603 |
604 | return {"matthew_corr": (mcc, tf.group(tp_op, tn_op, fp_op, fn_op)),
605 | "eval_accuracy": accuracy, "eval_loss": loss,}
606 |
607 | eval_metrics = (metric_fn,
608 | [per_example_loss, label_ids, logits, is_real_example])
609 | output_spec = contrib_tpu.TPUEstimatorSpec(
610 | mode=mode,
611 | loss=total_loss,
612 | eval_metrics=eval_metrics,
613 | scaffold_fn=scaffold_fn)
614 | else:
615 | output_spec = contrib_tpu.TPUEstimatorSpec(
616 | mode=mode,
617 | predictions={
618 | "probabilities": probabilities,
619 | "predictions": predictions
620 | },
621 | scaffold_fn=scaffold_fn)
622 | return output_spec
623 |
624 | return model_fn
625 |
626 |
627 | # This function is not used by this file but is still used by the Colab and
628 | # people who depend on it.
629 | def input_fn_builder(features, seq_length, is_training, drop_remainder):
630 | """Creates an `input_fn` closure to be passed to TPUEstimator."""
631 |
632 | all_input_ids = []
633 | all_input_mask = []
634 | all_segment_ids = []
635 | all_label_ids = []
636 |
637 | for feature in features:
638 | all_input_ids.append(feature.input_ids)
639 | all_input_mask.append(feature.input_mask)
640 | all_segment_ids.append(feature.segment_ids)
641 | all_label_ids.append(feature.label_id)
642 |
643 | def input_fn(params):
644 | """The actual input function."""
645 | batch_size = params["batch_size"]
646 |
647 | num_examples = len(features)
648 |
649 | # This is for demo purposes and does NOT scale to large data sets. We do
650 | # not use Dataset.from_generator() because that uses tf.py_func which is
651 | # not TPU compatible. The right way to load data is with TFRecordReader.
652 | d = tf.data.Dataset.from_tensor_slices({
653 | "input_ids":
654 | tf.constant(
655 | all_input_ids, shape=[num_examples, seq_length],
656 | dtype=tf.int32),
657 | "input_mask":
658 | tf.constant(
659 | all_input_mask,
660 | shape=[num_examples, seq_length],
661 | dtype=tf.int32),
662 | "segment_ids":
663 | tf.constant(
664 | all_segment_ids,
665 | shape=[num_examples, seq_length],
666 | dtype=tf.int32),
667 | "label_ids":
668 | tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),
669 | })
670 |
671 | if is_training:
672 | d = d.repeat()
673 | d = d.shuffle(buffer_size=100)
674 |
675 | d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder)
676 | return d
677 |
678 | return input_fn
679 |
680 |
681 | # This function is not used by this file but is still used by the Colab and
682 | # people who depend on it.
683 | def convert_examples_to_features(examples, label_list, max_seq_length,
684 | tokenizer, task_name):
685 | """Convert a set of `InputExample`s to a list of `InputFeatures`."""
686 |
687 | features = []
688 | print('Length of examples:',len(examples))
689 | for (ex_index, example) in enumerate(examples):
690 | if ex_index % 10000 == 0:
691 | #tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
692 | print("Writing example %d of %d" % (ex_index, len(examples)))
693 | feature = convert_single_example(ex_index, example, label_list,
694 | max_seq_length, tokenizer, task_name)
695 |
696 | features.append(feature)
697 | return features
698 |
699 |
700 |
701 |
702 | # Load parameters
703 | max_seq_length = hp.sequence_length
704 | do_lower_case = hp.do_lower_case
705 | vocab_file = hp.vocab_file
706 | tokenizer = tokenization.FullTokenizer.from_scratch(vocab_file=vocab_file,
707 | do_lower_case=do_lower_case,
708 | spm_model_file=None)
709 | processor = ClassifyProcessor()
710 | label_list = processor.get_labels()
711 | data_dir = hp.data_dir
712 |
713 |
714 | def get_features():
715 | # Load train data
716 | train_examples = processor.get_train_examples(data_dir)
717 | # Get onehot feature
718 | features = convert_examples_to_features( train_examples, label_list, max_seq_length, tokenizer,task_name='classify')
719 | input_ids = [f.input_ids for f in features]
720 | input_masks = [f.input_mask for f in features]
721 | segment_ids = [f.segment_ids for f in features]
722 | label_ids = [f.label_id for f in features]
723 | print('Get features finished!')
724 | return input_ids,input_masks,segment_ids,label_ids
725 |
726 |
727 | def get_features_test():
728 | # Load test data
729 | train_examples = processor.get_test_examples(data_dir)
730 | # Get onehot feature
731 | features = convert_examples_to_features( train_examples, label_list, max_seq_length, tokenizer,task_name='classify_test')
732 | input_ids = [f.input_ids for f in features]
733 | input_masks = [f.input_mask for f in features]
734 | segment_ids = [f.segment_ids for f in features]
735 | label_ids = [f.label_id for f in features]
736 | print('Get features(test) finished!')
737 | return input_ids,input_masks,segment_ids,label_ids
738 |
739 |
740 | def create_example(line,set_type):
741 | """Creates examples for the training and dev sets."""
742 | guid = "%s-%s" % (set_type, 1)
743 | text_a = tokenization.convert_to_unicode(line[1])
744 | label = tokenization.convert_to_unicode(line[0])
745 | example = InputExample(guid=guid, text_a=text_a, text_b=None, label=label)
746 | return example
747 |
748 |
749 | def get_feature_test(sentence):
750 | example = create_example(['0',sentence],'test')
751 | feature = convert_single_example(0, example, label_list,max_seq_length, tokenizer,task_name='classify')
752 | return feature.input_ids,feature.input_mask,feature.segment_ids,feature.label_id
753 |
754 |
755 | def get_features_emoji():
756 | # Load train data
757 | lines = read_csv(os.path.join(hp.data_dir, hp.train_data))
758 | # Get features emoji
759 | contents = [l[1] for l in lines]
760 | contents_emoji = [get_emoji(l) for l in contents]
761 | return [emoji2id(l) for l in contents_emoji]
762 |
763 |
764 | def get_features_emoji_test():
765 | # Load train data
766 | lines = read_csv(os.path.join(hp.data_dir, hp.test_data))
767 | # Get features emoji
768 | contents = [l[1] for l in lines]
769 | contents_emoji = [get_emoji(l) for l in contents]
770 | return [emoji2id(l) for l in contents_emoji]
771 |
772 |
773 | def get_feature_emoji_test(sentence):
774 | # Get feature emoji
775 | content_emoji = get_emoji(sentence)
776 | return emoji2id(content_emoji)
777 |
778 |
779 | if __name__ == '__main__':
780 |     ## Test: get features for a sample sentence
781 | sentence = '天天向上'
782 | feature = get_feature_test(sentence)
783 | print(feature)
784 | sentence = "⏲⌚豌豆🥳🥳🥳射手🥰🦰"
785 | feature = get_feature_emoji_test(sentence)
786 | print(feature)
787 |
788 |
789 |
790 |
791 |
792 |
793 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert_emoji/data/README.md:
--------------------------------------------------------------------------------
1 | # Dataset
2 | - data/sa_train.csv
3 | - data/sa_test.csv
4 |
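5 | ## Format
6 | Both files are UTF-8 CSVs with a `content` column (a short product-review sentence) and a `label` column whose values 1 / 0 / -1 appear to mark positive / neutral / negative sentiment. For example, the first rows of `sa_test.csv`:
7 | ```
8 | content,label
9 | 新配色喜欢,1
10 | 一星也不想给,-1
11 | 我就想说上个苹果X我媳妇儿用了差不多两年,0
12 | ```
13 |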
--------------------------------------------------------------------------------
/sentiment_analysis_albert_emoji/data/sa_test.csv:
--------------------------------------------------------------------------------
1 | content,label
2 | 新配色喜欢,1
3 | 一星也不想给,-1
4 | 我就想说上个苹果X我媳妇儿用了差不多两年,0
5 | 大部分人还是被iso的流畅度吸引,1
6 | 一共买了7台,0
7 | 全金属身、超薄机身,1
8 | 还是值得了,1
9 | 紫色真的很好看,1
10 | 下次介绍朋友买,1
11 | 四个档位的水雾都拍了一张图,0
12 | 外观我觉得还可以,1
13 | 而且快递竟然打电话说下午不送了,-1
14 | 比不多一个晚上,0
15 | 这说明原创上电量使用是吹了牛+的,-1
16 | 放客厅非常合适,1
17 | 可惜没那么多钱了,0
18 | 因为没有防抖,-1
19 | 一个办用,0
20 | 电陶炉不挑锅具真是太方便啦,1
21 | 安装师傅也很尽心尽力,1
22 | 但是就是等得太久了,-1
23 | 网络速度快了很多,1
24 | 非标快哦,1
25 | 目前体验了下,0
26 | 不带音频,0
27 | 让自己想办法,0
28 | 但是时间长了就好了,0
29 | 客服人员服务态度也很好,1
30 | 买来给老父亲的,0
31 | 简直就是卡死,-1
32 | 最烦的就是这个充电头了,-1
33 | 女生也能单手操作,1
34 | 总是乱来,-1
35 | 国际品牌,0
36 | 虽然四个滤芯儿是送的,0
37 | 商品跟原装的一样灵活,1
38 | 安装师傅态度还不错.第二天打了客服电话后,1
39 | 这次给女儿新房又买一款,1
40 | 家里基本上都是小米的产品,1
41 | 两个人用完全没问题,1
42 | 超级超级好,1
43 | 在4米范围内小声说话都能收到音,1
44 | 水质过滤后改善了好多,1
45 | 都是屏幕和外框接口有缝隙,-1
46 | 开始想买康佳用的是LG萍屏,0
47 | 说月底盘点月初发,0
48 | 70寸还带主机音响,1
49 | 主要功能用于看时间跟测各种健康指标,1
50 | 布料也比较结实,1
51 | 但是整体体验很好,1
52 | 可以洗好多碗具,1
53 | 看着就非常的爽,1
54 | 摁一下自动匹配,0
55 | 才想起来没有评论,0
56 | 手感较好,1
57 | 电量也只用了一格,1
58 | 用过两个了,0
59 | 你们自己慢慢品,0
60 | 整体的效果不错,1
61 | 就像地摊20块钱以下的东西,-1
62 | 水容器大,1
63 | 讲真的以后真的不想再买苹果了,-1
64 | 外形外观:非常的好看,1
65 | 屏幕音效:质还是不错的,1
66 | 然后才发现原来的手机是移动定制机,0
67 | 就直接放在保安室,0
68 | 外观在能接受的范围,1
69 | 电池续航能力也不错,1
70 | 噪音比以前的小太多,1
71 | 价格贵点,-1
72 | 老公也说好,1
73 | 再清澈一些就好了,0
74 | 然后快递师傅还挺好的,1
75 | 一路给我打电话送到家的,1
76 | 适合小设计师们,1
77 | 用了空气泡沫包装,0
78 | 送给对象很喜欢,1
79 | 这款Lenovo,0
80 | 只当给自己起心理安慰吧,1
81 | 这两天热的时候还能降降温,1
82 | 我家阳台小尺寸刚刚好,1
83 | 送个老公的生日礼物,0
84 | 虽说不是JD自己的物流(日日顺快递公司),0
85 | 屏幕音效:l很好的,1
86 | 垃圾的东西,-1
87 | 唯一的缺点应该是拍照和想象中的差点,-1
88 | 商家送的备用PP棉芯很好,1
89 | 望继续做好售后服务,0
90 | 还好客服售后耐心与我沟通消除误会,1
91 | 荣耀品质做工越来越好了,1
92 | 我买的是86寸的电视,0
93 | 国产机越来越好了,1
94 | 使用起来相当流畅,1
95 | 京东买的有保障,1
96 | 智能反应超级迟钝,-1
97 | 用起来键盘还不错,1
98 | 居然还是坏的,-1
99 | 甚至可以用评价高达模型的角度来评价他,0
100 | 给远程安装,0
101 | 同时存电话号码时拼音输不全,-1
--------------------------------------------------------------------------------
/sentiment_analysis_albert_emoji/data/sa_train.csv:
--------------------------------------------------------------------------------
1 | content,label
2 | 超划算的,1
3 | iPhone的性价比之王,1
4 | 一套不同国家电源转换插头,-1
5 | 由于安装师傅上门安装时,0
6 | 55英寸的这个创维,0
7 | 我更喜欢那个的感觉,1
8 | 拿到手机第二天莫名其妙保护膜就裂了,-1
9 | 全部试了一遍都不行,-1
10 | 送的装备也很齐全,1
11 | 原来买的是内存32G的我在用好用,1
12 | 因为修剪器充电口处有一凸起,-1
13 | 提升幸福感必备,1
14 | 9号送人,0
15 | 紫色骚气,1
16 | 搭配手持吸尘器,0
17 | 包装的很精细,1
18 | 买了好几个华为手机了,1
19 | 当时还一下子买了3台,0
20 | 很喜欢这个牌子的啊,1
21 | 运行速度:除了反应慢,-1
22 | 关键是买的别的加湿器都是水干自动断电,-1
23 | 我只想问你们一个电脑利润就这么高,-1
24 | 用用还可以暂时没有发现问题,1
25 | 在京东购物总是让我高兴,1
26 | 这个门卡使用非常灵敏、方便,1
27 | 包装充分完好,1
28 | 华为的路由器好,1
29 | 希望以后也可以的屏幕跟音效都不错的,1
30 | 无论是正面还是侧面,0
31 | 高大上的东西,1
32 | 我的收据没帮开过来,-1
33 | 外形外观:和x一样,0
34 | 商家能积极处理客户碰到的问题,1
35 | 确实不会买苹果的手机,-1
36 | 目前使用一切正常,1
37 | 唯一觉得不足的就是app,-1
38 | 其他特色:现在我京东的速度很快,1
39 | 直接不回,-1
40 | 两个不足:,-1
41 | 很好支持国货,1
42 | 从内到外都很棒,1
43 | 的质量很好,1
44 | 待机时间:能从早用到晚,1
45 | 待机时间:续航是个鸡肋,-1
46 | 看看没动手,0
47 | 说不能直接放在地板上,0
48 | 后续体验再来追评,0
49 | 而且我还是联通卡,0
50 | 我们生活使用日常杠杠滴,1
51 | 充电慢费电快,-1
52 | 自从知道评论之后京豆可以抵现金了,0
53 | 运行速度:效果很不错,1
54 | 中午送到马上装好,0
55 | 然后Windows10和以前的系统有点不同,0
56 | 到的也快,1
57 | 感觉没有我的5T好用,-1
58 | 相信京东没问题,1
59 | 晚上睡觉开起好睡多了,1
60 | 华为是首选,1
61 | 还好可以,1
62 | 适合单身租房的,1
63 | 但是只给5月24以后买的保价,0
64 | 昨晚定,0
65 | 电池比较耐用、不再担心一天一次冲,1
66 | 已经离不开戴森了,1
67 | 物流速度很快电脑到了没有任何瑕疵,1
68 | 感觉音质很棒,1
69 | 待机时间没使用的情况下4天左右可以的,0
70 | 拍照效果:基本不用,0
71 | 这次活动价,0
72 | 东西用着很好,1
73 | 好不好用还没试,0
74 | 后悔换苹果了,-1
75 | 看起来挺不错,1
76 | 以后还找你购买,1
77 | 总算是没有让人失望,1
78 | 待机时间:全是很久,1
79 | 和v20犹豫了一番,0
80 | 扫拖一起干,1
81 | 这次买净水器滤芯也是很坚决,0
82 | 买的也都是256G,0
83 | 反复装卡重启都不行,-1
84 | 我觉得有点厚,-1
85 | 结果京东物流速度很快,1
86 | 公司常年采购,0
87 | 到货速度稍慢些,-1
88 | 宝宝们都很喜欢,1
89 | 外观看上去赞,1
90 | 这个电水壶确实好用,1
91 | 物流又给力,1
92 | 在门上有档次解锁识别速度,1
93 | 商家非常诚信,1
94 | 半个月过去了,0
95 | 容量大小:3口之家正合适,1
96 | 而且京东快递小姐姐服务还好,1
97 | 平时折叠放着特别的小,0
98 | 屏幕音效:屏幕a屏,0
99 | 由于是双十一买的,0
100 | 也优惠不了几块钱,-1
101 | 打400电话也不给安装,-1
102 | 电脑性价比可以,1
103 | 送货的师傅服务超好,1
104 | 拍照效果:拍照效果就是下面那个特别真实清晰,1
105 | 这次装修先预埋,0
106 | 这款颜色漂亮,1
107 | 我严重怀疑他们发的是翻新机,-1
108 | 一个人用大小刚好,1
109 | 二档稍微有点声音,0
110 | 比如刷,0
111 | 手机一直忘了来评价,0
112 | 好看的不得了,1
113 | 而且贵的壶和便宜的区别不大,-1
114 | 用了很长时间才来评论的,0
115 | 但是你要是戴上眼镜就不能识别了,-1
116 | 尺度也够用,1
117 | 好在很清晰,1
118 | 节省了地方,1
119 | 妈妈换手机很方便,1
120 | 运行速度:运行速度比其它手机更优秀,1
121 | 师傅送货也快,1
122 | 就拍了,0
123 | 自己买配件升级拖地,0
124 | 另外充电器有点大不太方便携带,-1
125 | 指示灯都是准确完好,1
126 | 外形外观:高端大气配置较高,1
127 | 非常失望的一次购物,-1
128 | 东西真心一般,0
129 | 还赠送了手机卡,1
130 | 忘记拍照就不上图了,0
131 | 事很多,0
132 | 比L580画面清晰,1
133 | 音质很好画质也很清晰,1
134 | 极具个性,1
135 | 作为入伙礼物买的,0
136 | 再到指纹锁的使用,0
137 | 售前售后都好,1
138 | 热情服务完,1
139 | 产品吸尘很好,1
140 | 这个价格买的很值得,1
141 | 这次买的电器很满意,1
142 | 和我同事华为的比快多了,1
143 | 以后买手机还来这买,1
144 | 中度使用也能一天,1
145 | 外形外观:手机,0
146 | 最最重要的是拿到手已经贴了原装手机屏幕膜,1
147 | 金属后盖,0
148 | 855puls稳得一批,1
149 | 办公用挺好,1
150 | 买来运动计步用的,0
151 | 扫起来很干净,1
152 | 帮朋友公司购买,0
153 | 却是没有水雾,-1
154 | 这款ⅵvo手机配骁龙855芯片,0
155 | 这个试了,0
156 | 主要客服也非常好哦,1
157 | 没想到千元机也这么强大,1
158 | 老婆很喜欢大小正合适,1
159 | 肖正友,0
160 | 发票到了,0
161 | 商家才有这样的底气,0
162 | 运行速度:麒麟980处理器运行速度很快,1
163 | 拍照质量真的超赞,1
164 | 噪音大小:还好,1
165 | 听着没啥感觉唉,0
166 | 来了就直接点亮了,0
167 | 相机算是对得起价格了,1
168 | 大小正好合适,1
169 | 沾了点水,0
170 | 就是插座这总插拔会不会坏,0
171 | 1.照相,0
172 | 卖家检测没问题,1
173 | 还一百多,0
174 | 诱光灯的效果相当不错,1
175 | 开关门声音太大了,-1
176 | 需要找人安装,0
177 | 颜值俱佳,1
178 | 装的时候就开了两三个小时还凉的挺快,1
179 | 所以硬盘都得多分点,0
180 | 值得回购哟,1
181 | 智能锁之初体验,0
182 | w这个配件还可以,1
183 | 然后还是拆开看了下,0
184 | 今年夏天特别的热,0
185 | 希望质量一如既往的好,1
186 | 购买之前对比了很多款式和牌子的电视,0
187 | 小孩子爱看动画片,1
188 | 其他的话有待日后确认,0
189 | 送给长辈用的,0
190 | 不过找遍了京东,0
191 | 好的卖家卖的宝贝,1
192 | 准备洗了,0
193 | 电视50寸以为没多大,0
194 | 600多的加湿器,0
195 | 搞得我水漏了一地,-1
196 | 一会要这样一会要那样,-1
197 | 五六个了,0
198 | 不知道是几手的,0
199 | 至少做工很不错,1
200 | 得亏有它,1
201 | 和电影里一样,0
202 | 新的刚到手就坏了你敢信,-1
203 | 基本每次充电不到半小时就能满,1
204 | 提示无网络,-1
205 | 东西巳收到,0
206 | 运行速度:暂时用起来很快,1
207 | 这两天主要试机,0
208 | 但是在京东买了这个给父亲用,0
209 | 但是图像处理能力很好,1
210 | 到今天一个月了,0
211 | 不要太挑剔,0
212 | 没其他毛病,1
213 | 解压压缩包就很热,-1
214 | 售后告诉我夏天要开6档,0
215 | 不过就是出雾好小,-1
216 | 最重要的是价格优惠,1
217 | 好说歹说收下了,0
218 | 充电二十多分钟就能充满,1
219 | 客服售后态度极差,-1
220 | 用了一天了没有一点水遗留在桌子上,1
221 | 没有蒸菜层的,-1
222 | 小不足,-1
223 | 耳机总是自己断开,-1
224 | 应该不影响使用,1
225 | 手机颜色跟图片有点不同,-1
226 | 自己找,0
227 | 值得购买大品牌,1
228 | 如果性价比可以再高一点的话就好啦,0
229 | 用十年是没问题的,1
230 | 按网上教程试了半天u盘里的都不识别,-1
231 | 不再断线了、超爽,1
232 | 估计我得抱着它睡觉,0
233 | 这次基础版入手,0
234 | 送的屏幕膜,1
235 | 在反复开关机之后才激活成功,-1
236 | 因为比实体店会便宜实惠好多,1
237 | 还没用就先恢复出厂设置了,-1
238 | 体验下,0
239 | 红薯粉色小鳄鱼竿支架子鼓浪屿啊啊啊啊五环之歌名,0
240 | 但是客服连最基本的东西都不懂,-1
241 | 雾气刚开始很小,-1
242 | 其他特色:就是原装充电器充电很慢,-1
243 | 华为10plus,0
244 | 今天安上了,0
245 | 下次再试试看,0
246 | 加水量并不多,1
247 | 说是可以无理由退换,1
248 | 总结:不推荐此型号,-1
249 | 不知道怎么样用段时间在看看,0
250 | 未使用的情况下,0
251 | 还好是练体育的,0
252 | 但是运行速度还行,1
253 | 问了又说去那边问,-1
254 | 也买了一台,0
255 | 收到宝贝特意用了一段时间过来评价,0
256 | 安装的师傅也很尽心,1
257 | 想不想买也就自己看的办吧,0
258 | 安装起来也很简单,1
259 | 还送整机10年保修,1
260 | 其他特色:拍照效果好,1
261 | 放两条就洗不干净了,-1
262 | 也不厚,0
263 | 对用户使用很友好,1
264 | 物流服务都超好,1
265 | 就是最简单的一种,1
266 | 比想象中的要小的多,-1
267 | 没想到这么大声音,-1
268 | 电视语音识别度很高在厨房都可以听清楚你要说的话,1
269 | 但是还是稍微重了一些,-1
270 | 和商家描述得一致,1
271 | 手机外观轻薄款音效非常好,1
272 | 厂家直接给我换一个,1
273 | 放心了不少,1
274 | 这个还没换呢,0
275 | 装上后那水立刻变得清澈透明,1
276 | 质量问题,-1
277 | 可以无线打印,1
278 | 比较热情,1
279 | 这种电器之类的,0
280 | 安装师傅的服务态度非常好,1
281 | 质量也满意,1
282 | 按照他们的方法,0
283 | 今天最后一局还剩百分之11的电我又开了一局,0
284 | 很好正品,1
285 | 亲测待机量30+,0
286 | 但是想起这个屏幕,0
287 | 沉浸式音乐感受&rdquo,1
288 | 京东大家电价格便宜,1
289 | 然后色彩还原度也比较好,1
290 | 1-2秒就能识别,1
291 | 明明是想买路由器的,0
292 | 能带大耳机,0
293 | 商品没啥,0
294 | 出差、旅行携带方便,1
295 | 声明快递员很给力,1
296 | 现在的快递都是下楼自己拿的,0
297 | 还没有我九块九三张的好,-1
298 | 特意把儿子煲好的IE60拿了试了下,0
299 | 快递很差,-1
300 | 为健康观影保驾护航,0
301 | 外形外观:白色的加点彩虹一样的颜色,0
302 | 非常美😊,1
303 | 非常耐心❤️",1
304 | 非常耐斯👌,1
305 | 非常讨人喜欢💕,1
306 | 非常让人喜欢😊,1
307 | 非常贴近我的状👌态,1
308 | 非常赞👍,1
309 | 非常赞👍家里之前的冰箱有点小了,1
310 | 非常迷你😄,1
311 | 非常适合女生👍👍,1
312 | 非常适合我这种臭美的女生😏,1
313 | 非常适合看😜😜😜,1
314 | 非常鄙视👎👎👎👎,-1
315 | 非常长不错😊",1
316 | #😠骗人的吗我投诉,-1
317 | #🤢🤢🤢,-1
318 | (泽米张帆)❤",1
319 | (给租户买的😂)颜值还挺高的,1
320 | )-♡😊喜欢的程度没法用言语表达,1
321 | )_💑啊对充电速度真的很快,1
322 | *****超棒哈哈哈😁,1
323 | **商家、恶心🤢,-1
324 | **让我彻底伤心💔,-1
325 | 日了🐶,-1
326 | 日了🐶了,-1
327 | 2.自己的苏宁物流慢的像一坨💩(第一天早上八点下单,-1
--------------------------------------------------------------------------------
/sentiment_analysis_albert_emoji/hyperparameters.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon Nov 12 14:23:12 2020
4 |
5 | @author: cm
6 | """
7 |
8 |
9 | import os
10 | import sys
11 | pwd = os.path.dirname(os.path.abspath(__file__))
12 | sys.path.append(pwd)
13 |
14 | from sentiment_analysis_albert_emoji.utils import load_vocabulary
15 |
16 |
17 | class Hyperparamters:
18 | # Train parameters
19 | num_train_epochs = 5
20 | print_step = 10
21 | batch_size = 64
22 | batch_size_eval = 128
23 | summary_step = 10
24 | num_saved_per_epoch = 3
25 | max_to_keep = 100
26 |
27 | # File model
28 | logdir = 'logdir/model_01'
29 | file_save_model = 'model/model_save'
30 | file_load_model = 'model/model_load'
31 |
32 | # Train data and test data
33 | train_data = "sa_train.csv"
34 | test_data = "sa_test.csv"
35 |
36 | # Optimization parameters
37 | warmup_proportion = 0.1
38 | use_tpu = None
39 | do_lower_case = True
40 | learning_rate = 5e-5
41 |
42 | # TextCNN parameters
43 | num_filters = 128
44 | filter_sizes = [2,3,4,5,6,7]
45 | embedding_size = 768
46 | keep_prob = 0.5
47 |
48 | # Sequence and Label
49 | sequence_length = 128
50 | num_labels = 3
51 | dict_label = {
52 | '0': '-1',
53 | '1': '0',
54 | '2': '1'}
55 |
56 | # ALBERT parameters
57 | name = 'albert_base_zh'
58 | bert_path = os.path.join(pwd,name)
59 | data_dir = os.path.join(pwd,'data')
60 | vocab_file = os.path.join(pwd,name,'vocab_chinese.txt')
61 | init_checkpoint = os.path.join(pwd,name,'albert_model.ckpt')
62 | saved_model_path = os.path.join(pwd,'model')
63 |
64 | # Emoji
65 | sequence_length_emoji = 16
66 | file_vocab_emoji = os.path.join(pwd, name, 'vocab_emoji.txt')
67 | vocab_emoji_id2char,vocab_emoji_char2id = load_vocabulary(file_vocab_emoji)
68 | vocab_size_emoji = len(vocab_emoji_char2id)
69 |
70 |
71 | if __name__ == '__main__':
72 | hp = Hyperparamters()
73 | print(hp.batch_size)
74 |
75 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert_emoji/lamb_optimizer.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon Nov 12 14:23:12 2020
4 |
5 | @author: cm
6 | """
7 |
8 |
9 |
10 | from __future__ import absolute_import
11 | from __future__ import division
12 | from __future__ import print_function
13 |
14 | import re
15 | import six
16 | import tensorflow.compat.v1 as tf
17 | from tensorflow.python.ops import array_ops
18 | from tensorflow.python.ops import linalg_ops
19 | from tensorflow.python.ops import math_ops
20 |
21 |
22 |
23 | class LAMBOptimizer(tf.train.Optimizer):
24 | """LAMB (Layer-wise Adaptive Moments optimizer for Batch training)."""
25 | # A new optimizer that includes correct L2 weight decay, adaptive
26 |   # element-wise updating, and layer-wise adaptation. The LAMB optimizer
27 | # was proposed by Yang You, Jing Li, Jonathan Hseu, Xiaodan Song,
28 |   # James Demmel, and Cho-Jui Hsieh in a paper titled Reducing BERT
29 | # Pre-Training Time from 3 Days to 76 Minutes (arxiv.org/abs/1904.00962)
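  # In apply_gradients below, each parameter's Adam-style update is additionally rescaled by
  # the "trust ratio" ||w|| / ||update|| (falling back to 1.0 when either norm is zero)
  # before the learning rate is applied; this is the layer-wise adaptation.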
30 |
31 | def __init__(self,
32 | learning_rate,
33 | weight_decay_rate=0.0,
34 | beta_1=0.9,
35 | beta_2=0.999,
36 | epsilon=1e-6,
37 | exclude_from_weight_decay=None,
38 | exclude_from_layer_adaptation=None,
39 | name="LAMBOptimizer"):
40 | """Constructs a LAMBOptimizer."""
41 | super(LAMBOptimizer, self).__init__(False, name)
42 |
43 | self.learning_rate = learning_rate
44 | self.weight_decay_rate = weight_decay_rate
45 | self.beta_1 = beta_1
46 | self.beta_2 = beta_2
47 | self.epsilon = epsilon
48 | self.exclude_from_weight_decay = exclude_from_weight_decay
49 | # exclude_from_layer_adaptation is set to exclude_from_weight_decay if the
50 | # arg is None.
51 | # TODO(jingli): validate if exclude_from_layer_adaptation is necessary.
52 | if exclude_from_layer_adaptation:
53 | self.exclude_from_layer_adaptation = exclude_from_layer_adaptation
54 | else:
55 | self.exclude_from_layer_adaptation = exclude_from_weight_decay
56 |
57 | def apply_gradients(self, grads_and_vars, global_step=None, name=None):
58 | """See base class."""
59 | assignments = []
60 | for (grad, param) in grads_and_vars:
61 | if grad is None or param is None:
62 | continue
63 |
64 | param_name = self._get_variable_name(param.name)
65 |
66 | m = tf.get_variable(
67 | name=six.ensure_str(param_name) + "/adam_m",
68 | shape=param.shape.as_list(),
69 | dtype=tf.float32,
70 | trainable=False,
71 | initializer=tf.zeros_initializer())
72 | v = tf.get_variable(
73 | name=six.ensure_str(param_name) + "/adam_v",
74 | shape=param.shape.as_list(),
75 | dtype=tf.float32,
76 | trainable=False,
77 | initializer=tf.zeros_initializer())
78 |
79 | # Standard Adam update.
80 | next_m = (
81 | tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
82 | next_v = (
83 | tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
84 | tf.square(grad)))
85 |
86 | update = next_m / (tf.sqrt(next_v) + self.epsilon)
87 |
88 | # Just adding the square of the weights to the loss function is *not*
89 | # the correct way of using L2 regularization/weight decay with Adam,
90 | # since that will interact with the m and v parameters in strange ways.
91 | #
92 |       # Instead we want to decay the weights in a manner that doesn't interact
93 | # with the m/v parameters. This is equivalent to adding the square
94 | # of the weights to the loss with plain (non-momentum) SGD.
95 | if self._do_use_weight_decay(param_name):
96 | update += self.weight_decay_rate * param
97 |
98 | ratio = 1.0
99 | if self._do_layer_adaptation(param_name):
100 | w_norm = linalg_ops.norm(param, ord=2)
101 | g_norm = linalg_ops.norm(update, ord=2)
102 | ratio = array_ops.where(math_ops.greater(w_norm, 0), array_ops.where(
103 | math_ops.greater(g_norm, 0), (w_norm / g_norm), 1.0), 1.0)
104 |
105 | update_with_lr = ratio * self.learning_rate * update
106 |
107 | next_param = param - update_with_lr
108 |
109 | assignments.extend(
110 | [param.assign(next_param),
111 | m.assign(next_m),
112 | v.assign(next_v)])
113 | return tf.group(*assignments, name=name)
114 |
115 | def _do_use_weight_decay(self, param_name):
116 | """Whether to use L2 weight decay for `param_name`."""
117 | if not self.weight_decay_rate:
118 | return False
119 | if self.exclude_from_weight_decay:
120 | for r in self.exclude_from_weight_decay:
121 | if re.search(r, param_name) is not None:
122 | return False
123 | return True
124 |
125 | def _do_layer_adaptation(self, param_name):
126 | """Whether to do layer-wise learning rate adaptation for `param_name`."""
127 | if self.exclude_from_layer_adaptation:
128 | for r in self.exclude_from_layer_adaptation:
129 | if re.search(r, param_name) is not None:
130 | return False
131 | return True
132 |
133 | def _get_variable_name(self, param_name):
134 | """Get the variable name from the tensor name."""
135 | m = re.match("^(.*):\\d+$", six.ensure_str(param_name))
136 | if m is not None:
137 | param_name = m.group(1)
138 | return param_name
139 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert_emoji/model/model_load/README.md:
--------------------------------------------------------------------------------
1 | # Model used for inference
2 |
3 | ## Model path
4 | - model_load/checkpoint
5 | - model_load/model_xx_0.ckpt.data-00000-of-00001
6 | - model_load/model_xx_0.ckpt.index
7 | - model_load/model_xx_0.ckpt.meta
8 |
9 | ## checkpoint contents
10 | ```
11 | model_checkpoint_path: "model_xx_0.ckpt"
12 | ```
13 |
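14 | ## Restoring the checkpoint
15 | For reference, `predict.py` in this package restores the model roughly as in the minimal sketch below (assuming TensorFlow 1.x and that `model_load/` contains the files listed above):
16 | ```
17 | import os
18 | import tensorflow as tf
19 | from sentiment_analysis_albert_emoji.networks import NetworkAlbert
20 | from sentiment_analysis_albert_emoji.hyperparameters import Hyperparamters as hp
21 |
22 | pwd = os.path.dirname(os.path.abspath(__file__))  # package directory, as in predict.py
23 | with tf.Graph().as_default():
24 |     sess = tf.Session()
25 |     with sess.as_default():
26 |         albert = NetworkAlbert(is_training=False)  # build the inference graph
27 |         saver = tf.train.Saver()
28 |         sess.run(tf.global_variables_initializer())
29 |         # the `checkpoint` file above tells TF which model_xx_0.ckpt to restore
30 |         ckpt = tf.train.get_checkpoint_state(os.path.join(pwd, hp.file_load_model))
31 |         saver.restore(sess, ckpt.model_checkpoint_path)
32 | ```
33 |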
--------------------------------------------------------------------------------
/sentiment_analysis_albert_emoji/model/model_save/README.md:
--------------------------------------------------------------------------------
1 | # Models saved during training
2 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert_emoji/modules.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu May 30 21:01:45 2020
4 |
5 | @author: cm
6 | """
7 |
8 |
9 | import tensorflow as tf
10 | from tensorflow.contrib.rnn import DropoutWrapper
11 |
12 | from sentiment_analysis_albert_emoji.hyperparameters import Hyperparamters as hp
13 |
14 |
15 |
16 | def cell_embedding(inputs):
17 | with tf.device('/cpu:0'):
18 | W = tf.Variable(tf.random_uniform([hp.vocab_size_emoji, hp.embedding_size], -1.0, 1.0), name="W_embedding")
19 | embedded_input = tf.nn.embedding_lookup(W,inputs)
20 | return embedded_input
21 |
22 |
23 | def cell_textcnn_two(inputs,is_training):
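  # As used in networks.py, `inputs` is the ALBERT sequence output concatenated with the
  # emoji embeddings, i.e. shape (batch_size, sequence_length + sequence_length_emoji,
  # embedding_size), so each max-pool below spans the text and emoji tokens jointly.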
24 | # Add a dimension in the end:-1
25 | inputs_expand = tf.expand_dims(inputs, -1)
26 | # Create a convolution + maxpool layer for each filter size
27 | pooled_outputs = []
28 | with tf.name_scope("TextCNN"):
29 | for i, filter_size in enumerate(hp.filter_sizes):
30 | with tf.name_scope("conv-maxpool-%s" % filter_size):
31 | # Convolution Layer
32 | filter_shape = [filter_size, hp.embedding_size, 1, hp.num_filters]
33 | W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1),dtype=tf.float32, name="W")
34 | b = tf.Variable(tf.constant(0.1, shape=[hp.num_filters]),dtype=tf.float32, name="b")
35 | conv = tf.nn.conv2d(
36 | inputs_expand,
37 | W,
38 | strides=[1, 1, 1, 1],
39 | padding="VALID",
40 | name="conv")
41 | # Apply nonlinearity
42 | h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
43 | # Maxpooling over the outputs
44 | pooled = tf.nn.max_pool(
45 | h,
46 | ksize=[1, hp.sequence_length + hp.sequence_length_emoji - filter_size + 1, 1, 1],
47 | strides=[1, 1, 1, 1],
48 | padding='VALID',
49 | name="pool")
50 | pooled_outputs.append(pooled)
51 | # Combine all the pooled features
52 | with tf.name_scope("Concat"):
53 | num_filters_total = hp.num_filters * len(hp.filter_sizes)
54 | h_pool = tf.concat(pooled_outputs, 3)
55 | h_pool_flat = tf.reshape(h_pool, [-1, num_filters_total])
56 | # Dropout
57 | h_pool_flat_dropout = tf.nn.dropout(h_pool_flat, keep_prob=hp.keep_prob if is_training else 1)
58 | return h_pool_flat_dropout
59 |
60 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert_emoji/networks.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu May 30 20:44:42 2020
4 |
5 | @author: cm
6 | """
7 |
8 |
9 | import os
10 | import tensorflow as tf
11 |
12 | from sentiment_analysis_albert_emoji import modeling,optimization
13 | from sentiment_analysis_albert_emoji.classifier_utils import ClassifyProcessor
14 | from sentiment_analysis_albert_emoji.hyperparameters import Hyperparamters as hp
15 | from sentiment_analysis_albert_emoji.utils import time_now_string
16 | from sentiment_analysis_albert_emoji.modules import cell_textcnn_two
17 | from sentiment_analysis_albert_emoji.modules import cell_embedding
18 |
19 |
20 | num_labels = hp.num_labels
21 | processor = ClassifyProcessor()
22 | bert_config_file = os.path.join(hp.bert_path,'albert_config.json')
23 | bert_config = modeling.AlbertConfig.from_json_file(bert_config_file)
24 |
25 |
26 | def count_model_params():
27 | """
28 |     Compute the number of trainable parameters
29 | """
30 | total_parameters = 0
31 | for variable in tf.trainable_variables():
32 | shape = variable.get_shape()
33 | variable_parameters = 1
34 | for dim in shape:
35 | variable_parameters *= dim.value
36 | total_parameters += variable_parameters
37 | print(' + Number of params: %.2fM' % (total_parameters / 1e6))
38 |
39 |
40 | class NetworkAlbert(object):
41 | def __init__(self,is_training):
42 | self.is_training = is_training
43 |
44 | # Chinese placeholder
45 | self.input_ids = tf.placeholder(tf.int32, shape=[None, hp.sequence_length], name='input_ids')
46 | self.input_masks = tf.placeholder(tf.int32, shape=[None, hp.sequence_length], name='input_masks')
47 | self.segment_ids = tf.placeholder(tf.int32, shape=[None, hp.sequence_length], name='segment_ids')
48 | self.label_ids = tf.placeholder(tf.int32, shape=[None], name='label_ids')
49 |
50 |
51 | # Load BERT Pre-training LM
52 | self.model = modeling.AlbertModel(
53 | config=bert_config,
54 | is_training=self.is_training,
55 | input_ids=self.input_ids,
56 | input_mask=self.input_masks,
57 | token_type_ids=self.segment_ids,
58 | use_one_hot_embeddings=False)
59 |
60 | # Get the feature vector with size 3D:(batch_size,sequence_length,hidden_size)
61 | output_layer_init = self.model.get_sequence_output()
62 |
63 | # Emoji placeholder
64 | self.input_emojis = tf.placeholder(tf.int32, shape=[None, hp.sequence_length_emoji], name='input_id_emoji')
65 |
66 | # Cell emoji embedding
67 | self.emoji_embedding = cell_embedding(self.input_emojis)
68 |
69 | # Concat
70 | output_all = tf.concat([output_layer_init,self.emoji_embedding],1)
71 |
72 | # Cell textcnn-emoji
73 | output_layer = cell_textcnn_two(output_all,self.is_training)
74 |
75 | # Hidden size
76 | hidden_size = output_layer.shape[-1].value
77 |
78 | # Full-connection
79 | with tf.name_scope("Full-connection"):
80 | output_weights = tf.get_variable(
81 | "output_weights", [num_labels, hidden_size],
82 | initializer=tf.truncated_normal_initializer(stddev=0.02))
83 | output_bias = tf.get_variable(
84 | "output_bias", [num_labels], initializer=tf.zeros_initializer())
85 | # Logit
86 | self.logits = tf.nn.bias_add(tf.matmul(output_layer, output_weights, transpose_b=True), output_bias)
87 | self.probabilities = tf.nn.softmax(self.logits, axis=-1)
88 |
89 | # Prediction
90 | with tf.variable_scope("Prediction"):
91 | self.preds = tf.argmax(self.logits, axis=-1, output_type=tf.int32)
92 |
93 | # Summary for tensorboard
94 | with tf.variable_scope("Loss"):
95 | if self.is_training:
96 | self.accuracy = tf.reduce_mean(tf.to_float(tf.equal(self.preds, self.label_ids)))
97 | tf.summary.scalar('Accuracy', self.accuracy)
98 |
99 | # Check whether has loaded model
100 | ckpt = tf.train.get_checkpoint_state(hp.saved_model_path)
101 | checkpoint_suffix = ".index"
102 | if ckpt and tf.gfile.Exists(ckpt.model_checkpoint_path + checkpoint_suffix):
103 | print('='*10,'Restoring model from checkpoint!','='*10)
104 | print("%s - Restoring model from checkpoint ~%s" % (time_now_string(),
105 | ckpt.model_checkpoint_path))
106 | else:
107 | # Load BERT Pre-training LM
108 | print('First time load BERT model!')
109 | tvars = tf.trainable_variables()
110 | if hp.init_checkpoint:
111 | (assignment_map, initialized_variable_names) = \
112 | modeling.get_assignment_map_from_checkpoint(tvars,
113 | hp.init_checkpoint)
114 | tf.train.init_from_checkpoint(hp.init_checkpoint, assignment_map)
115 |
116 | # Optimization
117 | if self.is_training:
118 | # Global_step
119 | self.global_step = tf.Variable(0, name='global_step', trainable=False)
120 | # Loss
121 | log_probs = tf.nn.log_softmax(self.logits, axis=-1)
122 | one_hot_labels = tf.one_hot(self.label_ids, depth=num_labels, dtype=tf.float32)
123 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
124 | self.loss = tf.reduce_mean(per_example_loss)
125 | # Optimizer
126 | train_examples = processor.get_train_examples(hp.data_dir)
127 | num_train_steps = int(
128 | len(train_examples) / hp.batch_size * hp.num_train_epochs)
129 | num_warmup_steps = int(num_train_steps * hp.warmup_proportion)
130 | self.optimizer = optimization.create_optimizer(self.loss,
131 | hp.learning_rate,
132 | num_train_steps,
133 | num_warmup_steps,
134 | hp.use_tpu,
135 | Global_step=self.global_step,
136 | )
137 | # Summary for tensorboard
138 | tf.summary.scalar('Loss', self.loss)
139 | self.merged = tf.summary.merge_all()
140 |
141 |         # Compute the parameters
142 | count_model_params()
143 | vs = tf.trainable_variables()
144 | for l in vs:
145 | print(l)
146 |
147 |
148 | if __name__ == '__main__':
149 | # Load model
150 | albert = NetworkAlbert(is_training=True)
151 |
152 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert_emoji/optimization.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon Nov 12 14:23:12 2020
4 |
5 | @author: cm
6 | """
7 |
8 |
9 |
10 | """Functions and classes related to optimization (weight updates)."""
11 |
12 | from __future__ import absolute_import
13 | from __future__ import division
14 | from __future__ import print_function
15 | import re
16 | import six
17 | from six.moves import zip
18 | import tensorflow.compat.v1 as tf
19 | from tensorflow.contrib import tpu as contrib_tpu
20 |
21 | from sentiment_analysis_albert_emoji import lamb_optimizer
22 |
23 |
24 |
25 | def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu,Global_step,
26 | optimizer="adamw", poly_power=1.0, start_warmup_step=0):
27 | """Creates an optimizer training op."""
28 | if Global_step:
29 | global_step = Global_step
30 | else:
31 | global_step = tf.train.get_or_create_global_step()
32 |
33 | learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)
34 |
35 | # Implements linear decay of the learning rate.
36 | learning_rate = tf.train.polynomial_decay(
37 | learning_rate,
38 | global_step,
39 | num_train_steps,
40 | end_learning_rate=0.0,
41 | power=poly_power,
42 | cycle=False)
43 |
44 | # Implements linear warmup. I.e., if global_step - start_warmup_step <
45 | # num_warmup_steps, the learning rate will be
46 | # `(global_step - start_warmup_step)/num_warmup_steps * init_lr`.
47 | if num_warmup_steps:
48 | tf.logging.info("++++++ warmup starts at step " + str(start_warmup_step)
49 | + ", for " + str(num_warmup_steps) + " steps ++++++")
50 | global_steps_int = tf.cast(global_step, tf.int32)
51 | start_warm_int = tf.constant(start_warmup_step, dtype=tf.int32)
52 | global_steps_int = global_steps_int - start_warm_int
53 | warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)
54 |
55 | global_steps_float = tf.cast(global_steps_int, tf.float32)
56 | warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)
57 |
58 | warmup_percent_done = global_steps_float / warmup_steps_float
59 | warmup_learning_rate = init_lr * warmup_percent_done
60 |
61 | is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
62 | learning_rate = (
63 | (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)
64 |
65 | # It is OK that you use this optimizer for finetuning, since this
66 | # is how the model was trained (note that the Adam m/v variables are NOT
67 | # loaded from init_checkpoint.)
68 | # It is OK to use AdamW in the finetuning even if the model was trained by LAMB.
69 | # As reported in the BERT public GitHub, the learning rate for SQuAD 1.1 finetuning
70 | # is 3e-5, 4e-5 or 5e-5. For LAMB, the users can use 3e-4, 4e-4,or 5e-4 for a
71 | # batch size of 64 in the finetune.
72 | if optimizer == "adamw":
73 | tf.logging.info("using adamw")
74 | optimizer = AdamWeightDecayOptimizer(
75 | learning_rate=learning_rate,
76 | weight_decay_rate=0.01,
77 | beta_1=0.9,
78 | beta_2=0.999,
79 | epsilon=1e-6,
80 | exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
81 | elif optimizer == "lamb":
82 | tf.logging.info("using lamb")
83 | optimizer = lamb_optimizer.LAMBOptimizer(
84 | learning_rate=learning_rate,
85 | weight_decay_rate=0.01,
86 | beta_1=0.9,
87 | beta_2=0.999,
88 | epsilon=1e-6,
89 | exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
90 | else:
91 | raise ValueError("Not supported optimizer: ", optimizer)
92 |
93 | if use_tpu:
94 | optimizer = contrib_tpu.CrossShardOptimizer(optimizer)
95 |
96 | tvars = tf.trainable_variables()
97 | grads = tf.gradients(loss, tvars)
98 |
99 | # This is how the model was pre-trained.
100 | (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
101 |
102 | train_op = optimizer.apply_gradients(
103 | list(zip(grads, tvars)), global_step=global_step)
104 |
105 | # Normally the global step update is done inside of `apply_gradients`.
106 | # However, neither `AdamWeightDecayOptimizer` nor `LAMBOptimizer` does this.
107 | # But if you use a different optimizer, you should probably take this line
108 | # out.
109 | new_global_step = global_step + 1
110 | train_op = tf.group(train_op, [global_step.assign(new_global_step)])
111 | return train_op
112 |
113 |
114 | class AdamWeightDecayOptimizer(tf.train.Optimizer):
115 | """A basic Adam optimizer that includes "correct" L2 weight decay."""
116 |
117 | def __init__(self,
118 | learning_rate,
119 | weight_decay_rate=0.0,
120 | beta_1=0.9,
121 | beta_2=0.999,
122 | epsilon=1e-6,
123 | exclude_from_weight_decay=None,
124 | name="AdamWeightDecayOptimizer"):
125 |     """Constructs an AdamWeightDecayOptimizer."""
126 | super(AdamWeightDecayOptimizer, self).__init__(False, name)
127 |
128 | self.learning_rate = learning_rate
129 | self.weight_decay_rate = weight_decay_rate
130 | self.beta_1 = beta_1
131 | self.beta_2 = beta_2
132 | self.epsilon = epsilon
133 | self.exclude_from_weight_decay = exclude_from_weight_decay
134 |
135 | def apply_gradients(self, grads_and_vars, global_step=None, name=None):
136 | """See base class."""
137 | assignments = []
138 | for (grad, param) in grads_and_vars:
139 | if grad is None or param is None:
140 | continue
141 |
142 | param_name = self._get_variable_name(param.name)
143 |
144 | m = tf.get_variable(
145 | name=six.ensure_str(param_name) + "/adam_m",
146 | shape=param.shape.as_list(),
147 | dtype=tf.float32,
148 | trainable=False,
149 | initializer=tf.zeros_initializer())
150 | v = tf.get_variable(
151 | name=six.ensure_str(param_name) + "/adam_v",
152 | shape=param.shape.as_list(),
153 | dtype=tf.float32,
154 | trainable=False,
155 | initializer=tf.zeros_initializer())
156 |
157 | # Standard Adam update.
158 | next_m = (
159 | tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
160 | next_v = (
161 | tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
162 | tf.square(grad)))
163 |
164 | update = next_m / (tf.sqrt(next_v) + self.epsilon)
165 |
166 | # Just adding the square of the weights to the loss function is *not*
167 | # the correct way of using L2 regularization/weight decay with Adam,
168 | # since that will interact with the m and v parameters in strange ways.
169 | #
170 |       # Instead we want to decay the weights in a manner that doesn't interact
171 | # with the m/v parameters. This is equivalent to adding the square
172 | # of the weights to the loss with plain (non-momentum) SGD.
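      # Concretely, the decoupled update applied below is, per parameter w:
      #   w <- w - learning_rate * ( m_t / (sqrt(v_t) + epsilon) + weight_decay_rate * w )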
173 | if self._do_use_weight_decay(param_name):
174 | update += self.weight_decay_rate * param
175 |
176 | update_with_lr = self.learning_rate * update
177 |
178 | next_param = param - update_with_lr
179 |
180 | assignments.extend(
181 | [param.assign(next_param),
182 | m.assign(next_m),
183 | v.assign(next_v)])
184 | return tf.group(*assignments, name=name)
185 |
186 | def _do_use_weight_decay(self, param_name):
187 | """Whether to use L2 weight decay for `param_name`."""
188 | if not self.weight_decay_rate:
189 | return False
190 | if self.exclude_from_weight_decay:
191 | for r in self.exclude_from_weight_decay:
192 | if re.search(r, param_name) is not None:
193 | return False
194 | return True
195 |
196 | def _get_variable_name(self, param_name):
197 | """Get the variable name from the tensor name."""
198 | m = re.match("^(.*):\\d+$", six.ensure_str(param_name))
199 | if m is not None:
200 | param_name = m.group(1)
201 | return param_name
202 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert_emoji/predict.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu May 30 17:12:37 2020
4 |
5 | @author: cm
6 | """
7 |
8 |
9 | import os
10 | import sys
11 | import tensorflow as tf
12 | pwd = os.path.dirname(os.path.abspath(__file__))
13 | # os.environ["CUDA_VISIBLE_DEVICES"] = '-1'
14 | sys.path.append(os.path.dirname(os.path.dirname(__file__)))
15 |
16 |
17 | from sentiment_analysis_albert_emoji.networks import NetworkAlbert
18 | from sentiment_analysis_albert_emoji.hyperparameters import Hyperparamters as hp
19 | from sentiment_analysis_albert_emoji.classifier_utils import get_feature_test
20 | from sentiment_analysis_albert_emoji.classifier_utils import get_feature_emoji_test
21 |
22 |
23 |
24 | class ModelAlbertTextCNN(object):
25 | """
26 | Load NetworkAlbert TextCNN model
27 | """
28 | def __init__(self,):
29 | self.albert, self.sess = self.load_model()
30 | @staticmethod
31 | def load_model():
32 | with tf.Graph().as_default():
33 | sess = tf.Session()
34 | with sess.as_default():
35 | albert = NetworkAlbert(is_training=False)
36 | saver = tf.train.Saver()
37 | sess.run(tf.global_variables_initializer())
38 | checkpoint_dir = os.path.abspath(os.path.join(pwd,hp.file_load_model))#small-google-gelu
39 | print (checkpoint_dir)
40 | ckpt = tf.train.get_checkpoint_state(checkpoint_dir)
41 | saver.restore(sess, ckpt.model_checkpoint_path)
42 | return albert,sess
43 |
44 | MODEL = ModelAlbertTextCNN()
45 | print('Load model finished!')
46 |
47 |
48 |
49 | def get_sa(sentence):
50 | """
51 | Prediction of the sentence's sentiment.
52 | """
53 | feature = get_feature_test(sentence)
54 | feature_emoji = get_feature_emoji_test(sentence)
55 | fd = {MODEL.albert.input_ids: [feature[0]],
56 | MODEL.albert.input_masks: [feature[1]],
57 | MODEL.albert.segment_ids:[feature[2]],
58 | MODEL.albert.input_emojis:[feature_emoji]
59 | }
60 | output = MODEL.sess.run(MODEL.albert.preds, feed_dict=fd)
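    # The network outputs class ids 0/1/2; per dict_label in hyperparameters.py these map
    # to sentiment labels -1/0/1, hence the -1 shift in the return value below.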
61 | return output[0]-1
62 |
63 |
64 |
65 | if __name__ == '__main__':
66 | ##
67 | sentence ='拍照效果:👍'
68 |     print("Sentiment analysis result:", get_sa(sentence))
69 |
70 |
71 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert_emoji/requirements.txt:
--------------------------------------------------------------------------------
1 | tensorflow==1.15
2 | sentencepiece
3 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert_emoji/tokenization.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon Nov 12 14:23:12 2020
4 |
5 | @author: cm
6 | """
7 |
8 |
9 |
10 | """Tokenization classes."""
11 |
12 | from __future__ import absolute_import
13 | from __future__ import division
14 | from __future__ import print_function
15 |
16 | import collections
17 | import re
18 | import unicodedata
19 | import six
20 | from six.moves import range
21 | import tensorflow.compat.v1 as tf
22 | import sentencepiece as spm
23 |
24 |
25 | SPIECE_UNDERLINE = u"▁".encode("utf-8")
26 |
27 |
28 | def validate_case_matches_checkpoint(do_lower_case, init_checkpoint):
29 | """Checks whether the casing config is consistent with the checkpoint name."""
30 |
31 | # The casing has to be passed in by the user and there is no explicit check
32 | # as to whether it matches the checkpoint. The casing information probably
33 | # should have been stored in the bert_config.json file, but it's not, so
34 | # we have to heuristically detect it to validate.
35 |
36 | if not init_checkpoint:
37 | return
38 |
39 | m = re.match("^.*?([A-Za-z0-9_-]+)/bert_model.ckpt",
40 | six.ensure_str(init_checkpoint))
41 | if m is None:
42 | return
43 |
44 | model_name = m.group(1)
45 |
46 | lower_models = [
47 | "uncased_L-24_H-1024_A-16", "uncased_L-12_H-768_A-12",
48 | "multilingual_L-12_H-768_A-12", "chinese_L-12_H-768_A-12"
49 | ]
50 |
51 | cased_models = [
52 | "cased_L-12_H-768_A-12", "cased_L-24_H-1024_A-16",
53 | "multi_cased_L-12_H-768_A-12"
54 | ]
55 |
56 | is_bad_config = False
57 | if model_name in lower_models and not do_lower_case:
58 | is_bad_config = True
59 | actual_flag = "False"
60 | case_name = "lowercased"
61 | opposite_flag = "True"
62 |
63 | if model_name in cased_models and do_lower_case:
64 | is_bad_config = True
65 | actual_flag = "True"
66 | case_name = "cased"
67 | opposite_flag = "False"
68 |
69 | if is_bad_config:
70 | raise ValueError(
71 | "You passed in `--do_lower_case=%s` with `--init_checkpoint=%s`. "
72 | "However, `%s` seems to be a %s model, so you "
73 | "should pass in `--do_lower_case=%s` so that the fine-tuning matches "
74 |         "how the model was pre-trained. If this error is wrong, please "
75 | "just comment out this check." % (actual_flag, init_checkpoint,
76 | model_name, case_name, opposite_flag))
77 |
78 |
79 | def preprocess_text(inputs, remove_space=True, lower=False):
80 |   """Preprocesses data by removing extra spaces and normalizing it."""
81 | outputs = inputs
82 | if remove_space:
83 | outputs = " ".join(inputs.strip().split())
84 |
85 | if six.PY2 and isinstance(outputs, str):
86 | try:
87 | outputs = six.ensure_text(outputs, "utf-8")
88 | except UnicodeDecodeError:
89 | outputs = six.ensure_text(outputs, "latin-1")
90 |
91 | outputs = unicodedata.normalize("NFKD", outputs)
92 | outputs = "".join([c for c in outputs if not unicodedata.combining(c)])
93 | if lower:
94 | outputs = outputs.lower()
95 |
96 | return outputs
97 |
98 |
99 | def encode_pieces(sp_model, text, return_unicode=True, sample=False):
100 | """turn sentences into word pieces."""
101 |
102 | if six.PY2 and isinstance(text, six.text_type):
103 | text = six.ensure_binary(text, "utf-8")
104 |
105 | if not sample:
106 | pieces = sp_model.EncodeAsPieces(text)
107 | else:
108 | pieces = sp_model.SampleEncodeAsPieces(text, 64, 0.1)
109 | new_pieces = []
110 | for piece in pieces:
111 | piece = printable_text(piece)
112 | if len(piece) > 1 and piece[-1] == "," and piece[-2].isdigit():
113 | cur_pieces = sp_model.EncodeAsPieces(
114 | six.ensure_binary(piece[:-1]).replace(SPIECE_UNDERLINE, b""))
115 | if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:
116 | if len(cur_pieces[0]) == 1:
117 | cur_pieces = cur_pieces[1:]
118 | else:
119 | cur_pieces[0] = cur_pieces[0][1:]
120 | cur_pieces.append(piece[-1])
121 | new_pieces.extend(cur_pieces)
122 | else:
123 | new_pieces.append(piece)
124 |
125 | # note(zhiliny): convert back to unicode for py2
126 | if six.PY2 and return_unicode:
127 | ret_pieces = []
128 | for piece in new_pieces:
129 | if isinstance(piece, str):
130 | piece = six.ensure_text(piece, "utf-8")
131 | ret_pieces.append(piece)
132 | new_pieces = ret_pieces
133 |
134 | return new_pieces
135 |
136 |
137 | def encode_ids(sp_model, text, sample=False):
138 | pieces = encode_pieces(sp_model, text, return_unicode=False, sample=sample)
139 | ids = [sp_model.PieceToId(piece) for piece in pieces]
140 | return ids
141 |
142 |
143 | def convert_to_unicode(text):
144 | """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
145 | if six.PY3:
146 | if isinstance(text, str):
147 | return text
148 | elif isinstance(text, bytes):
149 | return six.ensure_text(text, "utf-8", "ignore")
150 | else:
151 | raise ValueError("Unsupported string type: %s" % (type(text)))
152 | elif six.PY2:
153 | if isinstance(text, str):
154 | return six.ensure_text(text, "utf-8", "ignore")
155 | elif isinstance(text, six.text_type):
156 | return text
157 | else:
158 | raise ValueError("Unsupported string type: %s" % (type(text)))
159 | else:
160 | raise ValueError("Not running on Python2 or Python 3?")
161 |
162 |
163 | def printable_text(text):
164 | """Returns text encoded in a way suitable for print or `tf.logging`."""
165 |
166 | # These functions want `str` for both Python2 and Python3, but in one case
167 | # it's a Unicode string and in the other it's a byte string.
168 | if six.PY3:
169 | if isinstance(text, str):
170 | return text
171 | elif isinstance(text, bytes):
172 | return six.ensure_text(text, "utf-8", "ignore")
173 | else:
174 | raise ValueError("Unsupported string type: %s" % (type(text)))
175 | elif six.PY2:
176 | if isinstance(text, str):
177 | return text
178 | elif isinstance(text, six.text_type):
179 | return six.ensure_binary(text, "utf-8")
180 | else:
181 | raise ValueError("Unsupported string type: %s" % (type(text)))
182 | else:
183 | raise ValueError("Not running on Python2 or Python 3?")
184 |
185 |
186 | def load_vocab(vocab_file):
187 | """Loads a vocabulary file into a dictionary."""
188 | vocab = collections.OrderedDict()
189 | with tf.gfile.GFile(vocab_file, "r") as reader:
190 | while True:
191 | token = convert_to_unicode(reader.readline())
192 | if not token:
193 | break
194 | token = token.strip()#.split()[0]
195 | if token not in vocab:
196 | vocab[token] = len(vocab)
197 | return vocab
198 |
199 |
200 | def convert_by_vocab(vocab, items):
201 | """Converts a sequence of [tokens|ids] using the vocab."""
202 | output = []
203 | for item in items:
204 | output.append(vocab[item])
205 | return output
206 |
207 |
208 | def convert_tokens_to_ids(vocab, tokens):
209 | return convert_by_vocab(vocab, tokens)
210 |
211 |
212 | def convert_ids_to_tokens(inv_vocab, ids):
213 | return convert_by_vocab(inv_vocab, ids)
214 |
215 |
216 | def whitespace_tokenize(text):
217 | """Runs basic whitespace cleaning and splitting on a piece of text."""
218 | text = text.strip()
219 | if not text:
220 | return []
221 | tokens = text.split()
222 | return tokens
223 |
224 |
225 | class FullTokenizer(object):
226 |   """Runs end-to-end tokenization."""
227 |
228 | def __init__(self, vocab_file, do_lower_case=True, spm_model_file=None):
229 | self.vocab = None
230 | self.sp_model = None
231 | if spm_model_file:
232 | self.sp_model = spm.SentencePieceProcessor()
233 | tf.logging.info("loading sentence piece model")
234 | self.sp_model.Load(spm_model_file)
235 |       # Note(mingdachen): For the purpose of a consistent API, we are
236 | # generating a vocabulary for the sentence piece tokenizer.
237 | self.vocab = {self.sp_model.IdToPiece(i): i for i
238 | in range(self.sp_model.GetPieceSize())}
239 | else:
240 | self.vocab = load_vocab(vocab_file)
241 | self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
242 | self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
243 | self.inv_vocab = {v: k for k, v in self.vocab.items()}
244 |
245 | @classmethod
246 | def from_scratch(cls, vocab_file, do_lower_case, spm_model_file):
247 | return FullTokenizer(vocab_file, do_lower_case, spm_model_file)
248 |
249 | # @classmethod
250 | # def from_hub_module(cls, hub_module, spm_model_file):
251 | # """Get the vocab file and casing info from the Hub module."""
252 | # with tf.Graph().as_default():
253 | # albert_module = hub.Module(hub_module)
254 | # tokenization_info = albert_module(signature="tokenization_info",
255 | # as_dict=True)
256 | # with tf.Session() as sess:
257 | # vocab_file, do_lower_case = sess.run(
258 | # [tokenization_info["vocab_file"],
259 | # tokenization_info["do_lower_case"]])
260 | # return FullTokenizer(
261 | # vocab_file=vocab_file, do_lower_case=do_lower_case,
262 | # spm_model_file=spm_model_file)
263 |
264 | def tokenize(self, text):
265 | if self.sp_model:
266 | split_tokens = encode_pieces(self.sp_model, text, return_unicode=False)
267 | else:
268 | split_tokens = []
269 | for token in self.basic_tokenizer.tokenize(text):
270 | for sub_token in self.wordpiece_tokenizer.tokenize(token):
271 | split_tokens.append(sub_token)
272 |
273 | return split_tokens
274 |
275 | def convert_tokens_to_ids(self, tokens):
276 | if self.sp_model:
277 |       tf.logging.info("using sentence piece tokenizer.")
278 | return [self.sp_model.PieceToId(
279 | printable_text(token)) for token in tokens]
280 | else:
281 | return convert_by_vocab(self.vocab, tokens)
282 |
283 | def convert_ids_to_tokens(self, ids):
284 | if self.sp_model:
285 |       tf.logging.info("using sentence piece tokenizer.")
286 | return [self.sp_model.IdToPiece(id_) for id_ in ids]
287 | else:
288 | return convert_by_vocab(self.inv_vocab, ids)
289 |
290 |
291 | class BasicTokenizer(object):
292 | """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
293 |
294 | def __init__(self, do_lower_case=True):
295 | """Constructs a BasicTokenizer.
296 |
297 | Args:
298 | do_lower_case: Whether to lower case the input.
299 | """
300 | self.do_lower_case = do_lower_case
301 |
302 | def tokenize(self, text):
303 | """Tokenizes a piece of text."""
304 | text = convert_to_unicode(text)
305 | text = self._clean_text(text)
306 |
307 | # This was added on November 1st, 2018 for the multilingual and Chinese
308 | # models. This is also applied to the English models now, but it doesn't
309 | # matter since the English models were not trained on any Chinese data
310 | # and generally don't have any Chinese data in them (there are Chinese
311 | # characters in the vocabulary because Wikipedia does have some Chinese
312 |     # words in the English Wikipedia).
313 | text = self._tokenize_chinese_chars(text)
314 |
315 | orig_tokens = whitespace_tokenize(text)
316 | split_tokens = []
317 | for token in orig_tokens:
318 | if self.do_lower_case:
319 | token = token.lower()
320 | token = self._run_strip_accents(token)
321 | split_tokens.extend(self._run_split_on_punc(token))
322 |
323 | output_tokens = whitespace_tokenize(" ".join(split_tokens))
324 | return output_tokens
325 |
326 | def _run_strip_accents(self, text):
327 | """Strips accents from a piece of text."""
328 | text = unicodedata.normalize("NFD", text)
329 | output = []
330 | for char in text:
331 | cat = unicodedata.category(char)
332 | if cat == "Mn":
333 | continue
334 | output.append(char)
335 | return "".join(output)
336 |
337 | def _run_split_on_punc(self, text):
338 | """Splits punctuation on a piece of text."""
339 | chars = list(text)
340 | i = 0
341 | start_new_word = True
342 | output = []
343 | while i < len(chars):
344 | char = chars[i]
345 | if _is_punctuation(char):
346 | output.append([char])
347 | start_new_word = True
348 | else:
349 | if start_new_word:
350 | output.append([])
351 | start_new_word = False
352 | output[-1].append(char)
353 | i += 1
354 |
355 | return ["".join(x) for x in output]
356 |
357 | def _tokenize_chinese_chars(self, text):
358 | """Adds whitespace around any CJK character."""
359 | output = []
360 | for char in text:
361 | cp = ord(char)
362 | if self._is_chinese_char(cp):
363 | output.append(" ")
364 | output.append(char)
365 | output.append(" ")
366 | else:
367 | output.append(char)
368 | return "".join(output)
369 |
370 | def _is_chinese_char(self, cp):
371 | """Checks whether CP is the codepoint of a CJK character."""
372 | # This defines a "chinese character" as anything in the CJK Unicode block:
373 | # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
374 | #
375 | # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
376 | # despite its name. The modern Korean Hangul alphabet is a different block,
377 | # as is Japanese Hiragana and Katakana. Those alphabets are used to write
378 | # space-separated words, so they are not treated specially and handled
379 |     # like all of the other languages.
380 | if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
381 | (cp >= 0x3400 and cp <= 0x4DBF) or #
382 | (cp >= 0x20000 and cp <= 0x2A6DF) or #
383 | (cp >= 0x2A700 and cp <= 0x2B73F) or #
384 | (cp >= 0x2B740 and cp <= 0x2B81F) or #
385 | (cp >= 0x2B820 and cp <= 0x2CEAF) or
386 | (cp >= 0xF900 and cp <= 0xFAFF) or #
387 | (cp >= 0x2F800 and cp <= 0x2FA1F)): #
388 | return True
389 |
390 | return False
391 |
392 | def _clean_text(self, text):
393 | """Performs invalid character removal and whitespace cleanup on text."""
394 | output = []
395 | for char in text:
396 | cp = ord(char)
397 | if cp == 0 or cp == 0xfffd or _is_control(char):
398 | continue
399 | if _is_whitespace(char):
400 | output.append(" ")
401 | else:
402 | output.append(char)
403 | return "".join(output)
404 |
405 |
406 | class WordpieceTokenizer(object):
407 |   """Runs WordPiece tokenization."""
408 |
409 | def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
410 | self.vocab = vocab
411 | self.unk_token = unk_token
412 | self.max_input_chars_per_word = max_input_chars_per_word
413 |
414 | def tokenize(self, text):
415 | """Tokenizes a piece of text into its word pieces.
416 |
417 | This uses a greedy longest-match-first algorithm to perform tokenization
418 | using the given vocabulary.
419 |
420 | For example:
421 | input = "unaffable"
422 | output = ["un", "##aff", "##able"]
423 |
424 | Args:
425 | text: A single token or whitespace separated tokens. This should have
426 |         already been passed through `BasicTokenizer`.
427 |
428 | Returns:
429 | A list of wordpiece tokens.
430 | """
431 |
432 | text = convert_to_unicode(text)
433 |
434 | output_tokens = []
435 | for token in whitespace_tokenize(text):
436 | chars = list(token)
437 | if len(chars) > self.max_input_chars_per_word:
438 | output_tokens.append(self.unk_token)
439 | continue
440 |
441 | is_bad = False
442 | start = 0
443 | sub_tokens = []
444 | while start < len(chars):
445 | end = len(chars)
446 | cur_substr = None
447 | while start < end:
448 | substr = "".join(chars[start:end])
449 | if start > 0:
450 | substr = "##" + six.ensure_str(substr)
451 | if substr in self.vocab:
452 | cur_substr = substr
453 | break
454 | end -= 1
455 | if cur_substr is None:
456 | is_bad = True
457 | break
458 | sub_tokens.append(cur_substr)
459 | start = end
460 |
461 | if is_bad:
462 | output_tokens.append(self.unk_token)
463 | else:
464 | output_tokens.extend(sub_tokens)
465 | return output_tokens
466 |
467 |
468 | def _is_whitespace(char):
469 | """Checks whether `chars` is a whitespace character."""
470 | # \t, \n, and \r are technically control characters but we treat them
471 | # as whitespace since they are generally considered as such.
472 | if char == " " or char == "\t" or char == "\n" or char == "\r":
473 | return True
474 | cat = unicodedata.category(char)
475 | if cat == "Zs":
476 | return True
477 | return False
478 |
479 |
480 | def _is_control(char):
481 | """Checks whether `chars` is a control character."""
482 | # These are technically control characters but we count them as whitespace
483 | # characters.
484 | if char == "\t" or char == "\n" or char == "\r":
485 | return False
486 | cat = unicodedata.category(char)
487 | if cat in ("Cc", "Cf"):
488 | return True
489 | return False
490 |
491 |
492 | def _is_punctuation(char):
493 | """Checks whether `chars` is a punctuation character."""
494 | cp = ord(char)
495 | # We treat all non-letter/number ASCII as punctuation.
496 | # Characters such as "^", "$", and "`" are not in the Unicode
497 | # Punctuation class but we treat them as punctuation anyways, for
498 | # consistency.
499 | if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
500 | (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
501 | return True
502 | cat = unicodedata.category(char)
503 | if cat.startswith("P"):
504 | return True
505 | return False
506 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert_emoji/train.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu May 30 21:42:07 2020
4 |
5 | @author: cm
6 | """
7 |
8 |
9 |
10 | import os
11 | #os.environ["CUDA_VISIBLE_DEVICES"] = '-1'
12 | import numpy as np
13 | import tensorflow as tf
14 |
15 | from sentiment_analysis_albert_emoji.classifier_utils import get_features
16 | from sentiment_analysis_albert_emoji.classifier_utils import get_features_test
17 | from sentiment_analysis_albert_emoji.networks import NetworkAlbert
18 | from sentiment_analysis_albert_emoji.hyperparameters import Hyperparamters as hp
19 | from sentiment_analysis_albert_emoji.utils import shuffle_one,select
20 | from sentiment_analysis_albert_emoji.utils import time_now_string
21 | from sentiment_analysis_albert_emoji.classifier_utils import get_features_emoji
22 | from sentiment_analysis_albert_emoji.classifier_utils import get_features_emoji_test
23 |
24 |
25 |
26 | # Load Model
27 | pwd = os.path.dirname(os.path.abspath(__file__))
28 | MODEL = NetworkAlbert(is_training=True)
29 |
30 | # Get data features
31 | input_ids,input_masks,segment_ids,label_ids = get_features()
32 | input_ids_test,input_masks_test,segment_ids_test,label_ids_test = get_features_test()
33 | num_train_samples = len(input_ids)
34 | arr = np.arange(num_train_samples)
35 | num_batchs = int((num_train_samples - 1)/hp.batch_size) + 1
36 | print('Number of batches:',num_batchs)
37 | ids_test = np.arange(len(input_ids_test))
38 |
39 |
40 | # Get emoji features
41 | input_emojis = get_features_emoji()
42 | input_emojis_test = get_features_emoji_test()
43 |
44 |
45 | # Set up the graph
46 | saver = tf.train.Saver(max_to_keep=hp.max_to_keep)
47 | sess = tf.Session()
48 | sess.run(tf.global_variables_initializer())
49 |
50 | # Restore a previously saved checkpoint (if any)
51 | MODEL_SAVE_PATH = os.path.join(pwd, hp.file_save_model)
52 | ckpt = tf.train.get_checkpoint_state(MODEL_SAVE_PATH)
53 | if ckpt and ckpt.model_checkpoint_path:
54 | saver.restore(sess, ckpt.model_checkpoint_path)
55 | print('Restored model!')
56 |
57 |
58 | with sess.as_default():
59 | # Tensorboard writer
60 | writer = tf.summary.FileWriter(hp.logdir, sess.graph)
61 | for i in range(hp.num_train_epochs):
62 | indexs = shuffle_one(arr)
63 | for j in range(num_batchs-1):
64 | i1 = indexs[j * hp.batch_size:min((j + 1) * hp.batch_size, num_train_samples)]
65 |
66 | # Get features
67 | input_id_ = select(input_ids,i1)
68 | input_mask_ = select(input_masks,i1)
69 | segment_id_ = select(segment_ids,i1)
70 | label_id_ = select(label_ids,i1)
71 |
72 |             # Get emoji features
73 | input_emoji_ = select(input_emojis,i1)
74 |
75 | # Feed dict
76 | fd = {MODEL.input_ids: input_id_,
77 | MODEL.input_masks: input_mask_,
78 | MODEL.segment_ids:segment_id_,
79 | MODEL.label_ids:label_id_,
80 | MODEL.input_emojis:input_emoji_}
81 |
82 | # Optimizer
83 | sess.run(MODEL.optimizer, feed_dict = fd)
84 |
85 | # Tensorboard
86 | if j%hp.summary_step==0:
87 |                 summary,global_step = sess.run([MODEL.merged,MODEL.global_step], feed_dict = fd)
88 |                 writer.add_summary(summary, global_step)
89 |
90 | # Save Model
91 | if j%(num_batchs//hp.num_saved_per_epoch)==0:
92 | if not os.path.exists(os.path.join(pwd, hp.file_save_model)):
93 | os.makedirs(os.path.join(pwd, hp.file_save_model))
94 | saver.save(sess, os.path.join(pwd, hp.file_save_model, 'model_%s_%s.ckpt'%(str(i),str(j))))
95 |
96 | # Log
97 | if j % hp.print_step == 0:
98 | loss = sess.run(MODEL.loss, feed_dict = fd)
99 | print('Time:%s, Epoch:%s, Batch number:%s/%s, Loss:%s'%(time_now_string(),str(i),str(j),str(num_batchs),str(loss)))
100 | # Loss of Test data
101 | indexs_test = shuffle_one(ids_test)[:hp.batch_size_eval]
102 | input_id_test = select(input_ids_test,indexs_test)
103 | input_mask_test = select(input_masks_test,indexs_test)
104 | segment_id_test = select(segment_ids_test,indexs_test)
105 | label_id_test = select(label_ids_test,indexs_test)
106 |
107 |                 # Get emoji features
108 | input_emoji_test = select(input_emojis_test,indexs_test)
109 |
110 | fd_test = {MODEL.input_ids:input_id_test,
111 | MODEL.input_masks:input_mask_test ,
112 | MODEL.segment_ids:segment_id_test,
113 | MODEL.label_ids:label_id_test,
114 | MODEL.input_emojis:input_emoji_test}
115 | loss = sess.run(MODEL.loss, feed_dict = fd_test)
116 | print('Time:%s, Epoch:%s, Batch number:%s/%s, Loss(test):%s'%(time_now_string(),str(i),str(j),str(num_batchs),str(loss)))
117 | print('Optimization finished')
118 |
119 |
120 |
121 |
122 |
123 |
124 |
--------------------------------------------------------------------------------
/sentiment_analysis_albert_emoji/utils.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu May 29 20:40:40 2020
4 |
5 | @author: cm
6 | """
7 |
8 |
9 |
10 | import time
11 | import emoji
12 | import pandas as pd
13 | import numpy as np
14 |
15 |
16 |
17 | def time_now_string():
18 | return time.strftime("%Y-%m-%d %H:%M:%S",time.localtime( time.time() ))
19 |
20 |
21 | def get_emoji(sentence):
22 |     emoji_list = emoji.emoji_lis(sentence)  # emoji_lis() comes from older (pre-2.0) releases of the emoji package; newer releases replace it with emoji_list()
23 | return ''.join([l['emoji'] for l in emoji_list])
24 |
25 |
26 | def load_csv(file,header=0,encoding="utf-8"):
27 | return pd.read_csv(file,
28 | encoding=encoding,
29 | header=header,
30 | error_bad_lines=False)
31 |
32 |
33 | def save_csv(dataframe,file,header=True,index=None,encoding="utf-8"):
34 | return dataframe.to_csv(file,
35 | mode='w+',
36 | header=header,
37 | index=index,
38 | encoding=encoding)
39 |
40 |
41 | def save_excel(dataframe,file,header=True,sheetname='Sheet1'):
42 | return dataframe.to_excel(file,
43 | header=header,
44 | sheet_name=sheetname)
45 |
46 |
47 |
48 | def load_excel(file,header=0,sheetname=None):
49 | dfs = pd.read_excel(file,
50 | header=header,
51 | sheet_name=sheetname)
52 | sheet_names = list(dfs.keys())
53 | print('Name of first sheet:',sheet_names[0])
54 | df = dfs[sheet_names[0]]
55 | print('Load excel data finished!')
56 | return df.fillna("")
57 |
58 |
59 | def load_txt(file):
60 | with open(file,encoding='utf-8',errors='ignore') as fp:
61 | lines = fp.readlines()
62 | lines = [l.strip() for l in lines]
63 | return lines
64 |
65 |
66 | def save_txt(file,lines):
67 | lines = [l+'\n' for l in lines]
68 |     with open(file,'w+',encoding='utf-8') as fp:  # use mode 'a+' instead of 'w+' to append rather than overwrite
69 | fp.writelines(lines)
70 |
71 |
72 | def shuffle_two(a1,a2):
73 | """
74 |     Shuffle a1 and a2 with the same random permutation.
75 | """
76 | ran = np.arange(len(a1))
77 | np.random.shuffle(ran)
78 | a1_ = [a1[l] for l in ran]
79 | a2_ = [a2[l] for l in ran]
80 | return a1_, a2_
81 |
82 |
83 | def load_vocabulary(file_vocabulary_label):
84 | """
85 | Load vocabulary to dict
86 | """
87 | vocabulary = load_txt(file_vocabulary_label)
88 | dict_id2char,dict_char2id = {},{}
89 | for i,l in enumerate(vocabulary):
90 | dict_id2char[i] = str(l)
91 | dict_char2id[str(l)] = i
92 | return dict_id2char,dict_char2id
93 |
94 |
95 | def get_word_sequence(words,vocabulary,Reverse=True,k=1000):
96 |     """
97 |     Count word frequencies in `words` and return the top-k (word, count) pairs (`vocabulary` is currently unused).
98 |     """
99 | words = [l.lower() for l in words]
100 | dic = {}
101 | for word in words:
102 | if word not in dic:
103 | dic[word] = 1
104 | else:
105 | dic[word] = dic[word] + 1
106 |     return sorted(dic.items(),key = lambda x:x[1],reverse = Reverse)[:k]  # sort by frequency
107 |
108 |
109 | def select(data,ids):
110 | return [data[i] for i in ids]
111 |
112 |
113 | def shuffle_one(a1):
114 | ran = np.arange(len(a1))
115 | np.random.shuffle(ran)
116 | return np.array(a1)[ran].tolist()
117 |
118 |
119 | def cut_list(data,size):
120 | """
121 | data: a list
122 | size: the size of cut
123 | """
124 | return [data[i * size:min((i + 1) * size, len(data))] for i in range(int(len(data)-1)//size + 1)]
125 |
126 |
127 | def cut_list_by_size(data,lengths):
128 | """
129 | data: a list
130 | lengths: the different sizes of cut
131 | """
132 | list_block = []
133 | for l in lengths:
134 | list_block.append(data[:l])
135 | data = data[l:]
136 | return list_block
137 |
138 |
139 |
140 | if __name__ == "__main__":
141 | print(time_now_string())
142 |
143 | #
144 | string = '☕️🥂'
145 | print(get_emoji(string))
146 | #
147 | file_vocab_emoji ='albert_base_zh/vocab_emoji.txt'
148 | vocab = load_txt(file_vocab_emoji)
149 | print(vocab[-3:])
150 |
151 |
152 |
153 |
154 |
155 |
156 |
157 |
158 |
159 |
160 |
--------------------------------------------------------------------------------
/sentiment_analysis_bayes/README.md:
--------------------------------------------------------------------------------
1 |
2 | # Sentiment analysis bayes
3 |
4 | 1. `vocabulary_pearson_40000.txt` lists the top 40,000 words, ranked and sorted by the Pearson correlation between each word and the sentiment label.
5 | 2. Training only uses the 2,000 words with the strongest influence on sentiment (the top of this ranking).
6 |
7 |
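8 | A minimal sketch of how such a Pearson ranking could be computed (an illustrative assumption, not the script that actually produced `vocabulary_pearson_40000.txt`; the `content`/`label` column names follow the training CSV used by `prepare.py`):
9 | 
10 | ```python
11 | import jieba
12 | import numpy as np
13 | import pandas as pd
14 | 
15 | def rank_words_by_pearson(csv_file, top_k=40000):
16 |     df = pd.read_csv(csv_file)
17 |     docs = [set(jieba.lcut(str(t))) for t in df['content']]
18 |     labels = np.asarray(df['label'], dtype=float)
19 |     scores = []
20 |     for word in set().union(*docs):
21 |         x = np.array([1.0 if word in d else 0.0 for d in docs])  # word presence per document
22 |         if x.std() == 0:
23 |             continue  # word occurs in every or in no document: correlation undefined
24 |         r = np.corrcoef(x, labels)[0, 1]  # Pearson correlation with the label
25 |         scores.append((word, abs(r)))
26 |     scores.sort(key=lambda p: p[1], reverse=True)  # strongest correlation first
27 |     return [w for w, _ in scores[:top_k]]
28 | ```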
--------------------------------------------------------------------------------
/sentiment_analysis_bayes/bayes.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Thu Jun 7 13:40:26 2018
4 |
5 | @author: cm
6 | """
7 |
8 | import os
9 | import numpy as np
10 | from sentiment_analysis_bayes.utils import load_txt, save_txt
11 | from sentiment_analysis_bayes.hyperparameters import Hyperparamters as hp
12 |
13 | pwd = os.path.dirname(os.path.abspath(__file__))
14 |
15 |
16 | def classify(vec2classify, p0, p1, class_):
17 | """
18 |     Naive Bayes decision rule: pick the class with the larger log-posterior log P(c) + sum(x * log P(w|c)).
19 | """
20 | p1 = sum(vec2classify * p1) + np.log(class_)
21 | p0 = sum(vec2classify * p0) + np.log(1 - class_)
22 | if p1 > p0:
23 | return 1
24 | else:
25 | return 0
26 |
27 |
28 | def load_model():
29 | """
30 | Load bayes parameters
31 | """
32 | p0 = np.array([float(l) for l in load_txt(hp.file_p0)])
33 | p1 = np.array([float(l) for l in load_txt(hp.file_p1)])
34 | class_ = float(load_txt(hp.file_class)[0])
35 | return p0, p1, class_
36 |
37 |
38 | def save_model(p0, p1, class_, file_p0, file_p1, file_class):
39 | """
40 | Save bayes parameters
41 | """
42 | save_txt(file_p0, [str(l) for l in p0])
43 | save_txt(file_p1, [str(l) for l in p1])
44 | save_txt(file_class, [str(class_)])
45 | print('Save model finished!')
46 |
47 |
48 | if __name__ == '__main__':
49 |     # Example input (the actual prediction entry point is predict.py)
50 |     print('我爱武汉')
51 |
--------------------------------------------------------------------------------
/sentiment_analysis_bayes/data/test.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/sentiment_analysis_bayes/data/test.zip
--------------------------------------------------------------------------------
/sentiment_analysis_bayes/data/test_feature.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/sentiment_analysis_bayes/data/test_feature.zip
--------------------------------------------------------------------------------
/sentiment_analysis_bayes/data/test_label.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/sentiment_analysis_bayes/data/test_label.zip
--------------------------------------------------------------------------------
/sentiment_analysis_bayes/data/train.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/sentiment_analysis_bayes/data/train.zip
--------------------------------------------------------------------------------
/sentiment_analysis_bayes/data/train_feature.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/sentiment_analysis_bayes/data/train_feature.zip
--------------------------------------------------------------------------------
/sentiment_analysis_bayes/data/train_label.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/sentiment_analysis_bayes/data/train_label.zip
--------------------------------------------------------------------------------
/sentiment_analysis_bayes/dict/stopwords.txt:
--------------------------------------------------------------------------------
1 | '
2 | '
3 |
4 | '
5 | :
6 | )
7 | ,
8 | .
9 | :
10 | ;
11 | ]
12 | }
13 | ¢
14 | '
15 | "
16 | 、
17 | 。
18 | 〉
19 | 》
20 | 」
21 | 』
22 | 】
23 | 〕
24 | 〗
25 | 〞
26 | ︰
27 | ︱
28 | ︳
29 | ﹐
30 | 、
31 | ﹒
32 | ﹔
33 | ﹕
34 | ﹚
35 | ﹜
36 | ﹞
37 | )
38 | ,
39 | .
40 | :
41 | ;
42 | |
43 | }
44 | ︴
45 | ︶
46 | ︸
47 | ︺
48 | ︼
49 | ︾
50 | ﹀
51 | ﹂
52 | ﹄
53 | ﹏
54 | 、
55 | ~
56 | ¢
57 | 々
58 | ‖
59 | •
60 | ·
61 | ˇ
62 |
63 | ′
64 | ’
65 | ”
66 | (
67 | [
68 | {
69 |
70 | £¥
71 | '
72 | "
73 | ‵
74 | 〈
75 | 《
76 |
77 | 「
78 | 『
79 | 【
80 | 〔
81 | 〖
82 | (
83 | [
84 | {
85 | £
86 | ¥
87 | 〝
88 | ︵
89 | ︷
90 | ︹
91 | ︻
92 | ︽
93 | ︿
94 | ﹁
95 | ﹃
96 | ﹙
97 | ﹛
98 | ﹝
99 | (
100 | {
101 | “
102 | ‘
103 | …
104 | '
105 | '
106 | '
107 | \
108 | /
109 | /
110 | ×
111 | Π
112 | Δ
113 |
--------------------------------------------------------------------------------
/sentiment_analysis_bayes/hyperparameters.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon Jun 22 10:42:41 2020
4 |
5 | @author: cm
6 | """
7 |
8 | import os
9 |
10 | pwd = os.path.dirname(os.path.abspath(__file__))
11 |
12 |
13 | class Hyperparamters:
14 | # Parameters
15 | feature_size = 2000
16 |
17 | # Stopwords
18 | file_stopwords = os.path.join(pwd, 'dict/stopwords.txt')
19 | file_vocabulary = os.path.join(pwd, 'dict/vocabulary_pearson_40000.txt')
20 |
21 | # Train data
22 | file_train_data = os.path.join(pwd, 'data/train.csv')
23 | file_test_data = os.path.join(pwd, 'data/test.csv')
24 | #
25 | file_train_feature = os.path.join(pwd, 'data/train_feature.txt')
26 | file_train_label = os.path.join(pwd, 'data/train_label.txt')
27 | #
28 | file_test_feature = os.path.join(pwd, 'data/test_feature.txt')
29 | file_test_label = os.path.join(pwd, 'data/test_label.txt')
30 |
31 | # model file
32 | file_p0 = os.path.join(pwd, 'model/p0.txt')
33 | file_p1 = os.path.join(pwd, 'model/p1.txt')
34 | file_class = os.path.join(pwd, 'model/class.txt')
35 |
--------------------------------------------------------------------------------
/sentiment_analysis_bayes/load.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Jul 21 16:12:10 2020
4 |
5 | @author: cm
6 | """
7 |
8 | import numpy as np
9 |
10 | from sentiment_analysis_bayes.utils import load_txt, drop_stopwords
11 | from sentiment_analysis_bayes.hyperparameters import Hyperparamters as hp
12 |
13 | # Load data
14 | vocabulary = [str(w.replace('\n', '')) for w in load_txt(hp.file_vocabulary)][:hp.feature_size]
15 | stopwords = set(load_txt(hp.file_stopwords))
16 |
17 |
18 | def get_sentence_feature(sentence):
19 | """
20 |     Transform a sentence into a bag-of-words count vector over the vocabulary.
21 | """
22 | words = drop_stopwords(sentence, stopwords)
23 | return [int(words.count(w)) for w in vocabulary]
24 |
25 |
26 | def load_label(file_train_label):
27 | """
28 | Load data label.
29 | """
30 | return np.array([int(line) for line in load_txt(file_train_label)])
31 |
32 |
33 | def load_feature(file_train_feature):
34 | """
35 | Load data one-hot feature.
36 | """
37 |     return np.array([eval(line) for line in load_txt(file_train_feature)])  # eval() trusts the feature files written by prepare.py
38 |
39 |
40 | if __name__ == '__main__':
41 | #
42 | train_label = load_label(hp.file_train_label)
43 | print(train_label[:5])
44 | #
45 |     test_feature = load_feature(hp.file_test_feature)
46 |     print(test_feature[0][:20])
47 |
--------------------------------------------------------------------------------
/sentiment_analysis_bayes/model/class.txt:
--------------------------------------------------------------------------------
1 | 0.5214
2 |
--------------------------------------------------------------------------------
/sentiment_analysis_bayes/predict.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon May 13 10:46:08 2019
4 |
5 | @author: cm
6 | """
7 |
8 | import os
9 | import sys
10 |
11 | pwd = os.path.dirname(os.path.abspath(__file__))
12 | sys.path.append(pwd)
13 | from sentiment_analysis_bayes.bayes import classify
14 | from sentiment_analysis_bayes.bayes import load_model
15 | from sentiment_analysis_bayes.load import get_sentence_feature
16 |
17 | p0Vec, p1Vec, pClass1 = load_model()
18 |
19 |
20 | def sa(sentence):
21 | """
22 | Predict a sentence's sentiment.
23 | """
24 | vector = get_sentence_feature(sentence)
25 | point = classify(vector, p0Vec, p1Vec, pClass1)
26 | if point == 1:
27 |         return 'Positive'
28 |     elif point == 0:
29 |         return 'Negative'
30 |
31 |
32 | if __name__ == '__main__':
33 | # Test
34 | content = '我喜欢武汉'
35 | content = '我讨厌武汉'
36 | print(sa(content))
37 |
--------------------------------------------------------------------------------
/sentiment_analysis_bayes/prepare.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Jul 21 16:38:31 2020
4 |
5 | @author: cm
6 | """
7 |
8 | from sentiment_analysis_bayes.utils import drop_stopwords
9 | from sentiment_analysis_bayes.utils import load_txt, load_csv, save_txt
10 | from sentiment_analysis_bayes.hyperparameters import Hyperparamters as hp
11 |
12 | # Load data
13 | vocabulary = [str(w.replace('\n', '')) for w in load_txt(hp.file_vocabulary)][:hp.feature_size]
14 | stopwords = set(load_txt(hp.file_stopwords))
15 |
16 |
17 | def sentence2feature(sentence):
18 | """
19 |     Transform a sentence into a bag-of-words count vector over the vocabulary.
20 | """
21 | words = drop_stopwords(sentence, stopwords)
22 | return [words.count(w) for w in vocabulary]
23 |
24 |
25 | if __name__ == '__main__':
26 | # Train data
27 | df = load_csv(hp.file_train_data)
28 | contents = df['content'].tolist()
29 | labels = df['label'].tolist()
30 | train_features = [str(sentence2feature(l)) for l in contents]
31 |     save_txt(hp.file_train_feature, train_features)
32 |     train_labels = [str(l) for l in labels]
33 |     save_txt(hp.file_train_label, train_labels)
34 |
35 | # Test data
36 | df = load_csv(hp.file_test_data)
37 | contents = df['content'].tolist()
38 | labels = df['label'].tolist()
39 | test_features = [str(sentence2feature(l)) for l in contents]
40 |     save_txt(hp.file_test_feature, test_features)
41 |     test_labels = [str(l) for l in labels]
42 |     save_txt(hp.file_test_label, test_labels)
43 |
--------------------------------------------------------------------------------
/sentiment_analysis_bayes/train.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Jul 21 15:50:55 2020
4 |
5 | @author: cm
6 | """
7 |
8 | import numpy as np
9 | from sentiment_analysis_bayes.bayes import save_model
10 | from sentiment_analysis_bayes.load import load_feature, load_label
11 | from sentiment_analysis_bayes.hyperparameters import Hyperparamters as hp
12 |
13 |
14 | def main(x, y):
15 | """
16 | Training
17 | """
18 |     numTrainDocs = len(x)
19 |     numWords = len(x[0])
20 |     pAbusive = sum(y) / float(numTrainDocs)  # prior P(label = 1)
21 |     p0Num, p1Num = np.ones(numWords), np.ones(numWords)  # add-one (Laplace) smoothing
22 |     p0Denom, p1Denom = 2, 2
23 |     for i in range(numTrainDocs):
24 |         if y[i] == 1:
25 |             p1Num = p1Num + x[i]
26 |             p1Denom = p1Denom + sum(x[i])
27 |         else:
28 |             p0Num = p0Num + x[i]
29 |             p0Denom = p0Denom + sum(x[i])
30 |         if i % 100 == 0:
31 |             print(i)
32 |     p1Vect = p1Num / p1Denom  # P(word | label = 1)
33 |     p0Vect = p0Num / p0Denom  # P(word | label = 0)
34 | p1VectLog = np.zeros(len(p1Vect))
35 | for i in range(len(p1Vect)):
36 | p1VectLog[i] = np.log(p1Vect[i])
37 | p0VectLog = np.zeros(len(p0Vect))
38 | for i in range(len(p0Vect)):
39 | p0VectLog[i] = np.log(p0Vect[i])
40 | return p0VectLog, p1VectLog, pAbusive
41 |
42 |
43 | if __name__ == '__main__':
44 | # Train
45 | train_data, train_label = load_feature(hp.file_train_feature), load_label(hp.file_train_label)
46 | p0, p1, class_ = main(train_data, train_label)
47 | # Save model
48 |     f1 = hp.file_p0
49 |     f2 = hp.file_p1
50 |     f3 = hp.file_class
51 | save_model(p0, p1, class_, f1, f2, f3)
52 |
--------------------------------------------------------------------------------
/sentiment_analysis_bayes/utils.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Jul 21 15:45:17 2020
4 |
5 | @author: cm
6 | """
7 |
8 | import time
9 | import jieba
10 | import numpy as np
11 | import pandas as pd
12 |
13 |
14 | def time_now_string():
15 | """
16 | Time now.
17 | """
18 | return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time()))
19 |
20 |
21 | def cut_list(data, size):
22 | """
23 | data: a list
24 | size: the size of cut
25 | """
26 | return [data[i * size:min((i + 1) * size, len(data))] for i in range(int(len(data) - 1) // size + 1)]
27 |
28 |
29 | def load_txt(file):
30 | """
31 | load a txt.
32 | """
33 | with open(file, encoding='utf-8', errors='ignore') as fp:
34 | lines = fp.readlines()
35 | lines = [l.strip() for l in lines]
36 | return lines
37 |
38 |
39 | def save_txt(file, lines):
40 | """
41 | Save a txt.
42 | """
43 | lines = [l + '\n' for l in lines]
44 | with open(file, 'w+', encoding='utf-8') as fp:
45 | fp.writelines(lines)
46 |
47 |
48 | def drop_stopwords(sentence, stopwords):
49 | """
50 | Delete stopwords that we don't need.
51 | """
52 | return [l for l in jieba.lcut(str(sentence)) if l not in stopwords]
53 |
54 |
55 | def load_csv(file, header=0, encoding="utf-8-sig"):
56 | """
57 | Load a Data-frame from a csv.
58 | """
59 | return pd.read_csv(file,
60 | encoding=encoding,
61 | header=header,
62 | error_bad_lines=False)
63 |
64 |
65 | def save_csv(dataframe, file, header=True, index=None, encoding="utf-8-sig"):
66 | """
67 | Save a Data-frame by a csv.
68 | """
69 | return dataframe.to_csv(file,
70 | mode='w+',
71 | header=header,
72 | index=index,
73 | encoding=encoding)
74 |
75 |
76 | def shuffle_two(a1, a2):
77 | """
78 | Shuffle two lists by the same index.
79 | """
80 | ran = np.arange(len(a1))
81 | np.random.shuffle(ran)
82 | return [a1[l] for l in ran], [a2[l] for l in ran]
83 |
--------------------------------------------------------------------------------
/sentiment_analysis_dict/README.md:
--------------------------------------------------------------------------------
1 | # Introduction
2 | 1. This project is built on Python 3.6.
3 | 2. It performs Chinese text sentiment analysis as a multi-class classification task with three labels: 1, 0 and -1, standing for positive, neutral and negative sentiment respectively.
4 | 3. Feel free to contact us: www.hellonlp.com
5 | 
6 | # Usage
7 | 1. Prediction
8 | python preidict.py
9 | 
10 | # Code walkthrough on Zhihu
11 | https://zhuanlan.zhihu.com/p/142011031
12 |
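13 | # Usage from Python
14 | A minimal usage sketch (assuming the repository root is on `PYTHONPATH`; the example sentence is the one used in `preidict.py`):
15 | 
16 | ```python
17 | from sentiment_analysis_dict.preidict import predict
18 | 
19 | # returns 1 (positive), 0 (neutral) or -1 (negative)
20 | print(predict('我妈说明儿不让出去玩'))
21 | ```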
--------------------------------------------------------------------------------
/sentiment_analysis_dict/dict/insufficiently.txt:
--------------------------------------------------------------------------------
1 | 半点
2 | 不大
3 | 不丁点儿
4 | 不甚
5 | 不怎么
6 | 聊
7 | 没怎么
8 | 轻度
9 | 弱
10 | 丝毫
11 | 微
12 | 相对
13 | 不那么
14 | 不是那么
15 |
--------------------------------------------------------------------------------
/sentiment_analysis_dict/dict/inverse.txt:
--------------------------------------------------------------------------------
1 | 不
2 | 不是
3 | 没
4 | 没有
5 | 无
6 | 非
7 | 莫
8 | 弗
9 | 毋
10 | 未
11 | 否
12 | 别
13 | 无
14 | 不够
15 | 不是
16 | 不曾
17 | 未必
18 | 不要
19 | 难以
20 | 未曾
--------------------------------------------------------------------------------
/sentiment_analysis_dict/dict/ish.txt:
--------------------------------------------------------------------------------
1 | 点点滴滴
2 | 多多少少
3 | 怪
4 | 好生
5 | 还
6 | 或多或少
7 | 略
8 | 略加
9 | 略略
10 | 略微
11 | 略为
12 | 蛮
13 | 稍
14 | 稍稍
15 | 稍微
16 | 稍为
17 | 稍许
18 | 挺
19 | 未免
20 | 相当
21 | 些
22 | 些微
23 | 些小
24 | 一点
25 | 一点儿
26 | 一些
27 | 有点
28 | 有点儿
29 | 有些
30 |
--------------------------------------------------------------------------------
/sentiment_analysis_dict/dict/more.txt:
--------------------------------------------------------------------------------
1 | 大不了
2 | 多
3 | 更
4 | 更加
5 | 更进一步
6 | 更为
7 | 还
8 | 还要
9 | 较
10 | 较比
11 | 较为
12 | 进一步
13 | 那般
14 | 那么
15 | 那样
16 | 强
17 | 如斯
18 | 益
19 | 益发
20 | 尤甚
21 | 逾
22 | 愈
23 | 愈 ... 愈
24 | 愈发
25 | 愈加
26 | 愈来愈
27 | 愈益
28 | 远远
29 | 越 ... 越
30 | 越发
31 | 越加
32 | 越来越
33 | 越是
34 | 这般
35 | 这样
36 | 足
37 | 足足
38 |
--------------------------------------------------------------------------------
/sentiment_analysis_dict/dict/most.txt:
--------------------------------------------------------------------------------
1 | 百分之百
2 | 倍加
3 | 备至
4 | 不得了
5 | 不堪
6 | 不可开交
7 | 不亦乐乎
8 | 不折不扣
9 | 彻头彻尾
10 | 充分
11 | 到头
12 | 地地道道
13 | 极
14 | 极度
15 | 极其
16 | 极为
17 | 截然
18 | 尽
19 | 惊人地
20 | 绝
21 | 绝顶
22 | 绝对
23 | 绝对化
24 | 刻骨
25 | 酷
26 | 满
27 | 满贯
28 | 满心
29 | 莫大
30 | 奇
31 | 入骨
32 | 甚为
33 | 十二分
34 | 十分
35 | 十足
36 | 滔天
37 | 透
38 | 完全
39 | 完完全全
40 | 万
41 | 万般
42 | 万分
43 | 万万
44 | 无比
45 | 无度
46 | 无可估量
47 | 无以复加
48 | 无以伦比
49 | 要命
50 | 要死
51 | 已极
52 | 已甚
53 | 异常
54 | 逾常
55 | 贼
56 | 之极
57 | 之至
58 | 至极
59 | 卓绝
60 | 最为
61 | 佼佼
62 | 最
63 | 相当
64 | 非常
65 | 超级
66 |
--------------------------------------------------------------------------------
/sentiment_analysis_dict/dict/not.txt:
--------------------------------------------------------------------------------
1 | 不
2 | 没
3 | 无
4 | 非
5 | 莫
6 | 弗
7 | 勿
8 | 毋
9 | 未
10 | 否
11 | 别
12 | 無
13 | 休
14 | 难道
15 |
--------------------------------------------------------------------------------
/sentiment_analysis_dict/dict/over.txt:
--------------------------------------------------------------------------------
1 | 不为过
2 | 超
3 | 超额
4 | 超外差
5 | 超微结构
6 | 超物质
7 | 出头
8 | 多
9 | 浮
10 | 过
11 | 过度
12 | 过分
13 | 过火
14 | 过劲
15 | 过了头
16 | 过猛
17 | 过热
18 | 过甚
19 | 过头
20 | 过于
21 | 过逾
22 | 何止
23 | 何啻
24 | 开外
25 | 苦
26 | 老
27 | 偏
28 | 强
29 | 溢
30 | 忒
31 | 极端
--------------------------------------------------------------------------------
/sentiment_analysis_dict/dict/ponctuation_sentiment.txt:
--------------------------------------------------------------------------------
1 | '
2 | '
3 |
4 | '
5 | :
6 | )
7 | ,
8 | .
9 | :
10 | ;
11 | ]
12 | }
13 | ¢
14 | '
15 | "
16 | 、
17 | 。
18 | 〉
19 | 》
20 | 」
21 | 』
22 | 】
23 | 〕
24 | 〗
25 | 〞
26 | ︰
27 | ︱
28 | ︳
29 | ﹐
30 | 、
31 | ﹒
32 | ﹔
33 | ﹕
34 | ﹚
35 | ﹜
36 | ﹞
37 | )
38 | ,
39 | .
40 | :
41 | ;
42 | |
43 | }
44 | ︴
45 | ︶
46 | ︸
47 | ︺
48 | ︼
49 | ︾
50 | ﹀
51 | ﹂
52 | ﹄
53 | ﹏
54 | 、
55 | ~
56 | ¢
57 | 々
58 | ‖
59 | •
60 | ·
61 | ˇ
62 |
63 | ′
64 | ’
65 | ”
66 | (
67 | [
68 | {
69 |
70 | £¥
71 | '
72 | "
73 | ‵
74 | 〈
75 | 《
76 |
77 | 「
78 | 『
79 | 【
80 | 〔
81 | 〖
82 | (
83 | [
84 | {
85 | £
86 | ¥
87 | 〝
88 | ︵
89 | ︷
90 | ︹
91 | ︻
92 | ︽
93 | ︿
94 | ﹁
95 | ﹃
96 | ﹙
97 | ﹛
98 | ﹝
99 | (
100 | {
101 | “
102 | ‘
103 | …
104 | '
105 | '
106 | '
107 | \
108 | /
109 | /
110 | ×
111 | Π
112 | Δ
113 |
--------------------------------------------------------------------------------
/sentiment_analysis_dict/dict/very.txt:
--------------------------------------------------------------------------------
1 | 不过
2 | 不少
3 | 不胜
4 | 惨
5 | 沉
6 | 沉沉
7 | 出奇
8 | 大为
9 | 多
10 | 多多
11 | 多加
12 | 多么
13 | 分外
14 | 格外
15 | 够瞧的
16 | 够戗
17 | 好
18 | 好不
19 | 何等
20 | 很多
21 | 很是
22 | 很
23 | 坏
24 | 可
25 | 老
26 | 老大
27 | 良
28 | 颇
29 | 颇为
30 | 甚
31 | 实在
32 | 太甚
33 | 特
34 | 特别
35 | 尤
36 | 尤其
37 | 尤为
38 | 尤以
39 | 远
40 | 着实
41 | 曷
42 | 碜
--------------------------------------------------------------------------------
/sentiment_analysis_dict/hyperparameters.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon Jan 6 20:44:08 2020
4 |
5 | @author: cm
6 | """
7 |
8 | import os
9 | from sentiment_analysis_dict.utils import ToolGeneral
10 |
11 |
12 | pwd = os.path.dirname(os.path.abspath(__file__))
13 | tool = ToolGeneral()
14 |
15 |
16 | class Hyperparams:
17 | '''Hyper parameters'''
18 | # Load sentiment dictionary
19 | deny_word = tool.load_dict(os.path.join(pwd,'dict','not.txt'))
20 | posdict = tool.load_dict(os.path.join(pwd,'dict','positive.txt'))
21 | negdict = tool.load_dict(os.path.join(pwd,'dict', 'negative.txt'))
22 | pos_neg_dict = posdict|negdict
23 | # Load adverb dictionary
24 | mostdict = tool.load_dict(os.path.join(pwd,'dict','most.txt'))
25 | verydict = tool.load_dict(os.path.join(pwd,'dict','very.txt'))
26 | moredict = tool.load_dict(os.path.join(pwd,'dict','more.txt'))
27 | ishdict = tool.load_dict(os.path.join(pwd,'dict','ish.txt'))
28 | insufficientlydict = tool.load_dict(os.path.join(pwd,'dict','insufficiently.txt'))
29 | overdict = tool.load_dict(os.path.join(pwd,'dict','over.txt'))
30 | inversedict = tool.load_dict(os.path.join(pwd,'dict','inverse.txt'))
31 |
32 |
33 |
34 |
35 |
--------------------------------------------------------------------------------
/sentiment_analysis_dict/networks.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Wed Oct 25 11:48:24 2017
4 |
5 | @author: cm
6 | """
7 |
8 | import os
9 | import sys
10 | import jieba
11 | import numpy as np
12 | sys.path.append(os.path.dirname(os.path.dirname(__file__)))
13 |
14 | from sentiment_analysis_dict.utils import ToolGeneral
15 | from sentiment_analysis_dict.hyperparameters import Hyperparams as hp
16 |
17 |
18 | tool = ToolGeneral()
19 | jieba.load_userdict(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'dict','jieba_sentiment.txt'))
20 |
21 |
22 | class SentimentAnalysis():
23 | """
24 |     Sentiment analysis based on sentiment and degree-adverb dictionaries
25 | """
26 | def sentiment_score_list(self,dataset):
27 | seg_sentence = tool.sentence_split_regex(dataset)
28 | count1,count2 = [],[]
29 | for sentence in seg_sentence:
30 | words = jieba.lcut(sentence, cut_all=False)
31 | i = 0
32 | a = 0
33 | for word in words:
34 | """
35 |                 poscount  : initial score of a positive sentiment word;
36 |                 poscount2 : score after applying negation (inversion) words;
37 |                 poscount3 : final positive score (exclamation-mark weighting included)
38 | """
39 | poscount,negcount,poscount2,negcount2,poscount3,negcount3 = 0,0,0,0,0,0
40 | if word in hp.posdict :
41 | if word in ['好','真','实在'] and words[min(i+1,len(words)-1)] in hp.pos_neg_dict and words[min(i+1,len(words)-1)] != word:
42 | continue
43 | else:
44 | poscount +=1
45 | c = 0
46 |                         for w in words[a:i]: # scan the degree adverbs before this sentiment word
47 | if w in hp.mostdict:
48 | poscount *= 4
49 | elif w in hp.verydict:
50 | poscount *= 3
51 | elif w in hp.moredict:
52 | poscount *= 2
53 | elif w in hp.ishdict:
54 | poscount *= 0.5
55 | elif w in hp.insufficientlydict:
56 | poscount *= -0.3
57 | elif w in hp.overdict:
58 | poscount *= -0.5
59 | elif w in hp.inversedict:
60 | c+= 1
61 | else:
62 | poscount *= 1
63 |                         if tool.is_odd(c) == 'odd': # an odd number of negation words flips the polarity
64 | poscount *= -1.0
65 | poscount2 += poscount
66 | poscount = 0
67 | poscount3 = poscount + poscount2 + poscount3
68 | poscount2 = 0
69 | else:
70 | poscount3 = poscount + poscount2 + poscount3
71 | poscount = 0
72 | a = i+1
73 |                 elif word in hp.negdict: # negative sentiment: same procedure as above
74 | if word in ['好','真','实在'] and words[min(i+1,len(words)-1)] in hp.pos_neg_dict and words[min(i+1,len(words)-1)] != word:
75 | continue
76 | else:
77 | negcount += 1
78 | d = 0
79 | for w in words[a:i]:
80 | if w in hp.mostdict:
81 | negcount *= 4
82 | elif w in hp.verydict:
83 | negcount *= 3
84 | elif w in hp.moredict:
85 | negcount *= 2
86 | elif w in hp.ishdict:
87 | negcount *= 0.5
88 | elif w in hp.insufficientlydict:
89 | negcount *= -0.3
90 | elif w in hp.overdict:
91 | negcount *= -0.5
92 | elif w in hp.inversedict:
93 | d += 1
94 | else:
95 | negcount *= 1
96 | if tool.is_odd(d) == 'odd':
97 | negcount *= -1.0
98 | negcount2 += negcount
99 | negcount = 0
100 | negcount3 = negcount + negcount2 + negcount3
101 | negcount2 = 0
102 | else:
103 | negcount3 = negcount + negcount2 + negcount3
104 | negcount = 0
105 | a = i + 1
106 | i += 1
107 | pos_count = poscount3
108 | neg_count = negcount3
109 | count1.append([pos_count,neg_count])
110 |             if words[-1] in ['!','!']:# clause ends with an exclamation mark: double all sentiment weights
111 | count1 = [[j*2 for j in c] for c in count1]
112 |
113 | for w_im in ['但是','但']:
114 |                 if w_im in words : # multiply the weights of all words after '但是'/'但' ("but") by 5
115 | ind = words.index(w_im)
116 | count1_head = count1[:ind]
117 | count1_tail = count1[ind:]
118 | count1_tail_new = [[j*5 for j in c] for c in count1_tail]
119 | count1 = []
120 | count1.extend(count1_head)
121 | count1.extend(count1_tail_new)
122 | break
123 |             if words[-1] in ['?','?']:# clause ends with a question mark: treat it as negative
124 | count1 = [[0,2]]
125 |
126 | count2.append(count1)
127 | count1=[]
128 | return count2
129 |
130 | def sentiment_score(self,s):
131 | senti_score_list = self.sentiment_score_list(s)
132 | if senti_score_list != []:
133 | negatives=[]
134 | positives=[]
135 | for review in senti_score_list:
136 | score_array = np.array(review)
137 | AvgPos = np.sum(score_array[:,0])
138 | AvgNeg = np.sum(score_array[:,1])
139 | negatives.append(AvgNeg)
140 | positives.append(AvgPos)
141 | pos_score = np.mean(positives)
142 | neg_score = np.mean(negatives)
143 | if pos_score >=0 and neg_score<=0:
144 | pos_score = pos_score
145 | neg_score = abs(neg_score)
146 | elif pos_score >=0 and neg_score>=0:
147 | pos_score = pos_score
148 | neg_score = neg_score
149 | else:
150 | pos_score,neg_score=0,0
151 | return pos_score,neg_score
152 |
153 | def normalization_score(self,sent):
154 | score1,score0 = self.sentiment_score(sent)
155 | if score1 > 4 and score0 > 4:
156 | if score1 >= score0:
157 | _score1 = 1
158 | _score0 = score0/score1
159 | elif score1 < score0:
160 | _score0 = 1
161 | _score1 = score1/score0
162 | else :
163 | if score1 >= 4 :
164 | _score1 = 1
165 | elif score1 < 4 :
166 | _score1 = score1/4
167 | if score0 >= 4 :
168 | _score0 = 1
169 | elif score0 < 4 :
170 | _score0 = score0/4
171 | return _score1,_score0
172 |
173 |
174 | if __name__ =='__main__':
175 | sa = SentimentAnalysis()
176 | text = '我妈说明儿不让出去玩'
177 | print(sa.normalization_score(text))
178 |
--------------------------------------------------------------------------------
/sentiment_analysis_dict/preidict.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Jan 7 10:28:41 2020
4 |
5 | @author: cm
6 | """
7 |
8 |
9 | from sentiment_analysis_dict.networks import SentimentAnalysis
10 |
11 | SA = SentimentAnalysis()
12 |
13 |
14 | def predict(sent):
15 | """
16 |     1: positive
17 |     0: neutral
18 |     -1: negative
19 | """
20 | score1,score0 = SA.normalization_score(sent)
21 | if score1 == score0:
22 | result = 0
23 | elif score1 > score0:
24 | result = 1
25 | elif score1 < score0:
26 | result = -1
27 | return result
28 |
29 |
30 | if __name__ =='__main__':
31 | text = '对你不满意'
32 | text = '大美女'
33 | text = '帅哥'
34 | text = '我妈说明儿不让出去玩'
35 | print(predict(text))
36 |
--------------------------------------------------------------------------------
/sentiment_analysis_dict/utils.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon Jan 6 20:47:37 2020
4 |
5 | @author: cm
6 | """
7 |
8 | import re
9 |
10 |
11 | class ToolGeneral():
12 | """
13 | Tool function
14 | """
15 | def is_odd(self,num):
16 | if num % 2 == 0:
17 | return 'even'
18 | else:
19 | return 'odd'
20 |
21 | def load_dict(self,file):
22 | """
23 | Load dictionary
24 | """
25 | with open(file,encoding='utf-8',errors='ignore') as fp:
26 | lines = fp.readlines()
27 | lines = [l.strip() for l in lines]
28 |         print("Load data from file (%s) finished!" % file)
29 | dictionary = [word.strip() for word in lines]
30 | return set(dictionary)
31 |
32 |
33 | def sentence_split_regex(self,sentence):
34 | """
35 | Segmentation of sentence
36 | """
37 | if sentence is not None:
38 | sentence = re.sub(r"–+|—+", "-", sentence)
39 | sub_sentence = re.split(r"[。,,!!??;;\s…~~]+|\.{2,}|…+| +|_n|_t", sentence)
40 | sub_sentence = [s for s in sub_sentence if s != '']
41 | if sub_sentence != []:
42 | return sub_sentence
43 | else:
44 | return [sentence]
45 | return []
46 |
47 |
48 | if __name__ == "__main__":
49 | #
50 | tool = ToolGeneral()
51 | #
52 | s = '我今天。昨天上午,还有现在'
53 | ls = tool.sentence_split_regex(s)
54 | print(ls)
55 |
56 |
57 |
58 |
--------------------------------------------------------------------------------