├── README.md ├── imgs ├── 01.png ├── 02.png ├── HELLONLP.png └── README.md ├── sentiment_analysis_albert ├── README.md ├── __init__.py ├── albert_small_zh_google │ └── README.md ├── classifier_utils.py ├── data │ ├── README.md │ ├── sa_test.csv │ └── sa_train.csv ├── hyperparameters.py ├── image │ └── model.png ├── lamb_optimizer.py ├── logdir │ └── model_01 │ │ ├── events.out.tfevents.1592553634.DESKTOP-QC1A83I │ │ └── events.out.tfevents.1592553671.DESKTOP-QC1A83I ├── model │ ├── model_load │ │ └── README.md │ └── model_save │ │ └── README.md ├── modeling.py ├── modules.py ├── networks.py ├── optimization.py ├── predict.py ├── requirements.txt ├── tokenization.py ├── train.py └── utils.py ├── sentiment_analysis_albert_emoji ├── README.md ├── __init__.py ├── albert_base_zh │ └── README.md ├── albert_small_zh_google │ └── README.md ├── classifier_utils.py ├── data │ ├── README.md │ ├── sa_test.csv │ └── sa_train.csv ├── hyperparameters.py ├── lamb_optimizer.py ├── model │ ├── model_load │ │ └── README.md │ └── model_save │ │ └── README.md ├── modeling.py ├── modules.py ├── networks.py ├── optimization.py ├── predict.py ├── requirements.txt ├── tokenization.py ├── train.py └── utils.py ├── sentiment_analysis_bayes ├── README.md ├── bayes.py ├── data │ ├── test.zip │ ├── test_feature.zip │ ├── test_label.zip │ ├── train.zip │ ├── train_feature.zip │ └── train_label.zip ├── dict │ ├── stopwords.txt │ └── vocabulary_pearson_40000.txt ├── hyperparameters.py ├── load.py ├── model │ ├── class.txt │ ├── p0.txt │ └── p1.txt ├── predict.py ├── prepare.py ├── train.py └── utils.py └── sentiment_analysis_dict ├── README.md ├── dict ├── insufficiently.txt ├── inverse.txt ├── ish.txt ├── jieba_sentiment.txt ├── more.txt ├── most.txt ├── negative.txt ├── not.txt ├── over.txt ├── ponctuation_sentiment.txt ├── positive.txt └── very.txt ├── hyperparameters.py ├── networks.py ├── preidict.py └── utils.py /README.md: -------------------------------------------------------------------------------- 1 | # Sentiment Analysis: 情感分析 2 | 3 | [![Python](https://img.shields.io/badge/python-3.7.6-blue?logo=python&logoColor=FED643)](https://www.python.org/downloads/release/python-376/) 4 | 5 | 6 | 7 | 8 |
 9 | 
10 | ## 1. Introduction
11 | ### 1. Text classification
12 | Text classification is the most fundamental and central task in natural language processing (NLP); put differently, almost every NLP task either is a classification task or involves the notion of classification. Sentiment analysis is one of the most important branches of text classification and has a very wide range of applications.
13 | ### 2. Sentiment analysis
14 | We divide Chinese text sentiment analysis into three broad families: the first uses sentiment lexicons and sentence-structure rules; the second uses traditional machine learning, such as Bayes or SVM; the third uses deep learning, such as LSTM, CNN, LSTM+CNN or BERT+CNN.
15 | Of these, the first family requires neither manual labeling nor training, while the second and third both require large amounts of manually labeled data for supervised model training.
16 | 
17 | 
18 | 
19 | ## 2. Algorithms
20 | 
21 | **Four implementations**
22 | ```
23 | ├── sentiment-analysis
24 | └── sentiment_analysis_dict
25 | └── sentiment_analysis_bayes
26 | └── sentiment_analysis_albert
27 | └── sentiment_analysis_albert_emoji
28 | ```
29 | 
30 | ### 1. sentiment_analysis_dict
31 | A lexicon (dictionary) based method; an illustrative sketch follows at the end of this section.
32 | 
33 | ### 2. sentiment_analysis_bayes
34 | A traditional machine learning method based on **Bayes** classification.
35 | 
36 | ### 3. sentiment_analysis_albert
37 | A deep learning method that uses the **ALBERT** language model with a **TextCNN** downstream classifier.
38 | 
39 | ### 4. sentiment_analysis_albert_emoji
40 | A deep learning method that also uses the **ALBERT** language model with a **TextCNN** downstream classifier.
41 | It additionally introduces **unknown tokens** (emoji are one example) and learns their embedding vectors during fine-tuning, so that the sentiment carried by such unknown tokens can be recognized.
42 | 
43 | 
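To make the lexicon-based approach concrete, here is a minimal, illustrative sketch. It is not the actual code in `sentiment_analysis_dict`; the toy word lists, weights and the function name `score_tokens` are assumptions made only for this example, whereas the real implementation loads its lexicons from `sentiment_analysis_dict/dict/` (e.g. `positive.txt`, `negative.txt`, `not.txt`, `very.txt`).

```python
# -*- coding: utf-8 -*-
# Illustrative sketch of lexicon-based sentiment scoring with negation and
# degree handling. Toy lexicons only; not the project's actual dictionaries.

POSITIVE = {'喜欢', '好看', '流畅'}                # cf. dict/positive.txt
NEGATIVE = {'卡', '失望', '垃圾'}                  # cf. dict/negative.txt
NEGATION = {'不', '没', '别'}                      # cf. dict/not.txt, inverse.txt
DEGREE = {'很': 1.5, '非常': 2.0, '有点': 0.5}     # cf. dict/very.txt, most.txt, ish.txt


def score_tokens(tokens):
    """Score a segmented sentence: >0 positive, <0 negative, 0 neutral."""
    score, weight = 0.0, 1.0
    for w in tokens:
        if w in NEGATION:
            weight *= -1.0          # a negation word flips the polarity of what follows
        elif w in DEGREE:
            weight *= DEGREE[w]     # a degree adverb scales the polarity of what follows
        elif w in POSITIVE:
            score += weight
            weight = 1.0
        elif w in NEGATIVE:
            score -= weight
            weight = 1.0
    return score


if __name__ == '__main__':
    print(score_tokens(['紫色', '真的', '很', '好看']))   # 1.5  -> positive (label 1)
    print(score_tokens(['一点', '都', '不', '流畅']))     # -1.0 -> negative (label -1)
```

Because this family of methods relies only on hand-built dictionaries, it needs no labeled data and no training, which is exactly the trade-off noted in the introduction above.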
44 | 
45 | ## References
46 | [Dictionary-based text sentiment analysis (with code)](https://zhuanlan.zhihu.com/p/142011031)
47 | [Text classification [ALBERT+TextCNN] for Chinese sentiment analysis (with code)](https://zhuanlan.zhihu.com/p/149491055)
48 | [Chinese sentiment analysis with emoji symbols](https://zhuanlan.zhihu.com/p/338806367)
49 | 
--------------------------------------------------------------------------------
/imgs/01.png:
--------------------------------------------------------------------------------
 https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/imgs/01.png
--------------------------------------------------------------------------------
/imgs/02.png:
--------------------------------------------------------------------------------
 https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/imgs/02.png
--------------------------------------------------------------------------------
/imgs/HELLONLP.png:
--------------------------------------------------------------------------------
 https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/imgs/HELLONLP.png
--------------------------------------------------------------------------------
/imgs/README.md:
--------------------------------------------------------------------------------
 1 | # imgs
--------------------------------------------------------------------------------
/sentiment_analysis_albert/README.md:
--------------------------------------------------------------------------------
 1 | # Introduction
 2 | 1. Training and testing in this project are based on TensorFlow 1.15.0.
 3 | 2. The project does Chinese text sentiment analysis as a multi-class classification task with 3 labels: 1, 0 and -1, denoting positive, neutral and negative sentiment respectively.
 4 | 3. Feel free to contact me at www.hellonlp.com
 5 | 4. Baidu Cloud download link for albert_small_zh_google:
 6 | Link: https://pan.baidu.com/s/1RKzGJTazlZ7y12YRbAWvyA
 7 | Extraction code: wuxw
 8 | 
 9 | # Usage
10 | 1. Prepare the data
11 | The data format follows sentiment_analysis_albert/data/sa_test.csv
12 | 2. Set the parameters
13 | See hyperparameters.py and edit the values in it directly.
14 | 3. Training
15 | python train.py
16 | 4. Inference
17 | python predict.py
18 | 
19 | # Code walkthrough on Zhihu
20 | https://zhuanlan.zhihu.com/p/149491055
21 | 
22 | 
23 | 
24 | 
25 | 
26 | 
--------------------------------------------------------------------------------
/sentiment_analysis_albert/__init__.py:
--------------------------------------------------------------------------------
 1 | # coding=utf-8
 2 | # Copyright 2018 The Google AI Team Authors.
 3 | #
 4 | # Licensed under the Apache License, Version 2.0 (the "License");
 5 | # you may not use this file except in compliance with the License.
 6 | # You may obtain a copy of the License at
 7 | #
 8 | # http://www.apache.org/licenses/LICENSE-2.0
 9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License. 
15 | -------------------------------------------------------------------------------- /sentiment_analysis_albert/albert_small_zh_google/README.md: -------------------------------------------------------------------------------- 1 | # 语言模型 2 | 3 | ## ALBERT small Chinese 4 | ``` 5 | albert_small_zh_google/albert_config.json 6 | albert_small_zh_google/albert_model.ckpt.data-00000-of-00001 7 | albert_small_zh_google/albert_model.ckpt.index 8 | albert_small_zh_google/albert_model.ckpt.meta 9 | albert_small_zh_google/checkpoint 10 | albert_small_zh_google/vocab_chinese.txt 11 | ``` 12 | 13 | ## 下载路径 14 | 链接:https://pan.baidu.com/s/1rfnQewbUXUl5c4jvXEQFFg?pwd=igey 15 | 提取码:igey 16 | -------------------------------------------------------------------------------- /sentiment_analysis_albert/classifier_utils.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Nov 12 14:23:12 2018 4 | 5 | @author: cm 6 | """ 7 | 8 | 9 | import os 10 | import csv 11 | import random 12 | import collections 13 | import tensorflow_hub as hub 14 | import tensorflow.compat.v1 as tf 15 | from tensorflow.contrib import tpu as contrib_tpu 16 | from tensorflow.contrib import data as contrib_data 17 | from tensorflow.contrib import metrics as contrib_metrics 18 | 19 | from sentiment_analysis_albert import modeling 20 | from sentiment_analysis_albert import optimization 21 | from sentiment_analysis_albert import tokenization 22 | from sentiment_analysis_albert.utils import load_csv 23 | from sentiment_analysis_albert.hyperparameters import Hyperparamters as hp 24 | 25 | 26 | def index2label(index): 27 | return hp.dict_label[str(index)] 28 | 29 | 30 | def read_csv(input_file): 31 | """Reads a tab separated value file.""" 32 | df = load_csv(input_file,header=0) 33 | jobcontent = df['content'].tolist() 34 | jlabel = df['label'].tolist() 35 | lines = [[str(jlabel[i]),str(jobcontent[i])] for i in range(len(jobcontent))] 36 | print('Read csv finished!(1)') 37 | lines2 = [ [list(hp.dict_label.keys())[list(hp.dict_label.values()).index( l[0])], l[1]] for l in lines if type(l[1])==str] 38 | random.shuffle(lines2) 39 | return lines2 40 | 41 | 42 | class InputExample(object): 43 | """A single training/test example for simple sequence classification.""" 44 | 45 | def __init__(self, guid, text_a, text_b=None, label=None): 46 | """Constructs a InputExample. 47 | 48 | Args: 49 | guid: Unique id for the example. 50 | text_a: string. The untokenized text of the first sequence. For single 51 | sequence tasks, only this sequence must be specified. 52 | text_b: (Optional) string. The untokenized text of the second sequence. 53 | Only must be specified for sequence pair tasks. 54 | label: (Optional) string. The label of the example. This should be 55 | specified for train and dev examples, but not for test examples. 56 | """ 57 | self.guid = guid 58 | self.text_a = text_a 59 | self.text_b = text_b 60 | self.label = label 61 | 62 | 63 | class PaddingInputExample(object): 64 | """Fake example so the num input examples is a multiple of the batch size. 65 | 66 | When running eval/predict on the TPU, we need to pad the number of examples 67 | to be a multiple of the batch size, because the TPU requires a fixed batch 68 | size. The alternative is to drop the last batch, which is bad because it means 69 | the entire output data won't be generated. 70 | 71 | We use this class instead of `None` because treating `None` as padding 72 | battches could cause silent errors. 
73 | """ 74 | 75 | 76 | class InputFeatures(object): 77 | """A single set of features of data.""" 78 | 79 | def __init__(self, 80 | input_ids, 81 | input_mask, 82 | segment_ids, 83 | label_id, 84 | guid=None, 85 | example_id=None, 86 | is_real_example=True): 87 | self.input_ids = input_ids 88 | self.input_mask = input_mask 89 | self.segment_ids = segment_ids 90 | self.label_id = label_id 91 | self.example_id = example_id 92 | self.guid = guid 93 | self.is_real_example = is_real_example 94 | 95 | 96 | class DataProcessor(object): 97 | """Base class for data converters for sequence classification data sets.""" 98 | 99 | def __init__(self, use_spm, do_lower_case): 100 | super(DataProcessor, self).__init__() 101 | self.use_spm = use_spm 102 | self.do_lower_case = do_lower_case 103 | 104 | def get_train_examples(self, data_dir): 105 | """Gets a collection of `InputExample`s for the train set.""" 106 | raise NotImplementedError() 107 | 108 | def get_dev_examples(self, data_dir): 109 | """Gets a collection of `InputExample`s for the dev set.""" 110 | raise NotImplementedError() 111 | 112 | def get_test_examples(self, data_dir): 113 | """Gets a collection of `InputExample`s for prediction.""" 114 | raise NotImplementedError() 115 | 116 | def get_labels(self): 117 | """Gets the list of labels for this data set.""" 118 | raise NotImplementedError() 119 | 120 | @classmethod 121 | def _read_tsv(cls, input_file, quotechar=None): 122 | """Reads a tab separated value file.""" 123 | with tf.gfile.Open(input_file, "r") as f: 124 | reader = csv.reader(f, delimiter="\t", quotechar=quotechar) 125 | lines = [] 126 | for line in reader: 127 | lines.append(line) 128 | return lines 129 | 130 | @classmethod 131 | def _read_csv(cls,input_file):# 项目 132 | """Reads a tab separated value file.""" 133 | df = load_csv(input_file,header=0) 134 | jobcontent = df['content'].tolist() 135 | jlabel = df['label'].tolist() 136 | lines = [[str(jlabel[i]),str(jobcontent[i])] for i in range(len(jobcontent))] 137 | lines2 = [ [list(hp.dict_label.keys())[list(hp.dict_label.values()).index( l[0])], l[1]] for l in lines if type(l[1])==str] 138 | random.shuffle(lines2) 139 | return lines2 140 | 141 | class ClassifyProcessor(DataProcessor): 142 | """Processor for the MRPC data set (GLUE version).""" 143 | 144 | def __init__(self): 145 | self.labels = set() 146 | 147 | def get_train_examples(self, data_dir): 148 | """See base class.""" 149 | print('*'*30) 150 | return self._create_examples( 151 | self._read_csv(os.path.join(data_dir, hp.train_data)), "train") 152 | 153 | def get_dev_examples(self, data_dir): 154 | """See base class.""" 155 | return self._create_examples( 156 | self._read_csv(os.path.join(data_dir, hp.test_data)), "dev") 157 | 158 | def get_test_examples(self, data_dir): 159 | """See base class.""" 160 | return self._create_examples( 161 | self._read_csv(os.path.join(data_dir, hp.test_data)), "test") 162 | 163 | def get_labels(self): 164 | """See base class.""" 165 | return ['0','1','2'] 166 | 167 | def _create_examples(self, lines, set_type): 168 | """Creates examples for the training and dev sets.""" 169 | examples = [] 170 | for (i, line) in enumerate(lines): 171 | guid = "%s-%s" % (set_type, i) 172 | text_a = tokenization.convert_to_unicode(line[1]) 173 | label = tokenization.convert_to_unicode(line[0]) 174 | self.labels.add(label) 175 | # print(self.labels) 176 | examples.append( 177 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) 178 | # print(examples) 179 | return examples 180 | 181 | 182 | def 
convert_single_example(ex_index, example, label_list, max_seq_length, 183 | tokenizer, task_name): 184 | """Converts a single `InputExample` into a single `InputFeatures`.""" 185 | 186 | if isinstance(example, PaddingInputExample): 187 | return InputFeatures( 188 | input_ids=[0] * max_seq_length, 189 | input_mask=[0] * max_seq_length, 190 | segment_ids=[0] * max_seq_length, 191 | label_id=0, 192 | is_real_example=False) 193 | 194 | if task_name != "sts-b": 195 | label_map = {} 196 | for (i, label) in enumerate(label_list): 197 | label_map[label] = i 198 | 199 | tokens_a = tokenizer.tokenize(example.text_a) 200 | tokens_b = None 201 | if example.text_b: 202 | tokens_b = tokenizer.tokenize(example.text_b) 203 | 204 | if tokens_b: 205 | # Modifies `tokens_a` and `tokens_b` in place so that the total 206 | # length is less than the specified length. 207 | # Account for [CLS], [SEP], [SEP] with "- 3" 208 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 209 | else: 210 | # Account for [CLS] and [SEP] with "- 2" 211 | if len(tokens_a) > max_seq_length - 2: 212 | tokens_a = tokens_a[0:(max_seq_length - 2)] 213 | 214 | # The convention in ALBERT is: 215 | # (a) For sequence pairs: 216 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 217 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 218 | # (b) For single sequences: 219 | # tokens: [CLS] the dog is hairy . [SEP] 220 | # type_ids: 0 0 0 0 0 0 0 221 | # 222 | # Where "type_ids" are used to indicate whether this is the first 223 | # sequence or the second sequence. The embedding vectors for `type=0` and 224 | # `type=1` were learned during pre-training and are added to the 225 | # embedding vector (and position vector). This is not *strictly* necessary 226 | # since the [SEP] token unambiguously separates the sequences, but it makes 227 | # it easier for the model to learn the concept of sequences. 228 | # 229 | # For classification tasks, the first vector (corresponding to [CLS]) is 230 | # used as the "sentence vector". Note that this only makes sense because 231 | # the entire model is fine-tuned. 232 | tokens = [] 233 | segment_ids = [] 234 | tokens.append("[CLS]") 235 | segment_ids.append(0) 236 | for token in tokens_a: 237 | tokens.append(token) 238 | segment_ids.append(0) 239 | tokens.append("[SEP]") 240 | segment_ids.append(0) 241 | 242 | if tokens_b: 243 | for token in tokens_b: 244 | tokens.append(token) 245 | segment_ids.append(1) 246 | tokens.append("[SEP]") 247 | segment_ids.append(1) 248 | 249 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 250 | 251 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 252 | # tokens are attended to. 253 | input_mask = [1] * len(input_ids) 254 | 255 | # Zero-pad up to the sequence length. 
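  # For example, with the character-level Chinese vocab, text_a = "天天向上" and
  # max_seq_length = 8: tokens = ["[CLS]", "天", "天", "向", "上", "[SEP]"], so
  # input_ids has length 6 and two zeros are appended below, giving
  # input_mask = [1, 1, 1, 1, 1, 1, 0, 0] and segment_ids = [0] * 8.
  # (Illustrative numbers only; the project's default sequence_length is 60.)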
256 | while len(input_ids) < max_seq_length: 257 | input_ids.append(0) 258 | input_mask.append(0) 259 | segment_ids.append(0) 260 | 261 | assert len(input_ids) == max_seq_length 262 | assert len(input_mask) == max_seq_length 263 | assert len(segment_ids) == max_seq_length 264 | 265 | if task_name != "sts-b": 266 | label_id = label_map[example.label] 267 | else: 268 | label_id = example.label 269 | 270 | feature = InputFeatures( 271 | input_ids=input_ids, 272 | input_mask=input_mask, 273 | segment_ids=segment_ids, 274 | label_id=label_id, 275 | is_real_example=True) 276 | return feature 277 | 278 | 279 | def file_based_convert_examples_to_features( 280 | examples, label_list, max_seq_length, tokenizer, output_file, task_name): 281 | """Convert a set of `InputExample`s to a TFRecord file.""" 282 | 283 | writer = tf.python_io.TFRecordWriter(output_file) 284 | 285 | for (ex_index, example) in enumerate(examples): 286 | if ex_index % 10000 == 0: 287 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 288 | 289 | feature = convert_single_example(ex_index, example, label_list, 290 | max_seq_length, tokenizer, task_name) 291 | 292 | def create_int_feature(values): 293 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) 294 | return f 295 | 296 | def create_float_feature(values): 297 | f = tf.train.Feature(float_list=tf.train.FloatList(value=list(values))) 298 | return f 299 | 300 | features = collections.OrderedDict() 301 | features["input_ids"] = create_int_feature(feature.input_ids) 302 | features["input_mask"] = create_int_feature(feature.input_mask) 303 | features["segment_ids"] = create_int_feature(feature.segment_ids) 304 | features["label_ids"] = create_float_feature([feature.label_id])\ 305 | if task_name == "sts-b" else create_int_feature([feature.label_id]) 306 | features["is_real_example"] = create_int_feature( 307 | [int(feature.is_real_example)]) 308 | 309 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 310 | writer.write(tf_example.SerializeToString()) 311 | writer.close() 312 | 313 | 314 | def file_based_input_fn_builder(input_file, seq_length, is_training, 315 | drop_remainder, task_name, use_tpu, bsz, 316 | multiple=1): 317 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 318 | labeltype = tf.float32 if task_name == "sts-b" else tf.int64 319 | 320 | name_to_features = { 321 | "input_ids": tf.FixedLenFeature([seq_length * multiple], tf.int64), 322 | "input_mask": tf.FixedLenFeature([seq_length * multiple], tf.int64), 323 | "segment_ids": tf.FixedLenFeature([seq_length * multiple], tf.int64), 324 | "label_ids": tf.FixedLenFeature([], labeltype), 325 | "is_real_example": tf.FixedLenFeature([], tf.int64), 326 | } 327 | 328 | def _decode_record(record, name_to_features): 329 | """Decodes a record to a TensorFlow example.""" 330 | example = tf.parse_single_example(record, name_to_features) 331 | 332 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32. 333 | # So cast all int64 to int32. 334 | for name in list(example.keys()): 335 | t = example[name] 336 | if t.dtype == tf.int64: 337 | t = tf.to_int32(t) 338 | example[name] = t 339 | 340 | return example 341 | 342 | def input_fn(params): 343 | """The actual input function.""" 344 | if use_tpu: 345 | batch_size = params["batch_size"] 346 | else: 347 | batch_size = bsz 348 | 349 | # For training, we want a lot of parallel reading and shuffling. 350 | # For eval, we want no shuffling and parallel reading doesn't matter. 
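    # Concretely: during training the TFRecord dataset is repeated indefinitely
    # and shuffled with a small reservoir (buffer_size=100); for eval/predict it
    # is read once in order, and map_and_batch then decodes and batches the
    # records, optionally dropping the final partial batch (drop_remainder).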
351 | d = tf.data.TFRecordDataset(input_file) 352 | if is_training: 353 | d = d.repeat() 354 | d = d.shuffle(buffer_size=100) 355 | 356 | d = d.apply( 357 | contrib_data.map_and_batch( 358 | lambda record: _decode_record(record, name_to_features), 359 | batch_size=batch_size, 360 | drop_remainder=drop_remainder)) 361 | 362 | return d 363 | 364 | return input_fn 365 | 366 | 367 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 368 | """Truncates a sequence pair in place to the maximum length.""" 369 | 370 | # This is a simple heuristic which will always truncate the longer sequence 371 | # one token at a time. This makes more sense than truncating an equal percent 372 | # of tokens from each, since if one sequence is very short then each token 373 | # that's truncated likely contains more information than a longer sequence. 374 | while True: 375 | total_length = len(tokens_a) + len(tokens_b) 376 | if total_length <= max_length: 377 | break 378 | if len(tokens_a) > len(tokens_b): 379 | tokens_a.pop() 380 | else: 381 | tokens_b.pop() 382 | 383 | 384 | def _create_model_from_hub(hub_module, is_training, input_ids, input_mask, 385 | segment_ids): 386 | """Creates an ALBERT model from TF-Hub.""" 387 | tags = set() 388 | if is_training: 389 | tags.add("train") 390 | albert_module = hub.Module(hub_module, tags=tags, trainable=True) 391 | albert_inputs = dict( 392 | input_ids=input_ids, 393 | input_mask=input_mask, 394 | segment_ids=segment_ids) 395 | albert_outputs = albert_module( 396 | inputs=albert_inputs, 397 | signature="tokens", 398 | as_dict=True) 399 | output_layer = albert_outputs["pooled_output"] 400 | return output_layer 401 | 402 | 403 | def _create_model_from_scratch(albert_config, is_training, input_ids, 404 | input_mask, segment_ids, use_one_hot_embeddings): 405 | """Creates an ALBERT model from scratch (as opposed to hub).""" 406 | model = modeling.AlbertModel( 407 | config=albert_config, 408 | is_training=is_training, 409 | input_ids=input_ids, 410 | input_mask=input_mask, 411 | token_type_ids=segment_ids, 412 | use_one_hot_embeddings=use_one_hot_embeddings) 413 | output_layer = model.get_pooled_output() 414 | return output_layer 415 | 416 | 417 | def create_model(albert_config, is_training, input_ids, input_mask, segment_ids, 418 | labels, num_labels, use_one_hot_embeddings, task_name, 419 | hub_module): 420 | """Creates a classification model.""" 421 | if hub_module: 422 | tf.logging.info("creating model from hub_module: %s", hub_module) 423 | output_layer = _create_model_from_hub(hub_module, is_training, input_ids, 424 | input_mask, segment_ids) 425 | else: 426 | tf.logging.info("creating model from albert_config") 427 | output_layer = _create_model_from_scratch(albert_config, is_training, 428 | input_ids, input_mask, 429 | segment_ids, 430 | use_one_hot_embeddings) 431 | 432 | hidden_size = output_layer.shape[-1].value 433 | 434 | output_weights = tf.get_variable( 435 | "output_weights", [num_labels, hidden_size], 436 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 437 | 438 | output_bias = tf.get_variable( 439 | "output_bias", [num_labels], initializer=tf.zeros_initializer()) 440 | 441 | with tf.variable_scope("loss"): 442 | if is_training: 443 | # I.e., 0.1 dropout 444 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 445 | 446 | logits = tf.matmul(output_layer, output_weights, transpose_b=True) 447 | logits = tf.nn.bias_add(logits, output_bias) 448 | if task_name != "sts-b": 449 | probabilities = tf.nn.softmax(logits, axis=-1) 450 | predictions 
= tf.argmax(probabilities, axis=-1, output_type=tf.int32) 451 | log_probs = tf.nn.log_softmax(logits, axis=-1) 452 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) 453 | 454 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 455 | else: 456 | probabilities = logits 457 | logits = tf.squeeze(logits, [-1]) 458 | predictions = logits 459 | per_example_loss = tf.square(logits - labels) 460 | loss = tf.reduce_mean(per_example_loss) 461 | 462 | return (loss, per_example_loss, probabilities, logits, predictions) 463 | 464 | 465 | def model_fn_builder(albert_config, num_labels, init_checkpoint, learning_rate, 466 | num_train_steps, num_warmup_steps, use_tpu, 467 | use_one_hot_embeddings, task_name, hub_module=None, 468 | optimizer="adamw"): 469 | """Returns `model_fn` closure for TPUEstimator.""" 470 | 471 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument 472 | """The `model_fn` for TPUEstimator.""" 473 | 474 | tf.logging.info("*** Features ***") 475 | for name in sorted(features.keys()): 476 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) 477 | 478 | input_ids = features["input_ids"] 479 | input_mask = features["input_mask"] 480 | segment_ids = features["segment_ids"] 481 | label_ids = features["label_ids"] 482 | is_real_example = None 483 | if "is_real_example" in features: 484 | is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32) 485 | else: 486 | is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32) 487 | 488 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 489 | 490 | (total_loss, per_example_loss, probabilities, logits, predictions) = \ 491 | create_model(albert_config, is_training, input_ids, input_mask, 492 | segment_ids, label_ids, num_labels, 493 | use_one_hot_embeddings, task_name, hub_module) 494 | 495 | tvars = tf.trainable_variables() 496 | initialized_variable_names = {} 497 | scaffold_fn = None 498 | if init_checkpoint: 499 | (assignment_map, initialized_variable_names 500 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) 501 | if use_tpu: 502 | 503 | def tpu_scaffold(): 504 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 505 | return tf.train.Scaffold() 506 | 507 | scaffold_fn = tpu_scaffold 508 | else: 509 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 510 | 511 | tf.logging.info("**** Trainable Variables ****") 512 | for var in tvars: 513 | init_string = "" 514 | if var.name in initialized_variable_names: 515 | init_string = ", *INIT_FROM_CKPT*" 516 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 517 | init_string) 518 | 519 | output_spec = None 520 | if mode == tf.estimator.ModeKeys.TRAIN: 521 | 522 | train_op = optimization.create_optimizer( 523 | total_loss, learning_rate, num_train_steps, num_warmup_steps, 524 | use_tpu, optimizer) 525 | 526 | output_spec = contrib_tpu.TPUEstimatorSpec( 527 | mode=mode, 528 | loss=total_loss, 529 | train_op=train_op, 530 | scaffold_fn=scaffold_fn) 531 | elif mode == tf.estimator.ModeKeys.EVAL: 532 | if task_name not in ["sts-b", "cola"]: 533 | def metric_fn(per_example_loss, label_ids, logits, is_real_example): 534 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 535 | accuracy = tf.metrics.accuracy( 536 | labels=label_ids, predictions=predictions, 537 | weights=is_real_example) 538 | loss = tf.metrics.mean( 539 | values=per_example_loss, weights=is_real_example) 540 | return { 541 | "eval_accuracy": 
accuracy, 542 | "eval_loss": loss, 543 | } 544 | elif task_name == "sts-b": 545 | def metric_fn(per_example_loss, label_ids, logits, is_real_example): 546 | """Compute Pearson correlations for STS-B.""" 547 | # Display labels and predictions 548 | concat1 = contrib_metrics.streaming_concat(logits) 549 | concat2 = contrib_metrics.streaming_concat(label_ids) 550 | 551 | # Compute Pearson correlation 552 | pearson = contrib_metrics.streaming_pearson_correlation( 553 | logits, label_ids, weights=is_real_example) 554 | 555 | # Compute MSE 556 | # mse = tf.metrics.mean(per_example_loss) 557 | mse = tf.metrics.mean_squared_error( 558 | label_ids, logits, weights=is_real_example) 559 | 560 | loss = tf.metrics.mean( 561 | values=per_example_loss, 562 | weights=is_real_example) 563 | 564 | return {"pred": concat1, "label_ids": concat2, "pearson": pearson, 565 | "MSE": mse, "eval_loss": loss,} 566 | elif task_name == "cola": 567 | def metric_fn(per_example_loss, label_ids, logits, is_real_example): 568 | """Compute Matthew's correlations for STS-B.""" 569 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 570 | # https://en.wikipedia.org/wiki/Matthews_correlation_coefficient 571 | tp, tp_op = tf.metrics.true_positives( 572 | predictions, label_ids, weights=is_real_example) 573 | tn, tn_op = tf.metrics.true_negatives( 574 | predictions, label_ids, weights=is_real_example) 575 | fp, fp_op = tf.metrics.false_positives( 576 | predictions, label_ids, weights=is_real_example) 577 | fn, fn_op = tf.metrics.false_negatives( 578 | predictions, label_ids, weights=is_real_example) 579 | 580 | # Compute Matthew's correlation 581 | mcc = tf.div_no_nan( 582 | tp * tn - fp * fn, 583 | tf.pow((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn), 0.5)) 584 | 585 | # Compute accuracy 586 | accuracy = tf.metrics.accuracy( 587 | labels=label_ids, predictions=predictions, 588 | weights=is_real_example) 589 | 590 | loss = tf.metrics.mean( 591 | values=per_example_loss, 592 | weights=is_real_example) 593 | 594 | return {"matthew_corr": (mcc, tf.group(tp_op, tn_op, fp_op, fn_op)), 595 | "eval_accuracy": accuracy, "eval_loss": loss,} 596 | 597 | eval_metrics = (metric_fn, 598 | [per_example_loss, label_ids, logits, is_real_example]) 599 | output_spec = contrib_tpu.TPUEstimatorSpec( 600 | mode=mode, 601 | loss=total_loss, 602 | eval_metrics=eval_metrics, 603 | scaffold_fn=scaffold_fn) 604 | else: 605 | output_spec = contrib_tpu.TPUEstimatorSpec( 606 | mode=mode, 607 | predictions={ 608 | "probabilities": probabilities, 609 | "predictions": predictions 610 | }, 611 | scaffold_fn=scaffold_fn) 612 | return output_spec 613 | 614 | return model_fn 615 | 616 | 617 | # This function is not used by this file but is still used by the Colab and 618 | # people who depend on it. 619 | def input_fn_builder(features, seq_length, is_training, drop_remainder): 620 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 621 | 622 | all_input_ids = [] 623 | all_input_mask = [] 624 | all_segment_ids = [] 625 | all_label_ids = [] 626 | 627 | for feature in features: 628 | all_input_ids.append(feature.input_ids) 629 | all_input_mask.append(feature.input_mask) 630 | all_segment_ids.append(feature.segment_ids) 631 | all_label_ids.append(feature.label_id) 632 | 633 | def input_fn(params): 634 | """The actual input function.""" 635 | batch_size = params["batch_size"] 636 | 637 | num_examples = len(features) 638 | 639 | # This is for demo purposes and does NOT scale to large data sets. 
We do 640 | # not use Dataset.from_generator() because that uses tf.py_func which is 641 | # not TPU compatible. The right way to load data is with TFRecordReader. 642 | d = tf.data.Dataset.from_tensor_slices({ 643 | "input_ids": 644 | tf.constant( 645 | all_input_ids, shape=[num_examples, seq_length], 646 | dtype=tf.int32), 647 | "input_mask": 648 | tf.constant( 649 | all_input_mask, 650 | shape=[num_examples, seq_length], 651 | dtype=tf.int32), 652 | "segment_ids": 653 | tf.constant( 654 | all_segment_ids, 655 | shape=[num_examples, seq_length], 656 | dtype=tf.int32), 657 | "label_ids": 658 | tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32), 659 | }) 660 | 661 | if is_training: 662 | d = d.repeat() 663 | d = d.shuffle(buffer_size=100) 664 | 665 | d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder) 666 | return d 667 | 668 | return input_fn 669 | 670 | 671 | # This function is not used by this file but is still used by the Colab and 672 | # people who depend on it. 673 | def convert_examples_to_features(examples, label_list, max_seq_length, 674 | tokenizer, task_name): 675 | """Convert a set of `InputExample`s to a list of `InputFeatures`.""" 676 | 677 | features = [] 678 | print('Length of examples:',len(examples)) 679 | for (ex_index, example) in enumerate(examples): 680 | if ex_index % 10000 == 0: 681 | #tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 682 | print("Writing example %d of %d" % (ex_index, len(examples))) 683 | feature = convert_single_example(ex_index, example, label_list, 684 | max_seq_length, tokenizer, task_name) 685 | 686 | features.append(feature) 687 | return features 688 | 689 | 690 | # Load parameters 691 | max_seq_length = hp.sequence_length 692 | do_lower_case = hp.do_lower_case 693 | vocab_file = hp.vocab_file 694 | tokenizer = tokenization.FullTokenizer.from_scratch(vocab_file=vocab_file, 695 | do_lower_case=do_lower_case, 696 | spm_model_file=None) 697 | processor = ClassifyProcessor() 698 | label_list = processor.get_labels() 699 | 700 | 701 | def get_features(): 702 | # Load train data 703 | train_examples = processor.get_train_examples(hp.data_dir) 704 | # Get onehot feature 705 | features = convert_examples_to_features( train_examples, label_list, max_seq_length, tokenizer,task_name='classify') 706 | input_ids = [f.input_ids for f in features] 707 | input_masks = [f.input_mask for f in features] 708 | segment_ids = [f.segment_ids for f in features] 709 | label_ids = [f.label_id for f in features] 710 | print('Get features finished!') 711 | return input_ids,input_masks,segment_ids,label_ids 712 | 713 | def get_features_test(): 714 | # Load test data 715 | train_examples = processor.get_test_examples(hp.data_dir) 716 | # Get onehot feature 717 | features = convert_examples_to_features( train_examples, label_list, max_seq_length, tokenizer,task_name='classify_test') 718 | input_ids = [f.input_ids for f in features] 719 | input_masks = [f.input_mask for f in features] 720 | segment_ids = [f.segment_ids for f in features] 721 | label_ids = [f.label_id for f in features] 722 | print('Get features(test) finished!') 723 | return input_ids,input_masks,segment_ids,label_ids 724 | 725 | 726 | def create_example(line,set_type): 727 | """Creates examples for the training and dev sets.""" 728 | guid = "%s-%s" % (set_type, 1) 729 | text_a = tokenization.convert_to_unicode(line[1]) 730 | label = tokenization.convert_to_unicode(line[0]) 731 | example = InputExample(guid=guid, text_a=text_a, text_b=None, label=label) 
732 | return example 733 | 734 | 735 | def get_feature_test(sentence): 736 | example = create_example(['0',sentence],'test') 737 | feature = convert_single_example(0, example, label_list,max_seq_length, tokenizer,task_name='classify') 738 | return feature.input_ids,feature.input_mask,feature.segment_ids,feature.label_id 739 | 740 | 741 | if __name__ == '__main__': 742 | # Get feature 743 | sentence = '天天向上' 744 | feature = get_feature_test(sentence) 745 | print('feature.input_ids',feature[0]) 746 | print('feature.input_mask',feature[1]) 747 | print('feature.segment_ids',feature[2]) 748 | print('feature.label_id',feature[3]) 749 | 750 | -------------------------------------------------------------------------------- /sentiment_analysis_albert/data/README.md: -------------------------------------------------------------------------------- 1 | # 数据集 2 | - data/sa_train.csv 3 | - data/sa_test.csv 4 | -------------------------------------------------------------------------------- /sentiment_analysis_albert/data/sa_test.csv: -------------------------------------------------------------------------------- 1 | content,label 2 | 新配色喜欢,1 3 | 一星也不想给,-1 4 | 我就想说上个苹果X我媳妇儿用了差不多两年,0 5 | 大部分人还是被iso的流畅度吸引,1 6 | 一共买了7台,0 7 | 全金属身、超薄机身,1 8 | 还是值得了,1 9 | 紫色真的很好看,1 10 | 下次介绍朋友买,1 11 | 四个档位的水雾都拍了一张图,0 12 | 外观我觉得还可以,1 13 | 而且快递竟然打电话说下午不送了,-1 14 | 比不多一个晚上,0 15 | 这说明原创上电量使用是吹了牛+的,-1 16 | 放客厅非常合适,1 17 | 可惜没那么多钱了,0 18 | 因为没有防抖,-1 19 | 一个办用,0 20 | 电陶炉不挑锅具真是太方便啦,1 21 | 安装师傅也很尽心尽力,1 22 | 但是就是等得太久了,-1 23 | 网络速度快了很多,1 24 | 非标快哦,1 25 | 目前体验了下,0 26 | 不带音频,0 27 | 让自己想办法,0 28 | 但是时间长了就好了,0 29 | 客服人员服务态度也很好,1 30 | 买来给老父亲的,0 31 | 简直就是卡死,-1 32 | 最烦的就是这个充电头了,-1 33 | 女生也能单手操作,1 34 | 总是乱来,-1 35 | 国际品牌,0 36 | 虽然四个滤芯儿是送的,0 37 | 商品跟原装的一样灵活,1 38 | 安装师傅态度还不错.第二天打了客服电话后,1 39 | 这次给女儿新房又买一款,1 40 | 家里基本上都是小米的产品,1 41 | 两个人用完全没问题,1 42 | 超级超级好,1 43 | 在4米范围内小声说话都能收到音,1 44 | 水质过滤后改善了好多,1 45 | 都是屏幕和外框接口有缝隙,-1 46 | 开始想买康佳用的是LG萍屏,0 47 | 说月底盘点月初发,0 48 | 70寸还带主机音响,1 49 | 主要功能用于看时间跟测各种健康指标,1 50 | 布料也比较结实,1 51 | 但是整体体验很好,1 52 | 可以洗好多碗具,1 53 | 看着就非常的爽,1 54 | 摁一下自动匹配,0 55 | 才想起来没有评论,0 56 | 手感较好,1 57 | 电量也只用了一格,1 58 | 用过两个了,0 59 | 你们自己慢慢品,0 60 | 整体的效果不错,1 61 | 就像地摊20块钱以下的东西,-1 62 | 水容器大,1 63 | 讲真的以后真的不想再买苹果了,-1 64 | 外形外观:非常的好看,1 65 | 屏幕音效:质还是不错的,1 66 | 然后才发现原来的手机是移动定制机,0 67 | 就直接放在保安室,0 68 | 外观在能接受的范围,1 69 | 电池续航能力也不错,1 70 | 噪音比以前的小太多,1 71 | 价格贵点,-1 72 | 老公也说好,1 73 | 再清澈一些就好了,0 74 | 然后快递师傅还挺好的,1 75 | 一路给我打电话送到家的,1 76 | 适合小设计师们,1 77 | 用了空气泡沫包装,0 78 | 送给对象很喜欢,1 79 | 这款Lenovo,0 80 | 只当给自己起心理安慰吧,1 81 | 这两天热的时候还能降降温,1 82 | 我家阳台小尺寸刚刚好,1 83 | 送个老公的生日礼物,0 84 | 虽说不是JD自己的物流(日日顺快递公司),0 85 | 屏幕音效:l很好的,1 86 | 垃圾的东西,-1 87 | 唯一的缺点应该是拍照和想象中的差点,-1 88 | 商家送的备用PP棉芯很好,1 89 | 望继续做好售后服务,0 90 | 还好客服售后耐心与我沟通消除误会,1 91 | 荣耀品质做工越来越好了,1 92 | 我买的是86寸的电视,0 93 | 国产机越来越好了,1 94 | 使用起来相当流畅,1 95 | 京东买的有保障,1 96 | 智能反应超级迟钝,-1 97 | 用起来键盘还不错,1 98 | 居然还是坏的,-1 99 | 甚至可以用评价高达模型的角度来评价他,0 100 | 给远程安装,0 101 | 同时存电话号码时拼音输不全,-1 -------------------------------------------------------------------------------- /sentiment_analysis_albert/data/sa_train.csv: -------------------------------------------------------------------------------- 1 | content,label 2 | 超划算的,1 3 | iPhone的性价比之王,1 4 | 一套不同国家电源转换插头,-1 5 | 由于安装师傅上门安装时,0 6 | 55英寸的这个创维,0 7 | 我更喜欢那个的感觉,1 8 | 拿到手机第二天莫名其妙保护膜就裂了,-1 9 | 全部试了一遍都不行,-1 10 | 送的装备也很齐全,1 11 | 原来买的是内存32G的我在用好用,1 12 | 因为修剪器充电口处有一凸起,-1 13 | 提升幸福感必备,1 14 | 9号送人,0 15 | 紫色骚气,1 16 | 搭配手持吸尘器,0 17 | 包装的很精细,1 18 | 买了好几个华为手机了,1 19 | 当时还一下子买了3台,0 20 | 很喜欢这个牌子的啊,1 21 | 运行速度:除了反应慢,-1 22 | 关键是买的别的加湿器都是水干自动断电,-1 23 | 我只想问你们一个电脑利润就这么高,-1 24 | 用用还可以暂时没有发现问题,1 25 | 在京东购物总是让我高兴,1 26 | 这个门卡使用非常灵敏、方便,1 
27 | 包装充分完好,1 28 | 华为的路由器好,1 29 | 希望以后也可以的屏幕跟音效都不错的,1 30 | 无论是正面还是侧面,0 31 | 高大上的东西,1 32 | 我的收据没帮开过来,-1 33 | 外形外观:和x一样,0 34 | 商家能积极处理客户碰到的问题,1 35 | 确实不会买苹果的手机,-1 36 | 目前使用一切正常,1 37 | 唯一觉得不足的就是app,-1 38 | 其他特色:现在我京东的速度很快,1 39 | 直接不回,-1 40 | 两个不足:,-1 41 | 很好支持国货,1 42 | 从内到外都很棒,1 43 | 的质量很好,1 44 | 待机时间:能从早用到晚,1 45 | 待机时间:续航是个鸡肋,-1 46 | 看看没动手,0 47 | 说不能直接放在地板上,0 48 | 后续体验再来追评,0 49 | 而且我还是联通卡,0 50 | 我们生活使用日常杠杠滴,1 51 | 充电慢费电快,-1 52 | 自从知道评论之后京豆可以抵现金了,0 53 | 运行速度:效果很不错,1 54 | 中午送到马上装好,0 55 | 然后Windows10和以前的系统有点不同,0 56 | 到的也快,1 57 | 感觉没有我的5T好用,-1 58 | 相信京东没问题,1 59 | 晚上睡觉开起好睡多了,1 60 | 华为是首选,1 61 | 还好可以,1 62 | 适合单身租房的,1 63 | 但是只给5月24以后买的保价,0 64 | 昨晚定,0 65 | 电池比较耐用、不再担心一天一次冲,1 66 | 已经离不开戴森了,1 67 | 物流速度很快电脑到了没有任何瑕疵,1 68 | 感觉音质很棒,1 69 | 待机时间没使用的情况下4天左右可以的,0 70 | 拍照效果:基本不用,0 71 | 这次活动价,0 72 | 东西用着很好,1 73 | 好不好用还没试,0 74 | 后悔换苹果了,-1 75 | 看起来挺不错,1 76 | 以后还找你购买,1 77 | 总算是没有让人失望,1 78 | 待机时间:全是很久,1 79 | 和v20犹豫了一番,0 80 | 扫拖一起干,1 81 | 这次买净水器滤芯也是很坚决,0 82 | 买的也都是256G,0 83 | 反复装卡重启都不行,-1 84 | 我觉得有点厚,-1 85 | 结果京东物流速度很快,1 86 | 公司常年采购,0 87 | 到货速度稍慢些,-1 88 | 宝宝们都很喜欢,1 89 | 外观看上去赞,1 90 | 这个电水壶确实好用,1 91 | 物流又给力,1 92 | 在门上有档次解锁识别速度,1 93 | 商家非常诚信,1 94 | 半个月过去了,0 95 | 容量大小:3口之家正合适,1 96 | 而且京东快递小姐姐服务还好,1 97 | 平时折叠放着特别的小,0 98 | 屏幕音效:屏幕a屏,0 99 | 由于是双十一买的,0 100 | 也优惠不了几块钱,-1 101 | 打400电话也不给安装,-1 102 | 电脑性价比可以,1 103 | 送货的师傅服务超好,1 104 | 拍照效果:拍照效果就是下面那个特别真实清晰,1 105 | 这次装修先预埋,0 106 | 这款颜色漂亮,1 107 | 我严重怀疑他们发的是翻新机,-1 108 | 一个人用大小刚好,1 109 | 二档稍微有点声音,0 110 | 比如刷,0 111 | 手机一直忘了来评价,0 112 | 好看的不得了,1 113 | 而且贵的壶和便宜的区别不大,-1 114 | 用了很长时间才来评论的,0 115 | 但是你要是戴上眼镜就不能识别了,-1 116 | 尺度也够用,1 117 | 好在很清晰,1 118 | 节省了地方,1 119 | 妈妈换手机很方便,1 120 | 运行速度:运行速度比其它手机更优秀,1 121 | 师傅送货也快,1 122 | 就拍了,0 123 | 自己买配件升级拖地,0 124 | 另外充电器有点大不太方便携带,-1 125 | 指示灯都是准确完好,1 126 | 外形外观:高端大气配置较高,1 127 | 非常失望的一次购物,-1 128 | 东西真心一般,0 129 | 还赠送了手机卡,1 130 | 忘记拍照就不上图了,0 131 | 事很多,0 132 | 比L580画面清晰,1 133 | 音质很好画质也很清晰,1 134 | 极具个性,1 135 | 作为入伙礼物买的,0 136 | 再到指纹锁的使用,0 137 | 售前售后都好,1 138 | 热情服务完,1 139 | 产品吸尘很好,1 140 | 这个价格买的很值得,1 141 | 这次买的电器很满意,1 142 | 和我同事华为的比快多了,1 143 | 以后买手机还来这买,1 144 | 中度使用也能一天,1 145 | 外形外观:手机,0 146 | 最最重要的是拿到手已经贴了原装手机屏幕膜,1 147 | 金属后盖,0 148 | 855puls稳得一批,1 149 | 办公用挺好,1 150 | 买来运动计步用的,0 151 | 扫起来很干净,1 152 | 帮朋友公司购买,0 153 | 却是没有水雾,-1 154 | 这款ⅵvo手机配骁龙855芯片,0 155 | 这个试了,0 156 | 主要客服也非常好哦,1 157 | 没想到千元机也这么强大,1 158 | 老婆很喜欢大小正合适,1 159 | 肖正友,0 160 | 发票到了,0 161 | 商家才有这样的底气,0 162 | 运行速度:麒麟980处理器运行速度很快,1 163 | 拍照质量真的超赞,1 164 | 噪音大小:还好,1 165 | 听着没啥感觉唉,0 166 | 来了就直接点亮了,0 167 | 相机算是对得起价格了,1 168 | 大小正好合适,1 169 | 沾了点水,0 170 | 就是插座这总插拔会不会坏,0 171 | 1.照相,0 172 | 卖家检测没问题,1 173 | 还一百多,0 174 | 诱光灯的效果相当不错,1 175 | 开关门声音太大了,-1 176 | 需要找人安装,0 177 | 颜值俱佳,1 178 | 装的时候就开了两三个小时还凉的挺快,1 179 | 所以硬盘都得多分点,0 180 | 值得回购哟,1 181 | 智能锁之初体验,0 182 | w这个配件还可以,1 183 | 然后还是拆开看了下,0 184 | 今年夏天特别的热,0 185 | 希望质量一如既往的好,1 186 | 购买之前对比了很多款式和牌子的电视,0 187 | 小孩子爱看动画片,1 188 | 其他的话有待日后确认,0 189 | 送给长辈用的,0 190 | 不过找遍了京东,0 191 | 好的卖家卖的宝贝,1 192 | 准备洗了,0 193 | 电视50寸以为没多大,0 194 | 600多的加湿器,0 195 | 搞得我水漏了一地,-1 196 | 一会要这样一会要那样,-1 197 | 五六个了,0 198 | 不知道是几手的,0 199 | 至少做工很不错,1 200 | 得亏有它,1 201 | 和电影里一样,0 202 | 新的刚到手就坏了你敢信,-1 203 | 基本每次充电不到半小时就能满,1 204 | 提示无网络,-1 205 | 东西巳收到,0 206 | 运行速度:暂时用起来很快,1 207 | 这两天主要试机,0 208 | 但是在京东买了这个给父亲用,0 209 | 但是图像处理能力很好,1 210 | 到今天一个月了,0 211 | 不要太挑剔,0 212 | 没其他毛病,1 213 | 解压压缩包就很热,-1 214 | 售后告诉我夏天要开6档,0 215 | 不过就是出雾好小,-1 216 | 最重要的是价格优惠,1 217 | 好说歹说收下了,0 218 | 充电二十多分钟就能充满,1 219 | 客服售后态度极差,-1 220 | 用了一天了没有一点水遗留在桌子上,1 221 | 没有蒸菜层的,-1 222 | 小不足,-1 223 | 耳机总是自己断开,-1 224 | 应该不影响使用,1 225 | 手机颜色跟图片有点不同,-1 226 | 自己找,0 227 | 值得购买大品牌,1 228 | 如果性价比可以再高一点的话就好啦,0 229 | 用十年是没问题的,1 230 | 按网上教程试了半天u盘里的都不识别,-1 231 | 
不再断线了、超爽,1 232 | 估计我得抱着它睡觉,0 233 | 这次基础版入手,0 234 | 送的屏幕膜,1 235 | 在反复开关机之后才激活成功,-1 236 | 因为比实体店会便宜实惠好多,1 237 | 还没用就先恢复出厂设置了,-1 238 | 体验下,0 239 | 红薯粉色小鳄鱼竿支架子鼓浪屿啊啊啊啊五环之歌名,0 240 | 但是客服连最基本的东西都不懂,-1 241 | 雾气刚开始很小,-1 242 | 其他特色:就是原装充电器充电很慢,-1 243 | 华为10plus,0 244 | 今天安上了,0 245 | 下次再试试看,0 246 | 加水量并不多,1 247 | 说是可以无理由退换,1 248 | 总结:不推荐此型号,-1 249 | 不知道怎么样用段时间在看看,0 250 | 未使用的情况下,0 251 | 还好是练体育的,0 252 | 但是运行速度还行,1 253 | 问了又说去那边问,-1 254 | 也买了一台,0 255 | 收到宝贝特意用了一段时间过来评价,0 256 | 安装的师傅也很尽心,1 257 | 想不想买也就自己看的办吧,0 258 | 安装起来也很简单,1 259 | 还送整机10年保修,1 260 | 其他特色:拍照效果好,1 261 | 放两条就洗不干净了,-1 262 | 也不厚,0 263 | 对用户使用很友好,1 264 | 物流服务都超好,1 265 | 就是最简单的一种,1 266 | 比想象中的要小的多,-1 267 | 没想到这么大声音,-1 268 | 电视语音识别度很高在厨房都可以听清楚你要说的话,1 269 | 但是还是稍微重了一些,-1 270 | 和商家描述得一致,1 271 | 手机外观轻薄款音效非常好,1 272 | 厂家直接给我换一个,1 273 | 放心了不少,1 274 | 这个还没换呢,0 275 | 装上后那水立刻变得清澈透明,1 276 | 质量问题,-1 277 | 可以无线打印,1 278 | 比较热情,1 279 | 这种电器之类的,0 280 | 安装师傅的服务态度非常好,1 281 | 质量也满意,1 282 | 按照他们的方法,0 283 | 今天最后一局还剩百分之11的电我又开了一局,0 284 | 很好正品,1 285 | 亲测待机量30+,0 286 | 但是想起这个屏幕,0 287 | 沉浸式音乐感受&rdquo,1 288 | 京东大家电价格便宜,1 289 | 然后色彩还原度也比较好,1 290 | 1-2秒就能识别,1 291 | 明明是想买路由器的,0 292 | 能带大耳机,0 293 | 商品没啥,0 294 | 出差、旅行携带方便,1 295 | 声明快递员很给力,1 296 | 现在的快递都是下楼自己拿的,0 297 | 还没有我九块九三张的好,-1 298 | 特意把儿子煲好的IE60拿了试了下,0 299 | 快递很差,-1 300 | 为健康观影保驾护航,0 301 | 外形外观:白色的加点彩虹一样的颜色,0 -------------------------------------------------------------------------------- /sentiment_analysis_albert/hyperparameters.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Nov 12 14:23:12 2018 4 | 5 | @author: cm 6 | """ 7 | 8 | 9 | import os 10 | import sys 11 | pwd = os.path.dirname(os.path.abspath(__file__)) 12 | sys.path.append(pwd) 13 | 14 | 15 | class Hyperparamters: 16 | # Train parameters 17 | num_train_epochs = 5 18 | print_step = 10 19 | batch_size = 64 20 | batch_size_eval = 128 21 | summary_step = 10 22 | num_saved_per_epoch = 3 23 | max_to_keep = 100 24 | 25 | # Model paths 26 | logdir = 'logdir/model_01' 27 | file_save_model = 'model/model_save' 28 | file_load_model = 'model/model_load' 29 | 30 | # Train data and test data 31 | train_data = "sa_train.csv" 32 | test_data = "sa_test.csv" 33 | 34 | # Optimization parameters 35 | warmup_proportion = 0.1 36 | use_tpu = None 37 | do_lower_case = True 38 | learning_rate = 5e-5 39 | 40 | # TextCNN parameters 41 | num_filters = 128 42 | filter_sizes = [2,3,4,5,6,7] 43 | embedding_size = 384 44 | keep_prob = 0.5 45 | 46 | # Sequence and Label 47 | sequence_length = 60 48 | num_labels = 3 49 | dict_label = { 50 | '0': '-1', 51 | '1': '0', 52 | '2': '1'} 53 | 54 | # ALBERT parameters 55 | name = 'albert_small_zh_google' 56 | bert_path = os.path.join(pwd,name) 57 | data_dir = os.path.join(pwd,'data') 58 | vocab_file = os.path.join(pwd,name,'vocab_chinese.txt') 59 | init_checkpoint = os.path.join(pwd,name,'albert_model.ckpt') 60 | saved_model_path = os.path.join(pwd,'model') 61 | 62 | -------------------------------------------------------------------------------- /sentiment_analysis_albert/image/model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/sentiment_analysis_albert/image/model.png -------------------------------------------------------------------------------- /sentiment_analysis_albert/lamb_optimizer.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 
The Google AI Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # Lint as: python2, python3 16 | """Functions and classes related to optimization (weight updates).""" 17 | 18 | from __future__ import absolute_import 19 | from __future__ import division 20 | from __future__ import print_function 21 | 22 | import re 23 | import six 24 | import tensorflow.compat.v1 as tf 25 | 26 | # pylint: disable=g-direct-tensorflow-import 27 | from tensorflow.python.ops import array_ops 28 | from tensorflow.python.ops import linalg_ops 29 | from tensorflow.python.ops import math_ops 30 | # pylint: enable=g-direct-tensorflow-import 31 | 32 | 33 | class LAMBOptimizer(tf.train.Optimizer): 34 | """LAMB (Layer-wise Adaptive Moments optimizer for Batch training).""" 35 | # A new optimizer that includes correct L2 weight decay, adaptive 36 | # element-wise updating, and layer-wise justification. The LAMB optimizer 37 | # was proposed by Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, 38 | # James Demmel, and Cho-Jui Hsieh in a paper titled as Reducing BERT 39 | # Pre-Training Time from 3 Days to 76 Minutes (arxiv.org/abs/1904.00962) 40 | 41 | def __init__(self, 42 | learning_rate, 43 | weight_decay_rate=0.0, 44 | beta_1=0.9, 45 | beta_2=0.999, 46 | epsilon=1e-6, 47 | exclude_from_weight_decay=None, 48 | exclude_from_layer_adaptation=None, 49 | name="LAMBOptimizer"): 50 | """Constructs a LAMBOptimizer.""" 51 | super(LAMBOptimizer, self).__init__(False, name) 52 | 53 | self.learning_rate = learning_rate 54 | self.weight_decay_rate = weight_decay_rate 55 | self.beta_1 = beta_1 56 | self.beta_2 = beta_2 57 | self.epsilon = epsilon 58 | self.exclude_from_weight_decay = exclude_from_weight_decay 59 | # exclude_from_layer_adaptation is set to exclude_from_weight_decay if the 60 | # arg is None. 61 | # TODO(jingli): validate if exclude_from_layer_adaptation is necessary. 62 | if exclude_from_layer_adaptation: 63 | self.exclude_from_layer_adaptation = exclude_from_layer_adaptation 64 | else: 65 | self.exclude_from_layer_adaptation = exclude_from_weight_decay 66 | 67 | def apply_gradients(self, grads_and_vars, global_step=None, name=None): 68 | """See base class.""" 69 | assignments = [] 70 | for (grad, param) in grads_and_vars: 71 | if grad is None or param is None: 72 | continue 73 | 74 | param_name = self._get_variable_name(param.name) 75 | 76 | m = tf.get_variable( 77 | name=six.ensure_str(param_name) + "/adam_m", 78 | shape=param.shape.as_list(), 79 | dtype=tf.float32, 80 | trainable=False, 81 | initializer=tf.zeros_initializer()) 82 | v = tf.get_variable( 83 | name=six.ensure_str(param_name) + "/adam_v", 84 | shape=param.shape.as_list(), 85 | dtype=tf.float32, 86 | trainable=False, 87 | initializer=tf.zeros_initializer()) 88 | 89 | # Standard Adam update. 
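      # The three statements below implement the standard update:
      #   m_t = beta_1 * m_{t-1} + (1 - beta_1) * g_t
      #   v_t = beta_2 * v_{t-1} + (1 - beta_2) * g_t^2
      #   update_t = m_t / (sqrt(v_t) + epsilon)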
90 | next_m = ( 91 | tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad)) 92 | next_v = ( 93 | tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2, 94 | tf.square(grad))) 95 | 96 | update = next_m / (tf.sqrt(next_v) + self.epsilon) 97 | 98 | # Just adding the square of the weights to the loss function is *not* 99 | # the correct way of using L2 regularization/weight decay with Adam, 100 | # since that will interact with the m and v parameters in strange ways. 101 | # 102 | # Instead we want ot decay the weights in a manner that doesn't interact 103 | # with the m/v parameters. This is equivalent to adding the square 104 | # of the weights to the loss with plain (non-momentum) SGD. 105 | if self._do_use_weight_decay(param_name): 106 | update += self.weight_decay_rate * param 107 | 108 | ratio = 1.0 109 | if self._do_layer_adaptation(param_name): 110 | w_norm = linalg_ops.norm(param, ord=2) 111 | g_norm = linalg_ops.norm(update, ord=2) 112 | ratio = array_ops.where(math_ops.greater(w_norm, 0), array_ops.where( 113 | math_ops.greater(g_norm, 0), (w_norm / g_norm), 1.0), 1.0) 114 | 115 | update_with_lr = ratio * self.learning_rate * update 116 | 117 | next_param = param - update_with_lr 118 | 119 | assignments.extend( 120 | [param.assign(next_param), 121 | m.assign(next_m), 122 | v.assign(next_v)]) 123 | return tf.group(*assignments, name=name) 124 | 125 | def _do_use_weight_decay(self, param_name): 126 | """Whether to use L2 weight decay for `param_name`.""" 127 | if not self.weight_decay_rate: 128 | return False 129 | if self.exclude_from_weight_decay: 130 | for r in self.exclude_from_weight_decay: 131 | if re.search(r, param_name) is not None: 132 | return False 133 | return True 134 | 135 | def _do_layer_adaptation(self, param_name): 136 | """Whether to do layer-wise learning rate adaptation for `param_name`.""" 137 | if self.exclude_from_layer_adaptation: 138 | for r in self.exclude_from_layer_adaptation: 139 | if re.search(r, param_name) is not None: 140 | return False 141 | return True 142 | 143 | def _get_variable_name(self, param_name): 144 | """Get the variable name from the tensor name.""" 145 | m = re.match("^(.*):\\d+$", six.ensure_str(param_name)) 146 | if m is not None: 147 | param_name = m.group(1) 148 | return param_name 149 | -------------------------------------------------------------------------------- /sentiment_analysis_albert/logdir/model_01/events.out.tfevents.1592553634.DESKTOP-QC1A83I: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/sentiment_analysis_albert/logdir/model_01/events.out.tfevents.1592553634.DESKTOP-QC1A83I -------------------------------------------------------------------------------- /sentiment_analysis_albert/logdir/model_01/events.out.tfevents.1592553671.DESKTOP-QC1A83I: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/sentiment_analysis_albert/logdir/model_01/events.out.tfevents.1592553671.DESKTOP-QC1A83I -------------------------------------------------------------------------------- /sentiment_analysis_albert/model/model_load/README.md: -------------------------------------------------------------------------------- 1 | # 推理所用的模型 2 | - model_load/checkpoit 3 | - model_load/model_1_0.ckpt.data-00000-of-00001 4 | - model_load/model_1_0.ckpt.index 5 | - 
model_load/model_1_0.ckpt.meta 6 | 7 | **checkpoint内容** 8 | ``` 9 | model_checkpoint_path: "model_1_0.ckpt" 10 | ``` 11 | -------------------------------------------------------------------------------- /sentiment_analysis_albert/model/model_save/README.md: -------------------------------------------------------------------------------- 1 | # 训练过程所得模型 2 | -------------------------------------------------------------------------------- /sentiment_analysis_albert/modules.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Thu May 30 21:01:45 2019 4 | 5 | @author: cm 6 | """ 7 | 8 | 9 | import tensorflow as tf 10 | from tensorflow.contrib.rnn import DropoutWrapper 11 | from sentiment_analysis_albert.hyperparameters import Hyperparamters as hp 12 | 13 | 14 | 15 | def cell_textcnn(inputs,is_training): 16 | # Add a dimension in final shape 17 | inputs_expand = tf.expand_dims(inputs, -1) 18 | # Create a convolution + maxpool layer for each filter size 19 | pooled_outputs = [] 20 | with tf.name_scope("TextCNN"): 21 | for i, filter_size in enumerate(hp.filter_sizes): 22 | with tf.name_scope("conv-maxpool-%s" % filter_size): 23 | # Convolution Layer 24 | filter_shape = [filter_size, hp.embedding_size, 1, hp.num_filters] 25 | W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1),dtype=tf.float32, name="W") 26 | b = tf.Variable(tf.constant(0.1, shape=[hp.num_filters]),dtype=tf.float32, name="b") 27 | conv = tf.nn.conv2d( 28 | inputs_expand, 29 | W, 30 | strides=[1, 1, 1, 1], 31 | padding="VALID", 32 | name="conv") 33 | # Apply nonlinearity 34 | h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu") 35 | # Maxpooling over the outputs 36 | pooled = tf.nn.max_pool( 37 | h, 38 | ksize=[1, hp.sequence_length - filter_size + 1, 1, 1], 39 | strides=[1, 1, 1, 1], 40 | padding='VALID', 41 | name="pool") 42 | pooled_outputs.append(pooled) 43 | # Combine all the pooled features 44 | num_filters_total = hp.num_filters * len(hp.filter_sizes) 45 | h_pool = tf.concat(pooled_outputs, 3) 46 | h_pool_flat = tf.reshape(h_pool, [-1, num_filters_total]) 47 | # Dropout 48 | h_pool_flat_dropout = tf.nn.dropout(h_pool_flat, keep_prob=hp.keep_prob if is_training else 1) 49 | return h_pool_flat_dropout 50 | 51 | 52 | 53 | 54 | 55 | -------------------------------------------------------------------------------- /sentiment_analysis_albert/networks.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Thu May 30 20:44:42 2019 4 | 5 | @author: cm 6 | """ 7 | 8 | import os 9 | import tensorflow as tf 10 | from sentiment_analysis_albert import modeling,optimization 11 | from sentiment_analysis_albert.classifier_utils import ClassifyProcessor 12 | from sentiment_analysis_albert.modules import cell_textcnn 13 | from sentiment_analysis_albert.hyperparameters import Hyperparamters as hp 14 | from sentiment_analysis_albert.utils import time_now_string 15 | 16 | 17 | num_labels = hp.num_labels 18 | processor = ClassifyProcessor() 19 | bert_config_file = os.path.join(hp.bert_path,'albert_config.json') 20 | bert_config = modeling.AlbertConfig.from_json_file(bert_config_file) 21 | 22 | 23 | def count_model_params(): 24 | """ 25 | Compte the parameters 26 | """ 27 | total_parameters = 0 28 | for variable in tf.trainable_variables(): 29 | shape = variable.get_shape() 30 | variable_parameters = 1 31 | for dim in shape: 32 | variable_parameters *= dim.value 33 | 
total_parameters += variable_parameters 34 | print(' + Number of params: %.2fM' % (total_parameters / 1e6)) 35 | 36 | 37 | class NetworkAlbert(object): 38 | def __init__(self,is_training): 39 | self.is_training = is_training 40 | self.input_ids = tf.placeholder(tf.int32, shape=[None, hp.sequence_length], name='input_ids') 41 | self.input_masks = tf.placeholder(tf.int32, shape=[None, hp.sequence_length], name='input_masks') 42 | self.segment_ids = tf.placeholder(tf.int32, shape=[None, hp.sequence_length], name='segment_ids') 43 | self.label_ids = tf.placeholder(tf.int32, shape=[None], name='label_ids') 44 | # Load BERT Pre-training LM 45 | self.model = modeling.AlbertModel( 46 | config=bert_config, 47 | is_training=self.is_training, 48 | input_ids=self.input_ids, 49 | input_mask=self.input_masks, 50 | token_type_ids=self.segment_ids, 51 | use_one_hot_embeddings=False) 52 | 53 | # Get the feature vector with size 3D:(batch_size,sequence_length,hidden_size) 54 | output_layer_init = self.model.get_sequence_output() 55 | # Cell textcnn 56 | output_layer = cell_textcnn(output_layer_init,self.is_training) 57 | # Hidden size 58 | hidden_size = output_layer.shape[-1].value 59 | # Dense 60 | with tf.name_scope("Full-connection"): 61 | output_weights = tf.get_variable( 62 | "output_weights", [num_labels, hidden_size], 63 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 64 | 65 | output_bias = tf.get_variable( 66 | "output_bias", [num_labels], initializer=tf.zeros_initializer()) 67 | # Logit 68 | logits = tf.matmul(output_layer, output_weights, transpose_b=True) 69 | self.logits = tf.nn.bias_add(logits, output_bias) 70 | self.probabilities = tf.nn.softmax(self.logits, axis=-1) 71 | # Prediction 72 | with tf.variable_scope("Prediction"): 73 | self.preds = tf.argmax(self.logits, axis=-1, output_type=tf.int32) 74 | # Summary for tensorboard 75 | with tf.variable_scope("Loss"): 76 | if self.is_training: 77 | self.accuracy = tf.reduce_mean(tf.to_float(tf.equal(self.preds, self.label_ids))) 78 | tf.summary.scalar('Accuracy', self.accuracy) 79 | 80 | # Check whether has loaded model 81 | ckpt = tf.train.get_checkpoint_state(hp.saved_model_path) 82 | checkpoint_suffix = ".index" 83 | if ckpt and tf.gfile.Exists(ckpt.model_checkpoint_path + checkpoint_suffix): 84 | print('='*10,'Restoring model from checkpoint!','='*10) 85 | print("%s - Restoring model from checkpoint ~%s" % (time_now_string(), 86 | ckpt.model_checkpoint_path)) 87 | else: 88 | # Load BERT Pre-training LM 89 | print('='*10,'First time load BERT model!','='*10) 90 | tvars = tf.trainable_variables() 91 | if hp.init_checkpoint: 92 | (assignment_map, initialized_variable_names) = \ 93 | modeling.get_assignment_map_from_checkpoint(tvars, 94 | hp.init_checkpoint) 95 | tf.train.init_from_checkpoint(hp.init_checkpoint, assignment_map) 96 | 97 | # Optimization 98 | if self.is_training: 99 | # Global_step 100 | self.global_step = tf.Variable(0, name='global_step', trainable=False) 101 | # Loss 102 | log_probs = tf.nn.log_softmax(self.logits, axis=-1) 103 | one_hot_labels = tf.one_hot(self.label_ids, depth=num_labels, dtype=tf.float32) 104 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 105 | self.loss = tf.reduce_mean(per_example_loss) 106 | # Optimizer 107 | train_examples = processor.get_train_examples(hp.data_dir) 108 | num_train_steps = int( 109 | len(train_examples) / hp.batch_size * hp.num_train_epochs) 110 | num_warmup_steps = int(num_train_steps * hp.warmup_proportion) 111 | self.optimizer = 
optimization.create_optimizer(self.loss, 112 | hp.learning_rate, 113 | num_train_steps, 114 | num_warmup_steps, 115 | hp.use_tpu, 116 | Global_step=self.global_step, 117 | ) 118 | # Summary for tensorboard 119 | tf.summary.scalar('loss', self.loss) 120 | self.merged = tf.summary.merge_all() 121 | 122 | # Compte the parameters 123 | count_model_params() 124 | vs = tf.trainable_variables() 125 | for l in vs: 126 | print(l) 127 | 128 | 129 | 130 | 131 | if __name__ == '__main__': 132 | # Load model 133 | albert = NetworkAlbert(is_training=True) 134 | 135 | 136 | -------------------------------------------------------------------------------- /sentiment_analysis_albert/optimization.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # Lint as: python2, python3 16 | """Functions and classes related to optimization (weight updates).""" 17 | 18 | from __future__ import absolute_import 19 | from __future__ import division 20 | from __future__ import print_function 21 | import re 22 | import lamb_optimizer 23 | import six 24 | from six.moves import zip 25 | import tensorflow.compat.v1 as tf 26 | from tensorflow.contrib import tpu as contrib_tpu 27 | 28 | 29 | def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu,Global_step, 30 | optimizer="adamw", poly_power=1.0, start_warmup_step=0): 31 | #optimizer="adamw", poly_power=1.0, start_warmup_step=0): 32 | """Creates an optimizer training op.""" 33 | if Global_step: 34 | global_step = Global_step 35 | else: 36 | global_step = tf.train.get_or_create_global_step() 37 | 38 | learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32) 39 | 40 | # Implements linear decay of the learning rate. 41 | learning_rate = tf.train.polynomial_decay( 42 | learning_rate, 43 | global_step, 44 | num_train_steps, 45 | end_learning_rate=0.0, 46 | power=poly_power, 47 | cycle=False) 48 | 49 | # Implements linear warmup. I.e., if global_step - start_warmup_step < 50 | # num_warmup_steps, the learning rate will be 51 | # `(global_step - start_warmup_step)/num_warmup_steps * init_lr`. 
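  # For example (illustrative numbers): with init_lr = 5e-5, start_warmup_step = 0 and
  # num_warmup_steps = 1000, step 100 trains with lr = (100 / 1000) * 5e-5 = 5e-6.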
52 | if num_warmup_steps: 53 | tf.logging.info("++++++ warmup starts at step " + str(start_warmup_step) 54 | + ", for " + str(num_warmup_steps) + " steps ++++++") 55 | global_steps_int = tf.cast(global_step, tf.int32) 56 | start_warm_int = tf.constant(start_warmup_step, dtype=tf.int32) 57 | global_steps_int = global_steps_int - start_warm_int 58 | warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32) 59 | 60 | global_steps_float = tf.cast(global_steps_int, tf.float32) 61 | warmup_steps_float = tf.cast(warmup_steps_int, tf.float32) 62 | 63 | warmup_percent_done = global_steps_float / warmup_steps_float 64 | warmup_learning_rate = init_lr * warmup_percent_done 65 | 66 | is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32) 67 | learning_rate = ( 68 | (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate) 69 | 70 | # It is OK that you use this optimizer for finetuning, since this 71 | # is how the model was trained (note that the Adam m/v variables are NOT 72 | # loaded from init_checkpoint.) 73 | # It is OK to use AdamW in the finetuning even the model is trained by LAMB. 74 | # As report in the Bert pulic github, the learning rate for SQuAD 1.1 finetune 75 | # is 3e-5, 4e-5 or 5e-5. For LAMB, the users can use 3e-4, 4e-4,or 5e-4 for a 76 | # batch size of 64 in the finetune. 77 | if optimizer == "adamw": 78 | tf.logging.info("using adamw") 79 | optimizer = AdamWeightDecayOptimizer( 80 | learning_rate=learning_rate, 81 | weight_decay_rate=0.01, 82 | beta_1=0.9, 83 | beta_2=0.999, 84 | epsilon=1e-6, 85 | exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"]) 86 | elif optimizer == "lamb": 87 | tf.logging.info("using lamb") 88 | optimizer = lamb_optimizer.LAMBOptimizer( 89 | learning_rate=learning_rate, 90 | weight_decay_rate=0.01, 91 | beta_1=0.9, 92 | beta_2=0.999, 93 | epsilon=1e-6, 94 | exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"]) 95 | else: 96 | raise ValueError("Not supported optimizer: ", optimizer) 97 | 98 | if use_tpu: 99 | optimizer = contrib_tpu.CrossShardOptimizer(optimizer) 100 | 101 | tvars = tf.trainable_variables() 102 | grads = tf.gradients(loss, tvars) 103 | 104 | # This is how the model was pre-trained. 105 | (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0) 106 | 107 | train_op = optimizer.apply_gradients( 108 | list(zip(grads, tvars)), global_step=global_step) 109 | 110 | # Normally the global step update is done inside of `apply_gradients`. 111 | # However, neither `AdamWeightDecayOptimizer` nor `LAMBOptimizer` do this. 112 | # But if you use a different optimizer, you should probably take this line 113 | # out. 
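  # (Neither AdamWeightDecayOptimizer nor LAMBOptimizer increments the step itself:
  # their `apply_gradients` only groups the parameter/momentum assign ops, so the
  # global step is bumped manually below.)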
114 | new_global_step = global_step + 1 115 | train_op = tf.group(train_op, [global_step.assign(new_global_step)]) 116 | return train_op 117 | 118 | 119 | class AdamWeightDecayOptimizer(tf.train.Optimizer): 120 | """A basic Adam optimizer that includes "correct" L2 weight decay.""" 121 | 122 | def __init__(self, 123 | learning_rate, 124 | weight_decay_rate=0.0, 125 | beta_1=0.9, 126 | beta_2=0.999, 127 | epsilon=1e-6, 128 | exclude_from_weight_decay=None, 129 | name="AdamWeightDecayOptimizer"): 130 | """Constructs a AdamWeightDecayOptimizer.""" 131 | super(AdamWeightDecayOptimizer, self).__init__(False, name) 132 | 133 | self.learning_rate = learning_rate 134 | self.weight_decay_rate = weight_decay_rate 135 | self.beta_1 = beta_1 136 | self.beta_2 = beta_2 137 | self.epsilon = epsilon 138 | self.exclude_from_weight_decay = exclude_from_weight_decay 139 | 140 | def apply_gradients(self, grads_and_vars, global_step=None, name=None): 141 | """See base class.""" 142 | assignments = [] 143 | for (grad, param) in grads_and_vars: 144 | if grad is None or param is None: 145 | continue 146 | 147 | param_name = self._get_variable_name(param.name) 148 | 149 | m = tf.get_variable( 150 | name=six.ensure_str(param_name) + "/adam_m", 151 | shape=param.shape.as_list(), 152 | dtype=tf.float32, 153 | trainable=False, 154 | initializer=tf.zeros_initializer()) 155 | v = tf.get_variable( 156 | name=six.ensure_str(param_name) + "/adam_v", 157 | shape=param.shape.as_list(), 158 | dtype=tf.float32, 159 | trainable=False, 160 | initializer=tf.zeros_initializer()) 161 | 162 | # Standard Adam update. 163 | next_m = ( 164 | tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad)) 165 | next_v = ( 166 | tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2, 167 | tf.square(grad))) 168 | 169 | update = next_m / (tf.sqrt(next_v) + self.epsilon) 170 | 171 | # Just adding the square of the weights to the loss function is *not* 172 | # the correct way of using L2 regularization/weight decay with Adam, 173 | # since that will interact with the m and v parameters in strange ways. 174 | # 175 | # Instead we want ot decay the weights in a manner that doesn't interact 176 | # with the m/v parameters. This is equivalent to adding the square 177 | # of the weights to the loss with plain (non-momentum) SGD. 
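      # Putting the pieces below together, the decoupled (AdamW-style) update is:
      #   param <- param - learning_rate * (next_m / (sqrt(next_v) + epsilon)
      #                                     + weight_decay_rate * param)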
178 | if self._do_use_weight_decay(param_name): 179 | update += self.weight_decay_rate * param 180 | 181 | update_with_lr = self.learning_rate * update 182 | 183 | next_param = param - update_with_lr 184 | 185 | assignments.extend( 186 | [param.assign(next_param), 187 | m.assign(next_m), 188 | v.assign(next_v)]) 189 | return tf.group(*assignments, name=name) 190 | 191 | def _do_use_weight_decay(self, param_name): 192 | """Whether to use L2 weight decay for `param_name`.""" 193 | if not self.weight_decay_rate: 194 | return False 195 | if self.exclude_from_weight_decay: 196 | for r in self.exclude_from_weight_decay: 197 | if re.search(r, param_name) is not None: 198 | return False 199 | return True 200 | 201 | def _get_variable_name(self, param_name): 202 | """Get the variable name from the tensor name.""" 203 | m = re.match("^(.*):\\d+$", six.ensure_str(param_name)) 204 | if m is not None: 205 | param_name = m.group(1) 206 | return param_name 207 | -------------------------------------------------------------------------------- /sentiment_analysis_albert/predict.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Thu May 30 17:12:37 2019 4 | 5 | @author: cm 6 | """ 7 | 8 | 9 | import os 10 | pwd = os.path.dirname(os.path.abspath(__file__)) 11 | #os.environ["CUDA_VISIBLE_DEVICES"] = '-1' 12 | import sys 13 | sys.path.append(os.path.dirname(os.path.dirname(__file__))) 14 | import tensorflow as tf 15 | from sentiment_analysis_albert.networks import NetworkAlbert 16 | from sentiment_analysis_albert.classifier_utils import get_feature_test 17 | from sentiment_analysis_albert.hyperparameters import Hyperparamters as hp 18 | 19 | 20 | 21 | class ModelAlbertTextCNN(object): 22 | """ 23 | Load NetworkAlbert TextCNN model 24 | """ 25 | def __init__(self,): 26 | self.albert, self.sess = self.load_model() 27 | @staticmethod 28 | def load_model(): 29 | with tf.Graph().as_default(): 30 | sess = tf.Session() 31 | out_dir = os.path.join(pwd, "model") 32 | with sess.as_default(): 33 | albert = NetworkAlbert(is_training=False) 34 | saver = tf.train.Saver() 35 | sess.run(tf.global_variables_initializer()) 36 | checkpoint_dir = os.path.abspath(os.path.join(out_dir,hp.file_load_model)) 37 | print (checkpoint_dir) 38 | ckpt = tf.train.get_checkpoint_state(checkpoint_dir) 39 | saver.restore(sess, ckpt.model_checkpoint_path) 40 | return albert,sess 41 | 42 | MODEL = ModelAlbertTextCNN() 43 | print('Load model finished!') 44 | 45 | 46 | 47 | def sa(sentence): 48 | """ 49 | Prediction of the sentence's sentiment. 
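    Returns an integer in {-1, 0, 1}: the predicted label index shifted down by one,
    matching the -1/0/1 (negative/neutral/positive) label scheme described in the README.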
50 | """ 51 | feature = get_feature_test(sentence) 52 | fd = {MODEL.albert.input_ids: [feature[0]], 53 | MODEL.albert.input_masks: [feature[1]], 54 | MODEL.albert.segment_ids:[feature[2]], 55 | } 56 | output = MODEL.sess.run(MODEL.albert.preds, feed_dict=fd) 57 | return output[0]-1 58 | 59 | 60 | 61 | 62 | if __name__ == '__main__': 63 | ## 64 | import time 65 | start = time.time() 66 | sent = '我喜欢这个地方' 67 | print(sa(sent)) 68 | end = time.time() 69 | print(end-start) 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | -------------------------------------------------------------------------------- /sentiment_analysis_albert/requirements.txt: -------------------------------------------------------------------------------- 1 | tensorflow1.15 2 | sentencepiece 3 | -------------------------------------------------------------------------------- /sentiment_analysis_albert/tokenization.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | # Lint as: python2, python3 16 | # coding=utf-8 17 | """Tokenization classes.""" 18 | 19 | from __future__ import absolute_import 20 | from __future__ import division 21 | from __future__ import print_function 22 | 23 | import collections 24 | import re 25 | import unicodedata 26 | import six 27 | from six.moves import range 28 | import tensorflow.compat.v1 as tf 29 | #import tensorflow_hub as hub 30 | import sentencepiece as spm 31 | 32 | SPIECE_UNDERLINE = u"▁".encode("utf-8") 33 | 34 | 35 | def validate_case_matches_checkpoint(do_lower_case, init_checkpoint): 36 | """Checks whether the casing config is consistent with the checkpoint name.""" 37 | 38 | # The casing has to be passed in by the user and there is no explicit check 39 | # as to whether it matches the checkpoint. The casing information probably 40 | # should have been stored in the bert_config.json file, but it's not, so 41 | # we have to heuristically detect it to validate. 
42 | 43 | if not init_checkpoint: 44 | return 45 | 46 | m = re.match("^.*?([A-Za-z0-9_-]+)/bert_model.ckpt", 47 | six.ensure_str(init_checkpoint)) 48 | if m is None: 49 | return 50 | 51 | model_name = m.group(1) 52 | 53 | lower_models = [ 54 | "uncased_L-24_H-1024_A-16", "uncased_L-12_H-768_A-12", 55 | "multilingual_L-12_H-768_A-12", "chinese_L-12_H-768_A-12" 56 | ] 57 | 58 | cased_models = [ 59 | "cased_L-12_H-768_A-12", "cased_L-24_H-1024_A-16", 60 | "multi_cased_L-12_H-768_A-12" 61 | ] 62 | 63 | is_bad_config = False 64 | if model_name in lower_models and not do_lower_case: 65 | is_bad_config = True 66 | actual_flag = "False" 67 | case_name = "lowercased" 68 | opposite_flag = "True" 69 | 70 | if model_name in cased_models and do_lower_case: 71 | is_bad_config = True 72 | actual_flag = "True" 73 | case_name = "cased" 74 | opposite_flag = "False" 75 | 76 | if is_bad_config: 77 | raise ValueError( 78 | "You passed in `--do_lower_case=%s` with `--init_checkpoint=%s`. " 79 | "However, `%s` seems to be a %s model, so you " 80 | "should pass in `--do_lower_case=%s` so that the fine-tuning matches " 81 | "how the model was pre-training. If this error is wrong, please " 82 | "just comment out this check." % (actual_flag, init_checkpoint, 83 | model_name, case_name, opposite_flag)) 84 | 85 | 86 | def preprocess_text(inputs, remove_space=True, lower=False): 87 | """preprocess data by removing extra space and normalize data.""" 88 | outputs = inputs 89 | if remove_space: 90 | outputs = " ".join(inputs.strip().split()) 91 | 92 | if six.PY2 and isinstance(outputs, str): 93 | try: 94 | outputs = six.ensure_text(outputs, "utf-8") 95 | except UnicodeDecodeError: 96 | outputs = six.ensure_text(outputs, "latin-1") 97 | 98 | outputs = unicodedata.normalize("NFKD", outputs) 99 | outputs = "".join([c for c in outputs if not unicodedata.combining(c)]) 100 | if lower: 101 | outputs = outputs.lower() 102 | 103 | return outputs 104 | 105 | 106 | def encode_pieces(sp_model, text, return_unicode=True, sample=False): 107 | """turn sentences into word pieces.""" 108 | 109 | if six.PY2 and isinstance(text, six.text_type): 110 | text = six.ensure_binary(text, "utf-8") 111 | 112 | if not sample: 113 | pieces = sp_model.EncodeAsPieces(text) 114 | else: 115 | pieces = sp_model.SampleEncodeAsPieces(text, 64, 0.1) 116 | new_pieces = [] 117 | for piece in pieces: 118 | piece = printable_text(piece) 119 | if len(piece) > 1 and piece[-1] == "," and piece[-2].isdigit(): 120 | cur_pieces = sp_model.EncodeAsPieces( 121 | six.ensure_binary(piece[:-1]).replace(SPIECE_UNDERLINE, b"")) 122 | if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE: 123 | if len(cur_pieces[0]) == 1: 124 | cur_pieces = cur_pieces[1:] 125 | else: 126 | cur_pieces[0] = cur_pieces[0][1:] 127 | cur_pieces.append(piece[-1]) 128 | new_pieces.extend(cur_pieces) 129 | else: 130 | new_pieces.append(piece) 131 | 132 | # note(zhiliny): convert back to unicode for py2 133 | if six.PY2 and return_unicode: 134 | ret_pieces = [] 135 | for piece in new_pieces: 136 | if isinstance(piece, str): 137 | piece = six.ensure_text(piece, "utf-8") 138 | ret_pieces.append(piece) 139 | new_pieces = ret_pieces 140 | 141 | return new_pieces 142 | 143 | 144 | def encode_ids(sp_model, text, sample=False): 145 | pieces = encode_pieces(sp_model, text, return_unicode=False, sample=sample) 146 | ids = [sp_model.PieceToId(piece) for piece in pieces] 147 | return ids 148 | 149 | 150 | def convert_to_unicode(text): 151 | """Converts `text` to Unicode (if it's not already), 
assuming utf-8 input.""" 152 | if six.PY3: 153 | if isinstance(text, str): 154 | return text 155 | elif isinstance(text, bytes): 156 | return six.ensure_text(text, "utf-8", "ignore") 157 | else: 158 | raise ValueError("Unsupported string type: %s" % (type(text))) 159 | elif six.PY2: 160 | if isinstance(text, str): 161 | return six.ensure_text(text, "utf-8", "ignore") 162 | elif isinstance(text, six.text_type): 163 | return text 164 | else: 165 | raise ValueError("Unsupported string type: %s" % (type(text))) 166 | else: 167 | raise ValueError("Not running on Python2 or Python 3?") 168 | 169 | 170 | def printable_text(text): 171 | """Returns text encoded in a way suitable for print or `tf.logging`.""" 172 | 173 | # These functions want `str` for both Python2 and Python3, but in one case 174 | # it's a Unicode string and in the other it's a byte string. 175 | if six.PY3: 176 | if isinstance(text, str): 177 | return text 178 | elif isinstance(text, bytes): 179 | return six.ensure_text(text, "utf-8", "ignore") 180 | else: 181 | raise ValueError("Unsupported string type: %s" % (type(text))) 182 | elif six.PY2: 183 | if isinstance(text, str): 184 | return text 185 | elif isinstance(text, six.text_type): 186 | return six.ensure_binary(text, "utf-8") 187 | else: 188 | raise ValueError("Unsupported string type: %s" % (type(text))) 189 | else: 190 | raise ValueError("Not running on Python2 or Python 3?") 191 | 192 | 193 | def load_vocab(vocab_file): 194 | """Loads a vocabulary file into a dictionary.""" 195 | vocab = collections.OrderedDict() 196 | with tf.gfile.GFile(vocab_file, "r") as reader: 197 | while True: 198 | token = convert_to_unicode(reader.readline()) 199 | if not token: 200 | break 201 | token = token.strip()#.split()[0] 202 | if token not in vocab: 203 | vocab[token] = len(vocab) 204 | return vocab 205 | 206 | 207 | def convert_by_vocab(vocab, items): 208 | """Converts a sequence of [tokens|ids] using the vocab.""" 209 | output = [] 210 | for item in items: 211 | output.append(vocab[item]) 212 | return output 213 | 214 | 215 | def convert_tokens_to_ids(vocab, tokens): 216 | return convert_by_vocab(vocab, tokens) 217 | 218 | 219 | def convert_ids_to_tokens(inv_vocab, ids): 220 | return convert_by_vocab(inv_vocab, ids) 221 | 222 | 223 | def whitespace_tokenize(text): 224 | """Runs basic whitespace cleaning and splitting on a piece of text.""" 225 | text = text.strip() 226 | if not text: 227 | return [] 228 | tokens = text.split() 229 | return tokens 230 | 231 | 232 | class FullTokenizer(object): 233 | """Runs end-to-end tokenziation.""" 234 | 235 | def __init__(self, vocab_file, do_lower_case=True, spm_model_file=None): 236 | self.vocab = None 237 | self.sp_model = None 238 | if spm_model_file: 239 | self.sp_model = spm.SentencePieceProcessor() 240 | tf.logging.info("loading sentence piece model") 241 | self.sp_model.Load(spm_model_file) 242 | # Note(mingdachen): For the purpose of consisent API, we are 243 | # generating a vocabulary for the sentence piece tokenizer. 
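      # (The resulting vocab maps every sentencepiece piece string to its integer id.)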
244 | self.vocab = {self.sp_model.IdToPiece(i): i for i 245 | in range(self.sp_model.GetPieceSize())} 246 | else: 247 | self.vocab = load_vocab(vocab_file) 248 | self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case) 249 | self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) 250 | self.inv_vocab = {v: k for k, v in self.vocab.items()} 251 | 252 | @classmethod 253 | def from_scratch(cls, vocab_file, do_lower_case, spm_model_file): 254 | return FullTokenizer(vocab_file, do_lower_case, spm_model_file) 255 | 256 | # @classmethod 257 | # def from_hub_module(cls, hub_module, spm_model_file): 258 | # """Get the vocab file and casing info from the Hub module.""" 259 | # with tf.Graph().as_default(): 260 | # albert_module = hub.Module(hub_module) 261 | # tokenization_info = albert_module(signature="tokenization_info", 262 | # as_dict=True) 263 | # with tf.Session() as sess: 264 | # vocab_file, do_lower_case = sess.run( 265 | # [tokenization_info["vocab_file"], 266 | # tokenization_info["do_lower_case"]]) 267 | # return FullTokenizer( 268 | # vocab_file=vocab_file, do_lower_case=do_lower_case, 269 | # spm_model_file=spm_model_file) 270 | 271 | def tokenize(self, text): 272 | if self.sp_model: 273 | split_tokens = encode_pieces(self.sp_model, text, return_unicode=False) 274 | else: 275 | split_tokens = [] 276 | for token in self.basic_tokenizer.tokenize(text): 277 | for sub_token in self.wordpiece_tokenizer.tokenize(token): 278 | split_tokens.append(sub_token) 279 | 280 | return split_tokens 281 | 282 | def convert_tokens_to_ids(self, tokens): 283 | if self.sp_model: 284 | tf.logging.info("using sentence piece tokenzier.") 285 | return [self.sp_model.PieceToId( 286 | printable_text(token)) for token in tokens] 287 | else: 288 | return convert_by_vocab(self.vocab, tokens) 289 | 290 | def convert_ids_to_tokens(self, ids): 291 | if self.sp_model: 292 | tf.logging.info("using sentence piece tokenzier.") 293 | return [self.sp_model.IdToPiece(id_) for id_ in ids] 294 | else: 295 | return convert_by_vocab(self.inv_vocab, ids) 296 | 297 | 298 | class BasicTokenizer(object): 299 | """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" 300 | 301 | def __init__(self, do_lower_case=True): 302 | """Constructs a BasicTokenizer. 303 | 304 | Args: 305 | do_lower_case: Whether to lower case the input. 306 | """ 307 | self.do_lower_case = do_lower_case 308 | 309 | def tokenize(self, text): 310 | """Tokenizes a piece of text.""" 311 | text = convert_to_unicode(text) 312 | text = self._clean_text(text) 313 | 314 | # This was added on November 1st, 2018 for the multilingual and Chinese 315 | # models. This is also applied to the English models now, but it doesn't 316 | # matter since the English models were not trained on any Chinese data 317 | # and generally don't have any Chinese data in them (there are Chinese 318 | # characters in the vocabulary because Wikipedia does have some Chinese 319 | # words in the English Wikipedia.). 
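    # Illustration: "我喜欢NLP" becomes " 我  喜  欢 NLP", so each CJK character is
    # later split off as its own whitespace-delimited token before WordPiece runs.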
320 | text = self._tokenize_chinese_chars(text) 321 | 322 | orig_tokens = whitespace_tokenize(text) 323 | split_tokens = [] 324 | for token in orig_tokens: 325 | if self.do_lower_case: 326 | token = token.lower() 327 | token = self._run_strip_accents(token) 328 | split_tokens.extend(self._run_split_on_punc(token)) 329 | 330 | output_tokens = whitespace_tokenize(" ".join(split_tokens)) 331 | return output_tokens 332 | 333 | def _run_strip_accents(self, text): 334 | """Strips accents from a piece of text.""" 335 | text = unicodedata.normalize("NFD", text) 336 | output = [] 337 | for char in text: 338 | cat = unicodedata.category(char) 339 | if cat == "Mn": 340 | continue 341 | output.append(char) 342 | return "".join(output) 343 | 344 | def _run_split_on_punc(self, text): 345 | """Splits punctuation on a piece of text.""" 346 | chars = list(text) 347 | i = 0 348 | start_new_word = True 349 | output = [] 350 | while i < len(chars): 351 | char = chars[i] 352 | if _is_punctuation(char): 353 | output.append([char]) 354 | start_new_word = True 355 | else: 356 | if start_new_word: 357 | output.append([]) 358 | start_new_word = False 359 | output[-1].append(char) 360 | i += 1 361 | 362 | return ["".join(x) for x in output] 363 | 364 | def _tokenize_chinese_chars(self, text): 365 | """Adds whitespace around any CJK character.""" 366 | output = [] 367 | for char in text: 368 | cp = ord(char) 369 | if self._is_chinese_char(cp): 370 | output.append(" ") 371 | output.append(char) 372 | output.append(" ") 373 | else: 374 | output.append(char) 375 | return "".join(output) 376 | 377 | def _is_chinese_char(self, cp): 378 | """Checks whether CP is the codepoint of a CJK character.""" 379 | # This defines a "chinese character" as anything in the CJK Unicode block: 380 | # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) 381 | # 382 | # Note that the CJK Unicode block is NOT all Japanese and Korean characters, 383 | # despite its name. The modern Korean Hangul alphabet is a different block, 384 | # as is Japanese Hiragana and Katakana. Those alphabets are used to write 385 | # space-separated words, so they are not treated specially and handled 386 | # like the all of the other languages. 387 | if ((cp >= 0x4E00 and cp <= 0x9FFF) or # 388 | (cp >= 0x3400 and cp <= 0x4DBF) or # 389 | (cp >= 0x20000 and cp <= 0x2A6DF) or # 390 | (cp >= 0x2A700 and cp <= 0x2B73F) or # 391 | (cp >= 0x2B740 and cp <= 0x2B81F) or # 392 | (cp >= 0x2B820 and cp <= 0x2CEAF) or 393 | (cp >= 0xF900 and cp <= 0xFAFF) or # 394 | (cp >= 0x2F800 and cp <= 0x2FA1F)): # 395 | return True 396 | 397 | return False 398 | 399 | def _clean_text(self, text): 400 | """Performs invalid character removal and whitespace cleanup on text.""" 401 | output = [] 402 | for char in text: 403 | cp = ord(char) 404 | if cp == 0 or cp == 0xfffd or _is_control(char): 405 | continue 406 | if _is_whitespace(char): 407 | output.append(" ") 408 | else: 409 | output.append(char) 410 | return "".join(output) 411 | 412 | 413 | class WordpieceTokenizer(object): 414 | """Runs WordPiece tokenziation.""" 415 | 416 | def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200): 417 | self.vocab = vocab 418 | self.unk_token = unk_token 419 | self.max_input_chars_per_word = max_input_chars_per_word 420 | 421 | def tokenize(self, text): 422 | """Tokenizes a piece of text into its word pieces. 423 | 424 | This uses a greedy longest-match-first algorithm to perform tokenization 425 | using the given vocabulary. 
426 | 427 | For example: 428 | input = "unaffable" 429 | output = ["un", "##aff", "##able"] 430 | 431 | Args: 432 | text: A single token or whitespace separated tokens. This should have 433 | already been passed through `BasicTokenizer. 434 | 435 | Returns: 436 | A list of wordpiece tokens. 437 | """ 438 | 439 | text = convert_to_unicode(text) 440 | 441 | output_tokens = [] 442 | for token in whitespace_tokenize(text): 443 | chars = list(token) 444 | if len(chars) > self.max_input_chars_per_word: 445 | output_tokens.append(self.unk_token) 446 | continue 447 | 448 | is_bad = False 449 | start = 0 450 | sub_tokens = [] 451 | while start < len(chars): 452 | end = len(chars) 453 | cur_substr = None 454 | while start < end: 455 | substr = "".join(chars[start:end]) 456 | if start > 0: 457 | substr = "##" + six.ensure_str(substr) 458 | if substr in self.vocab: 459 | cur_substr = substr 460 | break 461 | end -= 1 462 | if cur_substr is None: 463 | is_bad = True 464 | break 465 | sub_tokens.append(cur_substr) 466 | start = end 467 | 468 | if is_bad: 469 | output_tokens.append(self.unk_token) 470 | else: 471 | output_tokens.extend(sub_tokens) 472 | return output_tokens 473 | 474 | 475 | def _is_whitespace(char): 476 | """Checks whether `chars` is a whitespace character.""" 477 | # \t, \n, and \r are technically control characters but we treat them 478 | # as whitespace since they are generally considered as such. 479 | if char == " " or char == "\t" or char == "\n" or char == "\r": 480 | return True 481 | cat = unicodedata.category(char) 482 | if cat == "Zs": 483 | return True 484 | return False 485 | 486 | 487 | def _is_control(char): 488 | """Checks whether `chars` is a control character.""" 489 | # These are technically control characters but we count them as whitespace 490 | # characters. 491 | if char == "\t" or char == "\n" or char == "\r": 492 | return False 493 | cat = unicodedata.category(char) 494 | if cat in ("Cc", "Cf"): 495 | return True 496 | return False 497 | 498 | 499 | def _is_punctuation(char): 500 | """Checks whether `chars` is a punctuation character.""" 501 | cp = ord(char) 502 | # We treat all non-letter/number ASCII as punctuation. 503 | # Characters such as "^", "$", and "`" are not in the Unicode 504 | # Punctuation class but we treat them as punctuation anyways, for 505 | # consistency. 
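  # The four ASCII ranges below cover !"#$%&'()*+,-./  :;<=>?@  [\]^_`  and {|}~ .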
506 | if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or 507 | (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)): 508 | return True 509 | cat = unicodedata.category(char) 510 | if cat.startswith("P"): 511 | return True 512 | return False 513 | -------------------------------------------------------------------------------- /sentiment_analysis_albert/train.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Thu May 30 21:42:07 2019 4 | 5 | @author: cm 6 | """ 7 | 8 | 9 | import os 10 | #os.environ["CUDA_VISIBLE_DEVICES"] = '-1' 11 | import numpy as np 12 | import tensorflow as tf 13 | from sentiment_analysis_albert.classifier_utils import get_features,get_features_test 14 | from sentiment_analysis_albert.networks import NetworkAlbert 15 | from sentiment_analysis_albert.hyperparameters import Hyperparamters as hp 16 | from sentiment_analysis_albert.utils import shuffle_one,select,time_now_string 17 | 18 | 19 | # Load Model 20 | pwd = os.path.dirname(os.path.abspath(__file__)) 21 | MODEL = NetworkAlbert(is_training=True ) 22 | 23 | # Get data features 24 | input_ids,input_masks,segment_ids,label_ids = get_features() 25 | input_ids_test,input_masks_test,segment_ids_test,label_ids_test = get_features_test() 26 | num_train_samples = len(input_ids) 27 | arr = np.arange(num_train_samples) 28 | num_batchs = int((num_train_samples - 1)/hp.batch_size) + 1 29 | print('number of batches:',num_batchs) 30 | ids_test = np.arange(len(input_ids_test)) 31 | 32 | # Set up the graph 33 | saver = tf.train.Saver(max_to_keep=hp.max_to_keep) 34 | sess = tf.Session() 35 | sess.run(tf.global_variables_initializer()) 36 | 37 | # Load model saved before 38 | MODEL_SAVE_PATH = os.path.join(pwd, hp.file_save_model) 39 | ckpt = tf.train.get_checkpoint_state(MODEL_SAVE_PATH) 40 | if ckpt and ckpt.model_checkpoint_path: 41 | saver.restore(sess, ckpt.model_checkpoint_path) 42 | print('Restored model!') 43 | 44 | 45 | with sess.as_default(): 46 | # Tensorboard writer 47 | writer = tf.summary.FileWriter(hp.logdir, sess.graph) 48 | for i in range(hp.num_train_epochs): 49 | indexs = shuffle_one(arr) 50 | for j in range(num_batchs-1): 51 | i1 = indexs[j * hp.batch_size:min((j + 1) * hp.batch_size, num_train_samples)] 52 | # Get features 53 | input_id_ = select(input_ids,i1) 54 | input_mask_ = select(input_masks,i1) 55 | segment_id_ = select(segment_ids,i1) 56 | label_id_ = select(label_ids,i1) 57 | # Feed dict 58 | fd = {MODEL.input_ids: input_id_, 59 | MODEL.input_masks: input_mask_, 60 | MODEL.segment_ids:segment_id_, 61 | MODEL.label_ids:label_id_} 62 | # Optimizer 63 | sess.run(MODEL.optimizer, feed_dict = fd) 64 | # Tensorboard 65 | if j%hp.summary_step==0: 66 | summary,global_step = sess.run([MODEL.merged,MODEL.global_step], feed_dict = fd) 67 | writer.add_summary(summary, global_step) 68 | # Save Model 69 | if j%(num_batchs//hp.num_saved_per_epoch)==0: 70 | if not os.path.exists(os.path.join(pwd, hp.file_save_model)): 71 | os.makedirs(os.path.join(pwd, hp.file_save_model)) 72 | saver.save(sess, os.path.join(pwd, hp.file_save_model, 'model_%s_%s.ckpt'%(str(i),str(j)))) 73 | # Log 74 | if j % hp.print_step == 0: 75 | # Loss of Train data 76 | fd = {MODEL.input_ids: input_id_, 77 | MODEL.input_masks: input_mask_ , 78 | MODEL.segment_ids:segment_id_, 79 | MODEL.label_ids:label_id_} 80 | loss = sess.run(MODEL.loss, feed_dict = fd) 81 | print('Time:%s, Epoch:%s, Batch number:%s/%s,
Loss:%s'%(time_now_string(),str(i),str(j),str(num_batchs),str(loss))) 82 | # Loss of Test data 83 | indexs_test = shuffle_one(ids_test)[:hp.batch_size_eval] 84 | input_id_test = select(input_ids_test,indexs_test) 85 | input_mask_test = select(input_masks_test,indexs_test) 86 | segment_id_test = select(segment_ids_test,indexs_test) 87 | label_id_test = select(label_ids_test,indexs_test) 88 | fd_test = {MODEL.input_ids:input_id_test, 89 | MODEL.input_masks:input_mask_test , 90 | MODEL.segment_ids:segment_id_test, 91 | MODEL.label_ids:label_id_test} 92 | loss = sess.run(MODEL.loss, feed_dict = fd_test) 93 | print('Time:%s, Epoch:%s, Batch number:%s/%s, Loss(test):%s'%(time_now_string(),str(i),str(j),str(num_batchs),str(loss))) 94 | print('Optimization finished') 95 | 96 | 97 | 98 | -------------------------------------------------------------------------------- /sentiment_analysis_albert/utils.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Thu May 29 20:40:40 2019 4 | 5 | @author: cm 6 | """ 7 | 8 | import time 9 | import pandas as pd 10 | import numpy as np 11 | 12 | 13 | 14 | def time_now_string(): 15 | return time.strftime("%Y-%m-%d %H:%M:%S",time.localtime( time.time() )) 16 | 17 | 18 | def load_csv(file,header=0,encoding="utf-8"): 19 | return pd.read_csv(file, 20 | encoding=encoding, 21 | header=header, 22 | error_bad_lines=False) 23 | 24 | 25 | def save_csv(dataframe,file,header=True,index=None,encoding="utf-8"): 26 | return dataframe.to_csv(file, 27 | mode='w+', 28 | header=header, 29 | index=index, 30 | encoding=encoding) 31 | 32 | def save_excel(dataframe,file,header=True,sheetname='Sheet1'): 33 | return dataframe.to_excel(file, 34 | header=header, 35 | sheet_name=sheetname) 36 | 37 | def load_excel(file,header=0,sheetname=None): 38 | dfs = pd.read_excel(file, 39 | header=header, 40 | sheet_name=sheetname) 41 | sheet_names = list(dfs.keys()) 42 | print('Name of first sheet:',sheet_names[0]) 43 | df = dfs[sheet_names[0]] 44 | print('Load excel data finished!') 45 | return df.fillna("") 46 | 47 | def load_txt(file): 48 | with open(file,encoding='utf-8',errors='ignore') as fp: 49 | lines = fp.readlines() 50 | lines = [l.strip() for l in lines] 51 | return lines 52 | 53 | 54 | def save_txt(file,lines): 55 | lines = [l+'\n' for l in lines] 56 | with open(file,'w+',encoding='utf-8') as fp:#a+添加 57 | fp.writelines(lines) 58 | 59 | 60 | def select(data,ids): 61 | return [data[i] for i in ids] 62 | 63 | def shuffle_one(a1): 64 | ran = np.arange(len(a1)) 65 | np.random.shuffle(ran) 66 | a1_ = [a1[l] for l in ran] 67 | return a1_ 68 | 69 | 70 | def shuffle_two(a1,a2): 71 | """ 72 | 随机打乱a1和a2两个 73 | """ 74 | ran = np.arange(len(a1)) 75 | np.random.shuffle(ran) 76 | a1_ = [a1[l] for l in ran] 77 | a2_ = [a2[l] for l in ran] 78 | return a1_, a2_ 79 | 80 | def load_vocabulary(file_vocabulary_label): 81 | """ 82 | Load vocabulary to dict 83 | """ 84 | vocabulary = load_txt(file_vocabulary_label) 85 | dic = {} 86 | for i,l in enumerate(vocabulary): 87 | dic[str(i)] = str(l) 88 | return dic 89 | 90 | 91 | if __name__ == "__main__": 92 | print(time_now_string()) 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/README.md: -------------------------------------------------------------------------------- 1 | # 简介 2 | 1、本项目是在tensorflow版本1.15.0的基础上做的训练和测试。 3 | 
2、本项目为中文的文本情感分析,为多文本分类,一共3个标签:1、0、-1,分别表示正面、中面和负面的情感。 4 | 3、albert_small_zh_google对应的百度云下载地址: 5 | 链接:https://pan.baidu.com/s/1RKzGJTazlZ7y12YRbAWvyA 6 | 提取码:wuxw 7 | 8 | # 使用方法 9 | 1、准备数据 10 | 数据格式为:data/sa_train.csv(训练), data/sa_test.csv(测试) 11 | 2、参数设置 12 | 参考脚本 hyperparameters.py,直接修改里面的数值即可。 13 | 3、训练 14 | python train.py 15 | 4、推理 python predict.py 16 | 17 | # 知乎代码解读 18 | https://zhuanlan.zhihu.com/p/338806367 19 | -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/__init__.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/albert_base_zh/README.md: -------------------------------------------------------------------------------- 1 | # 语言模型 2 | 3 | ## ALBERT Base Chinese 4 | - albert_base_zh/albert_config.json 5 | - albert_base_zh/albert_model.ckpt.data-00000-of-00001 6 | - albert_base_zh/albert_model.ckpt.index 7 | - albert_base_zh/albert_model.ckpt.meta 8 | - albert_base_zh/checkpoint 9 | - albert_base_zh/vocab_chinese.txt 10 | - albert_base_zh/vocab_emoji.txt 11 | 12 | ## 下载路径 13 | 链接:https://pan.baidu.com/s/1BuXZyj1VmlvX60agv0cE5A?pwd=b9aq 14 | 提取码:b9aq 15 | -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/albert_small_zh_google/README.md: -------------------------------------------------------------------------------- 1 | # 语言模型: ALBERT Base Chinese 2 | -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/classifier_utils.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Nov 12 14:23:12 2020 4 | 5 | @author: cm 6 | """ 7 | 8 | 9 | 10 | import os 11 | import collections 12 | import tensorflow_hub as hub 13 | import tensorflow.compat.v1 as tf 14 | from tensorflow.contrib import tpu as contrib_tpu 15 | from tensorflow.contrib import data as contrib_data 16 | from tensorflow.contrib import metrics as contrib_metrics 17 | 18 | from sentiment_analysis_albert_emoji import modeling 19 | from sentiment_analysis_albert_emoji import optimization 20 | from sentiment_analysis_albert_emoji import tokenization 21 | from sentiment_analysis_albert_emoji.utils import load_csv 22 | from sentiment_analysis_albert_emoji.utils import get_emoji 23 | from sentiment_analysis_albert_emoji.hyperparameters import Hyperparamters as hp 24 | 25 | 26 | 27 | def index2label(index): 28 | return hp.dict_label[str(index)] 29 | 30 | 31 | def emoji2id(string_emoji): 32 | """ 33 | 将emoji转为id形式,设置sequence_length=hp.sequence_length 34 | """ 35 | length = len(string_emoji) 36 | if length <= hp.sequence_length_emoji: 37 | return 
[hp.vocab_emoji_char2id.get(l,0) for l in string_emoji]+[0]*(hp.sequence_length_emoji-length) 38 | elif length > hp.sequence_length_emoji: 39 | return [hp.vocab_emoji_char2id.get(l,0) for l in string_emoji[:hp.sequence_length_emoji]] 40 | 41 | 42 | def read_csv(input_file): 43 | """Reads a tab separated value file.""" 44 | df = load_csv(input_file) 45 | jobcontent = df['content'].tolist() 46 | jlabel = df['label'].tolist() 47 | lines = [[str(jlabel[i]),str(jobcontent[i])] for i in range(len(jobcontent))] 48 | print('Read csv finished!(1)') 49 | lines2 = [ [list(hp.dict_label.keys())[list(hp.dict_label.values()).index( l[0])], l[1]] for l in lines if type(l[1])==str] 50 | return lines2 51 | 52 | 53 | 54 | class InputExample(object): 55 | """A single training/test example for simple sequence classification.""" 56 | 57 | def __init__(self, guid, text_a, text_b=None, label=None): 58 | """Constructs a InputExample. 59 | 60 | Args: 61 | guid: Unique id for the example. 62 | text_a: string. The untokenized text of the first sequence. For single 63 | sequence tasks, only this sequence must be specified. 64 | text_b: (Optional) string. The untokenized text of the second sequence. 65 | Only must be specified for sequence pair tasks. 66 | label: (Optional) string. The label of the example. This should be 67 | specified for train and dev examples, but not for test examples. 68 | """ 69 | self.guid = guid 70 | self.text_a = text_a 71 | self.text_b = text_b 72 | self.label = label 73 | 74 | 75 | class PaddingInputExample(object): 76 | """Fake example so the num input examples is a multiple of the batch size. 77 | 78 | When running eval/predict on the TPU, we need to pad the number of examples 79 | to be a multiple of the batch size, because the TPU requires a fixed batch 80 | size. The alternative is to drop the last batch, which is bad because it means 81 | the entire output data won't be generated. 82 | 83 | We use this class instead of `None` because treating `None` as padding 84 | battches could cause silent errors. 
85 | """ 86 | 87 | 88 | class InputFeatures(object): 89 | """A single set of features of data.""" 90 | 91 | def __init__(self, 92 | input_ids, 93 | input_mask, 94 | segment_ids, 95 | label_id, 96 | guid=None, 97 | example_id=None, 98 | is_real_example=True): 99 | self.input_ids = input_ids 100 | self.input_mask = input_mask 101 | self.segment_ids = segment_ids 102 | self.label_id = label_id 103 | self.example_id = example_id 104 | self.guid = guid 105 | self.is_real_example = is_real_example 106 | 107 | 108 | class DataProcessor(object): 109 | """Base class for data converters for sequence classification data sets.""" 110 | 111 | def __init__(self, use_spm, do_lower_case): 112 | super(DataProcessor, self).__init__() 113 | self.use_spm = use_spm 114 | self.do_lower_case = do_lower_case 115 | 116 | def get_train_examples(self, data_dir): 117 | """Gets a collection of `InputExample`s for the train set.""" 118 | raise NotImplementedError() 119 | 120 | def get_dev_examples(self, data_dir): 121 | """Gets a collection of `InputExample`s for the dev set.""" 122 | raise NotImplementedError() 123 | 124 | def get_test_examples(self, data_dir): 125 | """Gets a collection of `InputExample`s for prediction.""" 126 | raise NotImplementedError() 127 | 128 | def get_labels(self): 129 | """Gets the list of labels for this data set.""" 130 | raise NotImplementedError() 131 | 132 | @classmethod 133 | def _read_csv(cls,input_file): 134 | """Reads a tab separated value file.""" 135 | df = load_csv(input_file) 136 | jobcontent = df['content'].tolist() 137 | jlabel = df['label'].tolist() 138 | lines = [[str(jlabel[i]),str(jobcontent[i])] for i in range(len(jobcontent))] 139 | print('Length of data:',len(lines)) 140 | lines2 = [ [list(hp.dict_label.keys())[list(hp.dict_label.values()).index( l[0])], l[1]] for l in lines if type(l[1])==str] 141 | return lines2 142 | 143 | 144 | class ClassifyProcessor(DataProcessor): 145 | """Processor for the MRPC data set (GLUE version).""" 146 | 147 | def __init__(self): 148 | self.labels = set() 149 | 150 | def get_train_examples(self, data_dir): 151 | """See base class.""" 152 | return self._create_examples( 153 | self._read_csv(os.path.join(data_dir, hp.train_data)), "train") 154 | 155 | def get_dev_examples(self, data_dir): 156 | """See base class.""" 157 | return self._create_examples( 158 | self._read_csv(os.path.join(data_dir, hp.test_data)), "dev") 159 | 160 | def get_test_examples(self, data_dir): 161 | """See base class.""" 162 | return self._create_examples( 163 | self._read_csv(os.path.join(data_dir, hp.test_data)), "test") 164 | 165 | def get_labels(self): 166 | """See base class.""" 167 | return ['0','1','2'] 168 | 169 | def _create_examples(self, lines, set_type): 170 | """Creates examples for the training and dev sets.""" 171 | examples = [] 172 | for (i, line) in enumerate(lines): 173 | guid = "%s-%s" % (set_type, i) 174 | text_a = tokenization.convert_to_unicode(line[1]) 175 | label = tokenization.convert_to_unicode(line[0]) 176 | self.labels.add(label) 177 | examples.append( 178 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) 179 | return examples 180 | 181 | 182 | def convert_single_example(ex_index, example, label_list, max_seq_length, 183 | tokenizer, task_name): 184 | """Converts a single `InputExample` into a single `InputFeatures`.""" 185 | 186 | if isinstance(example, PaddingInputExample): 187 | return InputFeatures( 188 | input_ids=[0] * max_seq_length, 189 | input_mask=[0] * max_seq_length, 190 | segment_ids=[0] * max_seq_length, 
191 | label_id=0, 192 | is_real_example=False) 193 | 194 | if task_name != "sts-b": 195 | label_map = {} 196 | for (i, label) in enumerate(label_list): 197 | label_map[label] = i 198 | 199 | tokens_a = tokenizer.tokenize(example.text_a) 200 | tokens_b = None 201 | if example.text_b: 202 | tokens_b = tokenizer.tokenize(example.text_b) 203 | 204 | if tokens_b: 205 | # Modifies `tokens_a` and `tokens_b` in place so that the total 206 | # length is less than the specified length. 207 | # Account for [CLS], [SEP], [SEP] with "- 3" 208 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 209 | else: 210 | # Account for [CLS] and [SEP] with "- 2" 211 | if len(tokens_a) > max_seq_length - 2: 212 | tokens_a = tokens_a[0:(max_seq_length - 2)] 213 | 214 | # The convention in ALBERT is: 215 | # (a) For sequence pairs: 216 | # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] 217 | # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 218 | # (b) For single sequences: 219 | # tokens: [CLS] the dog is hairy . [SEP] 220 | # type_ids: 0 0 0 0 0 0 0 221 | # 222 | # Where "type_ids" are used to indicate whether this is the first 223 | # sequence or the second sequence. The embedding vectors for `type=0` and 224 | # `type=1` were learned during pre-training and are added to the 225 | # embedding vector (and position vector). This is not *strictly* necessary 226 | # since the [SEP] token unambiguously separates the sequences, but it makes 227 | # it easier for the model to learn the concept of sequences. 228 | # 229 | # For classification tasks, the first vector (corresponding to [CLS]) is 230 | # used as the "sentence vector". Note that this only makes sense because 231 | # the entire model is fine-tuned. 232 | tokens = [] 233 | segment_ids = [] 234 | tokens.append("[CLS]") 235 | segment_ids.append(0) 236 | for token in tokens_a: 237 | tokens.append(token) 238 | segment_ids.append(0) 239 | tokens.append("[SEP]") 240 | segment_ids.append(0) 241 | 242 | if tokens_b: 243 | for token in tokens_b: 244 | tokens.append(token) 245 | segment_ids.append(1) 246 | tokens.append("[SEP]") 247 | segment_ids.append(1) 248 | 249 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 250 | 251 | # The mask has 1 for real tokens and 0 for padding tokens. Only real 252 | # tokens are attended to. 253 | input_mask = [1] * len(input_ids) 254 | 255 | # Zero-pad up to the sequence length. 
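  # Illustration: with max_seq_length = 8, a 3-token example (hypothetical ids) ends up as
  #   input_ids  = [id_cls, id_tok, id_sep, 0, 0, 0, 0, 0]
  #   input_mask = [1, 1, 1, 0, 0, 0, 0, 0]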
256 | while len(input_ids) < max_seq_length: 257 | input_ids.append(0) 258 | input_mask.append(0) 259 | segment_ids.append(0) 260 | 261 | assert len(input_ids) == max_seq_length 262 | assert len(input_mask) == max_seq_length 263 | assert len(segment_ids) == max_seq_length 264 | 265 | if task_name != "sts-b": 266 | label_id = label_map[example.label] 267 | else: 268 | label_id = example.label 269 | 270 | # if ex_index < 5: 271 | # tf.logging.info("*** Example ***") 272 | # tf.logging.info("guid: %s" % (example.guid)) 273 | # tf.logging.info("tokens: %s" % " ".join( 274 | # [tokenization.printable_text(x) for x in tokens])) 275 | # tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 276 | # tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 277 | # tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 278 | # tf.logging.info("label: %s (id = %d)" % (example.label, label_id)) 279 | 280 | feature = InputFeatures( 281 | input_ids=input_ids, 282 | input_mask=input_mask, 283 | segment_ids=segment_ids, 284 | label_id=label_id, 285 | is_real_example=True) 286 | return feature 287 | 288 | 289 | def file_based_convert_examples_to_features( 290 | examples, label_list, max_seq_length, tokenizer, output_file, task_name): 291 | """Convert a set of `InputExample`s to a TFRecord file.""" 292 | 293 | writer = tf.python_io.TFRecordWriter(output_file) 294 | 295 | for (ex_index, example) in enumerate(examples): 296 | if ex_index % 10000 == 0: 297 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 298 | 299 | feature = convert_single_example(ex_index, example, label_list, 300 | max_seq_length, tokenizer, task_name) 301 | 302 | def create_int_feature(values): 303 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) 304 | return f 305 | 306 | def create_float_feature(values): 307 | f = tf.train.Feature(float_list=tf.train.FloatList(value=list(values))) 308 | return f 309 | 310 | features = collections.OrderedDict() 311 | features["input_ids"] = create_int_feature(feature.input_ids) 312 | features["input_mask"] = create_int_feature(feature.input_mask) 313 | features["segment_ids"] = create_int_feature(feature.segment_ids) 314 | features["label_ids"] = create_float_feature([feature.label_id])\ 315 | if task_name == "sts-b" else create_int_feature([feature.label_id]) 316 | features["is_real_example"] = create_int_feature( 317 | [int(feature.is_real_example)]) 318 | 319 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 320 | writer.write(tf_example.SerializeToString()) 321 | writer.close() 322 | 323 | 324 | def file_based_input_fn_builder(input_file, seq_length, is_training, 325 | drop_remainder, task_name, use_tpu, bsz, 326 | multiple=1): 327 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 328 | labeltype = tf.float32 if task_name == "sts-b" else tf.int64 329 | 330 | name_to_features = { 331 | "input_ids": tf.FixedLenFeature([seq_length * multiple], tf.int64), 332 | "input_mask": tf.FixedLenFeature([seq_length * multiple], tf.int64), 333 | "segment_ids": tf.FixedLenFeature([seq_length * multiple], tf.int64), 334 | "label_ids": tf.FixedLenFeature([], labeltype), 335 | "is_real_example": tf.FixedLenFeature([], tf.int64), 336 | } 337 | 338 | def _decode_record(record, name_to_features): 339 | """Decodes a record to a TensorFlow example.""" 340 | example = tf.parse_single_example(record, name_to_features) 341 | 342 | # tf.Example only supports tf.int64, but the 
TPU only supports tf.int32. 343 | # So cast all int64 to int32. 344 | for name in list(example.keys()): 345 | t = example[name] 346 | if t.dtype == tf.int64: 347 | t = tf.to_int32(t) 348 | example[name] = t 349 | 350 | return example 351 | 352 | def input_fn(params): 353 | """The actual input function.""" 354 | if use_tpu: 355 | batch_size = params["batch_size"] 356 | else: 357 | batch_size = bsz 358 | 359 | # For training, we want a lot of parallel reading and shuffling. 360 | # For eval, we want no shuffling and parallel reading doesn't matter. 361 | d = tf.data.TFRecordDataset(input_file) 362 | if is_training: 363 | d = d.repeat() 364 | d = d.shuffle(buffer_size=100) 365 | 366 | d = d.apply( 367 | contrib_data.map_and_batch( 368 | lambda record: _decode_record(record, name_to_features), 369 | batch_size=batch_size, 370 | drop_remainder=drop_remainder)) 371 | 372 | return d 373 | 374 | return input_fn 375 | 376 | 377 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 378 | """Truncates a sequence pair in place to the maximum length.""" 379 | 380 | # This is a simple heuristic which will always truncate the longer sequence 381 | # one token at a time. This makes more sense than truncating an equal percent 382 | # of tokens from each, since if one sequence is very short then each token 383 | # that's truncated likely contains more information than a longer sequence. 384 | while True: 385 | total_length = len(tokens_a) + len(tokens_b) 386 | if total_length <= max_length: 387 | break 388 | if len(tokens_a) > len(tokens_b): 389 | tokens_a.pop() 390 | else: 391 | tokens_b.pop() 392 | 393 | 394 | def _create_model_from_hub(hub_module, is_training, input_ids, input_mask, 395 | segment_ids): 396 | """Creates an ALBERT model from TF-Hub.""" 397 | tags = set() 398 | if is_training: 399 | tags.add("train") 400 | albert_module = hub.Module(hub_module, tags=tags, trainable=True) 401 | albert_inputs = dict( 402 | input_ids=input_ids, 403 | input_mask=input_mask, 404 | segment_ids=segment_ids) 405 | albert_outputs = albert_module( 406 | inputs=albert_inputs, 407 | signature="tokens", 408 | as_dict=True) 409 | output_layer = albert_outputs["pooled_output"] 410 | return output_layer 411 | 412 | 413 | def _create_model_from_scratch(albert_config, is_training, input_ids, 414 | input_mask, segment_ids, use_one_hot_embeddings): 415 | """Creates an ALBERT model from scratch (as opposed to hub).""" 416 | model = modeling.AlbertModel( 417 | config=albert_config, 418 | is_training=is_training, 419 | input_ids=input_ids, 420 | input_mask=input_mask, 421 | token_type_ids=segment_ids, 422 | use_one_hot_embeddings=use_one_hot_embeddings) 423 | output_layer = model.get_pooled_output() 424 | return output_layer 425 | 426 | 427 | def create_model(albert_config, is_training, input_ids, input_mask, segment_ids, 428 | labels, num_labels, use_one_hot_embeddings, task_name, 429 | hub_module): 430 | """Creates a classification model.""" 431 | if hub_module: 432 | tf.logging.info("creating model from hub_module: %s", hub_module) 433 | output_layer = _create_model_from_hub(hub_module, is_training, input_ids, 434 | input_mask, segment_ids) 435 | else: 436 | tf.logging.info("creating model from albert_config") 437 | output_layer = _create_model_from_scratch(albert_config, is_training, 438 | input_ids, input_mask, 439 | segment_ids, 440 | use_one_hot_embeddings) 441 | 442 | hidden_size = output_layer.shape[-1].value 443 | 444 | output_weights = tf.get_variable( 445 | "output_weights", [num_labels, hidden_size], 446 | 
initializer=tf.truncated_normal_initializer(stddev=0.02)) 447 | 448 | output_bias = tf.get_variable( 449 | "output_bias", [num_labels], initializer=tf.zeros_initializer()) 450 | 451 | with tf.variable_scope("loss"): 452 | if is_training: 453 | # I.e., 0.1 dropout 454 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 455 | 456 | logits = tf.matmul(output_layer, output_weights, transpose_b=True) 457 | logits = tf.nn.bias_add(logits, output_bias) 458 | if task_name != "sts-b": 459 | probabilities = tf.nn.softmax(logits, axis=-1) 460 | predictions = tf.argmax(probabilities, axis=-1, output_type=tf.int32) 461 | log_probs = tf.nn.log_softmax(logits, axis=-1) 462 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) 463 | 464 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 465 | else: 466 | probabilities = logits 467 | logits = tf.squeeze(logits, [-1]) 468 | predictions = logits 469 | per_example_loss = tf.square(logits - labels) 470 | loss = tf.reduce_mean(per_example_loss) 471 | 472 | return (loss, per_example_loss, probabilities, logits, predictions) 473 | 474 | 475 | def model_fn_builder(albert_config, num_labels, init_checkpoint, learning_rate, 476 | num_train_steps, num_warmup_steps, use_tpu, 477 | use_one_hot_embeddings, task_name, hub_module=None, 478 | optimizer="adamw"): 479 | """Returns `model_fn` closure for TPUEstimator.""" 480 | 481 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument 482 | """The `model_fn` for TPUEstimator.""" 483 | 484 | tf.logging.info("*** Features ***") 485 | for name in sorted(features.keys()): 486 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) 487 | 488 | input_ids = features["input_ids"] 489 | input_mask = features["input_mask"] 490 | segment_ids = features["segment_ids"] 491 | label_ids = features["label_ids"] 492 | is_real_example = None 493 | if "is_real_example" in features: 494 | is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32) 495 | else: 496 | is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32) 497 | 498 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 499 | 500 | (total_loss, per_example_loss, probabilities, logits, predictions) = \ 501 | create_model(albert_config, is_training, input_ids, input_mask, 502 | segment_ids, label_ids, num_labels, 503 | use_one_hot_embeddings, task_name, hub_module) 504 | 505 | tvars = tf.trainable_variables() 506 | initialized_variable_names = {} 507 | scaffold_fn = None 508 | if init_checkpoint: 509 | (assignment_map, initialized_variable_names 510 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) 511 | if use_tpu: 512 | 513 | def tpu_scaffold(): 514 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 515 | return tf.train.Scaffold() 516 | 517 | scaffold_fn = tpu_scaffold 518 | else: 519 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 520 | 521 | tf.logging.info("**** Trainable Variables ****") 522 | for var in tvars: 523 | init_string = "" 524 | if var.name in initialized_variable_names: 525 | init_string = ", *INIT_FROM_CKPT*" 526 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 527 | init_string) 528 | 529 | output_spec = None 530 | if mode == tf.estimator.ModeKeys.TRAIN: 531 | 532 | train_op = optimization.create_optimizer( 533 | total_loss, learning_rate, num_train_steps, num_warmup_steps, 534 | use_tpu, optimizer) 535 | 536 | output_spec = contrib_tpu.TPUEstimatorSpec( 537 | mode=mode, 538 | 
loss=total_loss, 539 | train_op=train_op, 540 | scaffold_fn=scaffold_fn) 541 | elif mode == tf.estimator.ModeKeys.EVAL: 542 | if task_name not in ["sts-b", "cola"]: 543 | def metric_fn(per_example_loss, label_ids, logits, is_real_example): 544 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 545 | accuracy = tf.metrics.accuracy( 546 | labels=label_ids, predictions=predictions, 547 | weights=is_real_example) 548 | loss = tf.metrics.mean( 549 | values=per_example_loss, weights=is_real_example) 550 | return { 551 | "eval_accuracy": accuracy, 552 | "eval_loss": loss, 553 | } 554 | elif task_name == "sts-b": 555 | def metric_fn(per_example_loss, label_ids, logits, is_real_example): 556 | """Compute Pearson correlations for STS-B.""" 557 | # Display labels and predictions 558 | concat1 = contrib_metrics.streaming_concat(logits) 559 | concat2 = contrib_metrics.streaming_concat(label_ids) 560 | 561 | # Compute Pearson correlation 562 | pearson = contrib_metrics.streaming_pearson_correlation( 563 | logits, label_ids, weights=is_real_example) 564 | 565 | # Compute MSE 566 | # mse = tf.metrics.mean(per_example_loss) 567 | mse = tf.metrics.mean_squared_error( 568 | label_ids, logits, weights=is_real_example) 569 | 570 | loss = tf.metrics.mean( 571 | values=per_example_loss, 572 | weights=is_real_example) 573 | 574 | return {"pred": concat1, "label_ids": concat2, "pearson": pearson, 575 | "MSE": mse, "eval_loss": loss,} 576 | elif task_name == "cola": 577 | def metric_fn(per_example_loss, label_ids, logits, is_real_example): 578 | """Compute Matthew's correlations for STS-B.""" 579 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 580 | # https://en.wikipedia.org/wiki/Matthews_correlation_coefficient 581 | tp, tp_op = tf.metrics.true_positives( 582 | predictions, label_ids, weights=is_real_example) 583 | tn, tn_op = tf.metrics.true_negatives( 584 | predictions, label_ids, weights=is_real_example) 585 | fp, fp_op = tf.metrics.false_positives( 586 | predictions, label_ids, weights=is_real_example) 587 | fn, fn_op = tf.metrics.false_negatives( 588 | predictions, label_ids, weights=is_real_example) 589 | 590 | # Compute Matthew's correlation 591 | mcc = tf.div_no_nan( 592 | tp * tn - fp * fn, 593 | tf.pow((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn), 0.5)) 594 | 595 | # Compute accuracy 596 | accuracy = tf.metrics.accuracy( 597 | labels=label_ids, predictions=predictions, 598 | weights=is_real_example) 599 | 600 | loss = tf.metrics.mean( 601 | values=per_example_loss, 602 | weights=is_real_example) 603 | 604 | return {"matthew_corr": (mcc, tf.group(tp_op, tn_op, fp_op, fn_op)), 605 | "eval_accuracy": accuracy, "eval_loss": loss,} 606 | 607 | eval_metrics = (metric_fn, 608 | [per_example_loss, label_ids, logits, is_real_example]) 609 | output_spec = contrib_tpu.TPUEstimatorSpec( 610 | mode=mode, 611 | loss=total_loss, 612 | eval_metrics=eval_metrics, 613 | scaffold_fn=scaffold_fn) 614 | else: 615 | output_spec = contrib_tpu.TPUEstimatorSpec( 616 | mode=mode, 617 | predictions={ 618 | "probabilities": probabilities, 619 | "predictions": predictions 620 | }, 621 | scaffold_fn=scaffold_fn) 622 | return output_spec 623 | 624 | return model_fn 625 | 626 | 627 | # This function is not used by this file but is still used by the Colab and 628 | # people who depend on it. 
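# Hypothetical usage sketch (not taken from this repository): `input_fn_builder`
# below and `convert_examples_to_features` further down are typically wired into
# a TF 1.x Estimator roughly as follows, where `estimator`, `train_examples` and
# `num_train_steps` are assumed to be defined by the calling code:
#
#   features = convert_examples_to_features(train_examples, label_list,
#                                            max_seq_length, tokenizer,
#                                            task_name='classify')
#   train_input_fn = input_fn_builder(features, seq_length=max_seq_length,
#                                     is_training=True, drop_remainder=True)
#   estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
#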
629 | def input_fn_builder(features, seq_length, is_training, drop_remainder): 630 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 631 | 632 | all_input_ids = [] 633 | all_input_mask = [] 634 | all_segment_ids = [] 635 | all_label_ids = [] 636 | 637 | for feature in features: 638 | all_input_ids.append(feature.input_ids) 639 | all_input_mask.append(feature.input_mask) 640 | all_segment_ids.append(feature.segment_ids) 641 | all_label_ids.append(feature.label_id) 642 | 643 | def input_fn(params): 644 | """The actual input function.""" 645 | batch_size = params["batch_size"] 646 | 647 | num_examples = len(features) 648 | 649 | # This is for demo purposes and does NOT scale to large data sets. We do 650 | # not use Dataset.from_generator() because that uses tf.py_func which is 651 | # not TPU compatible. The right way to load data is with TFRecordReader. 652 | d = tf.data.Dataset.from_tensor_slices({ 653 | "input_ids": 654 | tf.constant( 655 | all_input_ids, shape=[num_examples, seq_length], 656 | dtype=tf.int32), 657 | "input_mask": 658 | tf.constant( 659 | all_input_mask, 660 | shape=[num_examples, seq_length], 661 | dtype=tf.int32), 662 | "segment_ids": 663 | tf.constant( 664 | all_segment_ids, 665 | shape=[num_examples, seq_length], 666 | dtype=tf.int32), 667 | "label_ids": 668 | tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32), 669 | }) 670 | 671 | if is_training: 672 | d = d.repeat() 673 | d = d.shuffle(buffer_size=100) 674 | 675 | d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder) 676 | return d 677 | 678 | return input_fn 679 | 680 | 681 | # This function is not used by this file but is still used by the Colab and 682 | # people who depend on it. 683 | def convert_examples_to_features(examples, label_list, max_seq_length, 684 | tokenizer, task_name): 685 | """Convert a set of `InputExample`s to a list of `InputFeatures`.""" 686 | 687 | features = [] 688 | print('Length of examples:',len(examples)) 689 | for (ex_index, example) in enumerate(examples): 690 | if ex_index % 10000 == 0: 691 | #tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 692 | print("Writing example %d of %d" % (ex_index, len(examples))) 693 | feature = convert_single_example(ex_index, example, label_list, 694 | max_seq_length, tokenizer, task_name) 695 | 696 | features.append(feature) 697 | return features 698 | 699 | 700 | 701 | 702 | # Load parameters 703 | max_seq_length = hp.sequence_length 704 | do_lower_case = hp.do_lower_case 705 | vocab_file = hp.vocab_file 706 | tokenizer = tokenization.FullTokenizer.from_scratch(vocab_file=vocab_file, 707 | do_lower_case=do_lower_case, 708 | spm_model_file=None) 709 | processor = ClassifyProcessor() 710 | label_list = processor.get_labels() 711 | data_dir = hp.data_dir 712 | 713 | 714 | def get_features(): 715 | # Load train data 716 | train_examples = processor.get_train_examples(data_dir) 717 | # Get onehot feature 718 | features = convert_examples_to_features( train_examples, label_list, max_seq_length, tokenizer,task_name='classify') 719 | input_ids = [f.input_ids for f in features] 720 | input_masks = [f.input_mask for f in features] 721 | segment_ids = [f.segment_ids for f in features] 722 | label_ids = [f.label_id for f in features] 723 | print('Get features finished!') 724 | return input_ids,input_masks,segment_ids,label_ids 725 | 726 | 727 | def get_features_test(): 728 | # Load test data 729 | train_examples = processor.get_test_examples(data_dir) 730 | # Get onehot feature 731 | features = 
convert_examples_to_features( train_examples, label_list, max_seq_length, tokenizer,task_name='classify_test') 732 | input_ids = [f.input_ids for f in features] 733 | input_masks = [f.input_mask for f in features] 734 | segment_ids = [f.segment_ids for f in features] 735 | label_ids = [f.label_id for f in features] 736 | print('Get features(test) finished!') 737 | return input_ids,input_masks,segment_ids,label_ids 738 | 739 | 740 | def create_example(line,set_type): 741 | """Creates examples for the training and dev sets.""" 742 | guid = "%s-%s" % (set_type, 1) 743 | text_a = tokenization.convert_to_unicode(line[1]) 744 | label = tokenization.convert_to_unicode(line[0]) 745 | example = InputExample(guid=guid, text_a=text_a, text_b=None, label=label) 746 | return example 747 | 748 | 749 | def get_feature_test(sentence): 750 | example = create_example(['0',sentence],'test') 751 | feature = convert_single_example(0, example, label_list,max_seq_length, tokenizer,task_name='classify') 752 | return feature.input_ids,feature.input_mask,feature.segment_ids,feature.label_id 753 | 754 | 755 | def get_features_emoji(): 756 | # Load train data 757 | lines = read_csv(os.path.join(hp.data_dir, hp.train_data)) 758 | # Get features emoji 759 | contents = [l[1] for l in lines] 760 | contents_emoji = [get_emoji(l) for l in contents] 761 | return [emoji2id(l) for l in contents_emoji] 762 | 763 | 764 | def get_features_emoji_test(): 765 | # Load train data 766 | lines = read_csv(os.path.join(hp.data_dir, hp.test_data)) 767 | # Get features emoji 768 | contents = [l[1] for l in lines] 769 | contents_emoji = [get_emoji(l) for l in contents] 770 | return [emoji2id(l) for l in contents_emoji] 771 | 772 | 773 | def get_feature_emoji_test(sentence): 774 | # Get feature emoji 775 | content_emoji = get_emoji(sentence) 776 | return emoji2id(content_emoji) 777 | 778 | 779 | if __name__ == '__main__': 780 | ## 获取参数: Test 781 | sentence = '天天向上' 782 | feature = get_feature_test(sentence) 783 | print(feature) 784 | sentence = "⏲⌚豌豆🥳🥳🥳射手🥰🦰" 785 | feature = get_feature_emoji_test(sentence) 786 | print(feature) 787 | 788 | 789 | 790 | 791 | 792 | 793 | -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/data/README.md: -------------------------------------------------------------------------------- 1 | # 数据集 2 | - data/sa_train.csv 3 | - data/sa_test.csv 4 | -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/data/sa_test.csv: -------------------------------------------------------------------------------- 1 | content,label 2 | 新配色喜欢,1 3 | 一星也不想给,-1 4 | 我就想说上个苹果X我媳妇儿用了差不多两年,0 5 | 大部分人还是被iso的流畅度吸引,1 6 | 一共买了7台,0 7 | 全金属身、超薄机身,1 8 | 还是值得了,1 9 | 紫色真的很好看,1 10 | 下次介绍朋友买,1 11 | 四个档位的水雾都拍了一张图,0 12 | 外观我觉得还可以,1 13 | 而且快递竟然打电话说下午不送了,-1 14 | 比不多一个晚上,0 15 | 这说明原创上电量使用是吹了牛+的,-1 16 | 放客厅非常合适,1 17 | 可惜没那么多钱了,0 18 | 因为没有防抖,-1 19 | 一个办用,0 20 | 电陶炉不挑锅具真是太方便啦,1 21 | 安装师傅也很尽心尽力,1 22 | 但是就是等得太久了,-1 23 | 网络速度快了很多,1 24 | 非标快哦,1 25 | 目前体验了下,0 26 | 不带音频,0 27 | 让自己想办法,0 28 | 但是时间长了就好了,0 29 | 客服人员服务态度也很好,1 30 | 买来给老父亲的,0 31 | 简直就是卡死,-1 32 | 最烦的就是这个充电头了,-1 33 | 女生也能单手操作,1 34 | 总是乱来,-1 35 | 国际品牌,0 36 | 虽然四个滤芯儿是送的,0 37 | 商品跟原装的一样灵活,1 38 | 安装师傅态度还不错.第二天打了客服电话后,1 39 | 这次给女儿新房又买一款,1 40 | 家里基本上都是小米的产品,1 41 | 两个人用完全没问题,1 42 | 超级超级好,1 43 | 在4米范围内小声说话都能收到音,1 44 | 水质过滤后改善了好多,1 45 | 都是屏幕和外框接口有缝隙,-1 46 | 开始想买康佳用的是LG萍屏,0 47 | 说月底盘点月初发,0 48 | 70寸还带主机音响,1 49 | 主要功能用于看时间跟测各种健康指标,1 50 | 布料也比较结实,1 51 | 但是整体体验很好,1 52 | 可以洗好多碗具,1 53 | 看着就非常的爽,1 54 | 
摁一下自动匹配,0 55 | 才想起来没有评论,0 56 | 手感较好,1 57 | 电量也只用了一格,1 58 | 用过两个了,0 59 | 你们自己慢慢品,0 60 | 整体的效果不错,1 61 | 就像地摊20块钱以下的东西,-1 62 | 水容器大,1 63 | 讲真的以后真的不想再买苹果了,-1 64 | 外形外观:非常的好看,1 65 | 屏幕音效:质还是不错的,1 66 | 然后才发现原来的手机是移动定制机,0 67 | 就直接放在保安室,0 68 | 外观在能接受的范围,1 69 | 电池续航能力也不错,1 70 | 噪音比以前的小太多,1 71 | 价格贵点,-1 72 | 老公也说好,1 73 | 再清澈一些就好了,0 74 | 然后快递师傅还挺好的,1 75 | 一路给我打电话送到家的,1 76 | 适合小设计师们,1 77 | 用了空气泡沫包装,0 78 | 送给对象很喜欢,1 79 | 这款Lenovo,0 80 | 只当给自己起心理安慰吧,1 81 | 这两天热的时候还能降降温,1 82 | 我家阳台小尺寸刚刚好,1 83 | 送个老公的生日礼物,0 84 | 虽说不是JD自己的物流(日日顺快递公司),0 85 | 屏幕音效:l很好的,1 86 | 垃圾的东西,-1 87 | 唯一的缺点应该是拍照和想象中的差点,-1 88 | 商家送的备用PP棉芯很好,1 89 | 望继续做好售后服务,0 90 | 还好客服售后耐心与我沟通消除误会,1 91 | 荣耀品质做工越来越好了,1 92 | 我买的是86寸的电视,0 93 | 国产机越来越好了,1 94 | 使用起来相当流畅,1 95 | 京东买的有保障,1 96 | 智能反应超级迟钝,-1 97 | 用起来键盘还不错,1 98 | 居然还是坏的,-1 99 | 甚至可以用评价高达模型的角度来评价他,0 100 | 给远程安装,0 101 | 同时存电话号码时拼音输不全,-1 -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/data/sa_train.csv: -------------------------------------------------------------------------------- 1 | content,label 2 | 超划算的,1 3 | iPhone的性价比之王,1 4 | 一套不同国家电源转换插头,-1 5 | 由于安装师傅上门安装时,0 6 | 55英寸的这个创维,0 7 | 我更喜欢那个的感觉,1 8 | 拿到手机第二天莫名其妙保护膜就裂了,-1 9 | 全部试了一遍都不行,-1 10 | 送的装备也很齐全,1 11 | 原来买的是内存32G的我在用好用,1 12 | 因为修剪器充电口处有一凸起,-1 13 | 提升幸福感必备,1 14 | 9号送人,0 15 | 紫色骚气,1 16 | 搭配手持吸尘器,0 17 | 包装的很精细,1 18 | 买了好几个华为手机了,1 19 | 当时还一下子买了3台,0 20 | 很喜欢这个牌子的啊,1 21 | 运行速度:除了反应慢,-1 22 | 关键是买的别的加湿器都是水干自动断电,-1 23 | 我只想问你们一个电脑利润就这么高,-1 24 | 用用还可以暂时没有发现问题,1 25 | 在京东购物总是让我高兴,1 26 | 这个门卡使用非常灵敏、方便,1 27 | 包装充分完好,1 28 | 华为的路由器好,1 29 | 希望以后也可以的屏幕跟音效都不错的,1 30 | 无论是正面还是侧面,0 31 | 高大上的东西,1 32 | 我的收据没帮开过来,-1 33 | 外形外观:和x一样,0 34 | 商家能积极处理客户碰到的问题,1 35 | 确实不会买苹果的手机,-1 36 | 目前使用一切正常,1 37 | 唯一觉得不足的就是app,-1 38 | 其他特色:现在我京东的速度很快,1 39 | 直接不回,-1 40 | 两个不足:,-1 41 | 很好支持国货,1 42 | 从内到外都很棒,1 43 | 的质量很好,1 44 | 待机时间:能从早用到晚,1 45 | 待机时间:续航是个鸡肋,-1 46 | 看看没动手,0 47 | 说不能直接放在地板上,0 48 | 后续体验再来追评,0 49 | 而且我还是联通卡,0 50 | 我们生活使用日常杠杠滴,1 51 | 充电慢费电快,-1 52 | 自从知道评论之后京豆可以抵现金了,0 53 | 运行速度:效果很不错,1 54 | 中午送到马上装好,0 55 | 然后Windows10和以前的系统有点不同,0 56 | 到的也快,1 57 | 感觉没有我的5T好用,-1 58 | 相信京东没问题,1 59 | 晚上睡觉开起好睡多了,1 60 | 华为是首选,1 61 | 还好可以,1 62 | 适合单身租房的,1 63 | 但是只给5月24以后买的保价,0 64 | 昨晚定,0 65 | 电池比较耐用、不再担心一天一次冲,1 66 | 已经离不开戴森了,1 67 | 物流速度很快电脑到了没有任何瑕疵,1 68 | 感觉音质很棒,1 69 | 待机时间没使用的情况下4天左右可以的,0 70 | 拍照效果:基本不用,0 71 | 这次活动价,0 72 | 东西用着很好,1 73 | 好不好用还没试,0 74 | 后悔换苹果了,-1 75 | 看起来挺不错,1 76 | 以后还找你购买,1 77 | 总算是没有让人失望,1 78 | 待机时间:全是很久,1 79 | 和v20犹豫了一番,0 80 | 扫拖一起干,1 81 | 这次买净水器滤芯也是很坚决,0 82 | 买的也都是256G,0 83 | 反复装卡重启都不行,-1 84 | 我觉得有点厚,-1 85 | 结果京东物流速度很快,1 86 | 公司常年采购,0 87 | 到货速度稍慢些,-1 88 | 宝宝们都很喜欢,1 89 | 外观看上去赞,1 90 | 这个电水壶确实好用,1 91 | 物流又给力,1 92 | 在门上有档次解锁识别速度,1 93 | 商家非常诚信,1 94 | 半个月过去了,0 95 | 容量大小:3口之家正合适,1 96 | 而且京东快递小姐姐服务还好,1 97 | 平时折叠放着特别的小,0 98 | 屏幕音效:屏幕a屏,0 99 | 由于是双十一买的,0 100 | 也优惠不了几块钱,-1 101 | 打400电话也不给安装,-1 102 | 电脑性价比可以,1 103 | 送货的师傅服务超好,1 104 | 拍照效果:拍照效果就是下面那个特别真实清晰,1 105 | 这次装修先预埋,0 106 | 这款颜色漂亮,1 107 | 我严重怀疑他们发的是翻新机,-1 108 | 一个人用大小刚好,1 109 | 二档稍微有点声音,0 110 | 比如刷,0 111 | 手机一直忘了来评价,0 112 | 好看的不得了,1 113 | 而且贵的壶和便宜的区别不大,-1 114 | 用了很长时间才来评论的,0 115 | 但是你要是戴上眼镜就不能识别了,-1 116 | 尺度也够用,1 117 | 好在很清晰,1 118 | 节省了地方,1 119 | 妈妈换手机很方便,1 120 | 运行速度:运行速度比其它手机更优秀,1 121 | 师傅送货也快,1 122 | 就拍了,0 123 | 自己买配件升级拖地,0 124 | 另外充电器有点大不太方便携带,-1 125 | 指示灯都是准确完好,1 126 | 外形外观:高端大气配置较高,1 127 | 非常失望的一次购物,-1 128 | 东西真心一般,0 129 | 还赠送了手机卡,1 130 | 忘记拍照就不上图了,0 131 | 事很多,0 132 | 比L580画面清晰,1 133 | 音质很好画质也很清晰,1 134 | 极具个性,1 135 | 作为入伙礼物买的,0 136 | 再到指纹锁的使用,0 137 | 售前售后都好,1 138 | 热情服务完,1 139 | 产品吸尘很好,1 140 | 这个价格买的很值得,1 141 | 这次买的电器很满意,1 142 | 和我同事华为的比快多了,1 143 | 以后买手机还来这买,1 144 | 
中度使用也能一天,1 145 | 外形外观:手机,0 146 | 最最重要的是拿到手已经贴了原装手机屏幕膜,1 147 | 金属后盖,0 148 | 855puls稳得一批,1 149 | 办公用挺好,1 150 | 买来运动计步用的,0 151 | 扫起来很干净,1 152 | 帮朋友公司购买,0 153 | 却是没有水雾,-1 154 | 这款ⅵvo手机配骁龙855芯片,0 155 | 这个试了,0 156 | 主要客服也非常好哦,1 157 | 没想到千元机也这么强大,1 158 | 老婆很喜欢大小正合适,1 159 | 肖正友,0 160 | 发票到了,0 161 | 商家才有这样的底气,0 162 | 运行速度:麒麟980处理器运行速度很快,1 163 | 拍照质量真的超赞,1 164 | 噪音大小:还好,1 165 | 听着没啥感觉唉,0 166 | 来了就直接点亮了,0 167 | 相机算是对得起价格了,1 168 | 大小正好合适,1 169 | 沾了点水,0 170 | 就是插座这总插拔会不会坏,0 171 | 1.照相,0 172 | 卖家检测没问题,1 173 | 还一百多,0 174 | 诱光灯的效果相当不错,1 175 | 开关门声音太大了,-1 176 | 需要找人安装,0 177 | 颜值俱佳,1 178 | 装的时候就开了两三个小时还凉的挺快,1 179 | 所以硬盘都得多分点,0 180 | 值得回购哟,1 181 | 智能锁之初体验,0 182 | w这个配件还可以,1 183 | 然后还是拆开看了下,0 184 | 今年夏天特别的热,0 185 | 希望质量一如既往的好,1 186 | 购买之前对比了很多款式和牌子的电视,0 187 | 小孩子爱看动画片,1 188 | 其他的话有待日后确认,0 189 | 送给长辈用的,0 190 | 不过找遍了京东,0 191 | 好的卖家卖的宝贝,1 192 | 准备洗了,0 193 | 电视50寸以为没多大,0 194 | 600多的加湿器,0 195 | 搞得我水漏了一地,-1 196 | 一会要这样一会要那样,-1 197 | 五六个了,0 198 | 不知道是几手的,0 199 | 至少做工很不错,1 200 | 得亏有它,1 201 | 和电影里一样,0 202 | 新的刚到手就坏了你敢信,-1 203 | 基本每次充电不到半小时就能满,1 204 | 提示无网络,-1 205 | 东西巳收到,0 206 | 运行速度:暂时用起来很快,1 207 | 这两天主要试机,0 208 | 但是在京东买了这个给父亲用,0 209 | 但是图像处理能力很好,1 210 | 到今天一个月了,0 211 | 不要太挑剔,0 212 | 没其他毛病,1 213 | 解压压缩包就很热,-1 214 | 售后告诉我夏天要开6档,0 215 | 不过就是出雾好小,-1 216 | 最重要的是价格优惠,1 217 | 好说歹说收下了,0 218 | 充电二十多分钟就能充满,1 219 | 客服售后态度极差,-1 220 | 用了一天了没有一点水遗留在桌子上,1 221 | 没有蒸菜层的,-1 222 | 小不足,-1 223 | 耳机总是自己断开,-1 224 | 应该不影响使用,1 225 | 手机颜色跟图片有点不同,-1 226 | 自己找,0 227 | 值得购买大品牌,1 228 | 如果性价比可以再高一点的话就好啦,0 229 | 用十年是没问题的,1 230 | 按网上教程试了半天u盘里的都不识别,-1 231 | 不再断线了、超爽,1 232 | 估计我得抱着它睡觉,0 233 | 这次基础版入手,0 234 | 送的屏幕膜,1 235 | 在反复开关机之后才激活成功,-1 236 | 因为比实体店会便宜实惠好多,1 237 | 还没用就先恢复出厂设置了,-1 238 | 体验下,0 239 | 红薯粉色小鳄鱼竿支架子鼓浪屿啊啊啊啊五环之歌名,0 240 | 但是客服连最基本的东西都不懂,-1 241 | 雾气刚开始很小,-1 242 | 其他特色:就是原装充电器充电很慢,-1 243 | 华为10plus,0 244 | 今天安上了,0 245 | 下次再试试看,0 246 | 加水量并不多,1 247 | 说是可以无理由退换,1 248 | 总结:不推荐此型号,-1 249 | 不知道怎么样用段时间在看看,0 250 | 未使用的情况下,0 251 | 还好是练体育的,0 252 | 但是运行速度还行,1 253 | 问了又说去那边问,-1 254 | 也买了一台,0 255 | 收到宝贝特意用了一段时间过来评价,0 256 | 安装的师傅也很尽心,1 257 | 想不想买也就自己看的办吧,0 258 | 安装起来也很简单,1 259 | 还送整机10年保修,1 260 | 其他特色:拍照效果好,1 261 | 放两条就洗不干净了,-1 262 | 也不厚,0 263 | 对用户使用很友好,1 264 | 物流服务都超好,1 265 | 就是最简单的一种,1 266 | 比想象中的要小的多,-1 267 | 没想到这么大声音,-1 268 | 电视语音识别度很高在厨房都可以听清楚你要说的话,1 269 | 但是还是稍微重了一些,-1 270 | 和商家描述得一致,1 271 | 手机外观轻薄款音效非常好,1 272 | 厂家直接给我换一个,1 273 | 放心了不少,1 274 | 这个还没换呢,0 275 | 装上后那水立刻变得清澈透明,1 276 | 质量问题,-1 277 | 可以无线打印,1 278 | 比较热情,1 279 | 这种电器之类的,0 280 | 安装师傅的服务态度非常好,1 281 | 质量也满意,1 282 | 按照他们的方法,0 283 | 今天最后一局还剩百分之11的电我又开了一局,0 284 | 很好正品,1 285 | 亲测待机量30+,0 286 | 但是想起这个屏幕,0 287 | 沉浸式音乐感受&rdquo,1 288 | 京东大家电价格便宜,1 289 | 然后色彩还原度也比较好,1 290 | 1-2秒就能识别,1 291 | 明明是想买路由器的,0 292 | 能带大耳机,0 293 | 商品没啥,0 294 | 出差、旅行携带方便,1 295 | 声明快递员很给力,1 296 | 现在的快递都是下楼自己拿的,0 297 | 还没有我九块九三张的好,-1 298 | 特意把儿子煲好的IE60拿了试了下,0 299 | 快递很差,-1 300 | 为健康观影保驾护航,0 301 | 外形外观:白色的加点彩虹一样的颜色,0 302 | 非常美😊,1 303 | 非常耐心❤️",1 304 | 非常耐斯👌,1 305 | 非常讨人喜欢💕,1 306 | 非常让人喜欢😊,1 307 | 非常贴近我的状👌态,1 308 | 非常赞👍,1 309 | 非常赞👍家里之前的冰箱有点小了,1 310 | 非常迷你😄,1 311 | 非常适合女生👍👍,1 312 | 非常适合我这种臭美的女生😏,1 313 | 非常适合看😜😜😜,1 314 | 非常鄙视👎👎👎👎,-1 315 | 非常长不错😊",1 316 | #😠骗人的吗我投诉,-1 317 | #🤢🤢🤢,-1 318 | (泽米张帆)❤",1 319 | (给租户买的😂)颜值还挺高的,1 320 | )-♡😊喜欢的程度没法用言语表达,1 321 | )_💑啊对充电速度真的很快,1 322 | *****超棒哈哈哈😁,1 323 | **商家、恶心🤢,-1 324 | **让我彻底伤心💔,-1 325 | 日了🐶,-1 326 | 日了🐶了,-1 327 | 2.自己的苏宁物流慢的像一坨💩(第一天早上八点下单,-1 -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/hyperparameters.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | 
Created on Mon Nov 12 14:23:12 2020 4 | 5 | @author: cm 6 | """ 7 | 8 | 9 | import os 10 | import sys 11 | pwd = os.path.dirname(os.path.abspath(__file__)) 12 | sys.path.append(pwd) 13 | 14 | from sentiment_analysis_albert_emoji.utils import load_vocabulary 15 | 16 | 17 | class Hyperparamters: 18 | # Train parameters 19 | num_train_epochs = 5 20 | print_step = 10 21 | batch_size = 64 22 | batch_size_eval = 128 23 | summary_step = 10 24 | num_saved_per_epoch = 3 25 | max_to_keep = 100 26 | 27 | # File model 28 | logdir = 'logdir/model_01' 29 | file_save_model = 'model/model_save' 30 | file_load_model = 'model/model_load' 31 | 32 | # Train data and test data 33 | train_data = "sa_train.csv" 34 | test_data = "sa_test.csv" 35 | 36 | # Optimization parameters 37 | warmup_proportion = 0.1 38 | use_tpu = None 39 | do_lower_case = True 40 | learning_rate = 5e-5 41 | 42 | # TextCNN parameters 43 | num_filters = 128 44 | filter_sizes = [2,3,4,5,6,7] 45 | embedding_size = 768 46 | keep_prob = 0.5 47 | 48 | # Sequence and Label 49 | sequence_length = 128 50 | num_labels = 3 51 | dict_label = { 52 | '0': '-1', 53 | '1': '0', 54 | '2': '1'} 55 | 56 | # ALBERT parameters 57 | name = 'albert_base_zh' 58 | bert_path = os.path.join(pwd,name) 59 | data_dir = os.path.join(pwd,'data') 60 | vocab_file = os.path.join(pwd,name,'vocab_chinese.txt') 61 | init_checkpoint = os.path.join(pwd,name,'albert_model.ckpt') 62 | saved_model_path = os.path.join(pwd,'model') 63 | 64 | # Emoji 65 | sequence_length_emoji = 16 66 | file_vocab_emoji = os.path.join(pwd, name, 'vocab_emoji.txt') 67 | vocab_emoji_id2char,vocab_emoji_char2id = load_vocabulary(file_vocab_emoji) 68 | vocab_size_emoji = len(vocab_emoji_char2id) 69 | 70 | 71 | if __name__ == '__main__': 72 | hp = Hyperparamters() 73 | print(hp.batch_size) 74 | 75 | -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/lamb_optimizer.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Nov 12 14:23:12 2020 4 | 5 | @author: cm 6 | """ 7 | 8 | 9 | 10 | from __future__ import absolute_import 11 | from __future__ import division 12 | from __future__ import print_function 13 | 14 | import re 15 | import six 16 | import tensorflow.compat.v1 as tf 17 | from tensorflow.python.ops import array_ops 18 | from tensorflow.python.ops import linalg_ops 19 | from tensorflow.python.ops import math_ops 20 | 21 | 22 | 23 | class LAMBOptimizer(tf.train.Optimizer): 24 | """LAMB (Layer-wise Adaptive Moments optimizer for Batch training).""" 25 | # A new optimizer that includes correct L2 weight decay, adaptive 26 | # element-wise updating, and layer-wise justification. 
The LAMB optimizer 27 | # was proposed by Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, 28 | # James Demmel, and Cho-Jui Hsieh in a paper titled as Reducing BERT 29 | # Pre-Training Time from 3 Days to 76 Minutes (arxiv.org/abs/1904.00962) 30 | 31 | def __init__(self, 32 | learning_rate, 33 | weight_decay_rate=0.0, 34 | beta_1=0.9, 35 | beta_2=0.999, 36 | epsilon=1e-6, 37 | exclude_from_weight_decay=None, 38 | exclude_from_layer_adaptation=None, 39 | name="LAMBOptimizer"): 40 | """Constructs a LAMBOptimizer.""" 41 | super(LAMBOptimizer, self).__init__(False, name) 42 | 43 | self.learning_rate = learning_rate 44 | self.weight_decay_rate = weight_decay_rate 45 | self.beta_1 = beta_1 46 | self.beta_2 = beta_2 47 | self.epsilon = epsilon 48 | self.exclude_from_weight_decay = exclude_from_weight_decay 49 | # exclude_from_layer_adaptation is set to exclude_from_weight_decay if the 50 | # arg is None. 51 | # TODO(jingli): validate if exclude_from_layer_adaptation is necessary. 52 | if exclude_from_layer_adaptation: 53 | self.exclude_from_layer_adaptation = exclude_from_layer_adaptation 54 | else: 55 | self.exclude_from_layer_adaptation = exclude_from_weight_decay 56 | 57 | def apply_gradients(self, grads_and_vars, global_step=None, name=None): 58 | """See base class.""" 59 | assignments = [] 60 | for (grad, param) in grads_and_vars: 61 | if grad is None or param is None: 62 | continue 63 | 64 | param_name = self._get_variable_name(param.name) 65 | 66 | m = tf.get_variable( 67 | name=six.ensure_str(param_name) + "/adam_m", 68 | shape=param.shape.as_list(), 69 | dtype=tf.float32, 70 | trainable=False, 71 | initializer=tf.zeros_initializer()) 72 | v = tf.get_variable( 73 | name=six.ensure_str(param_name) + "/adam_v", 74 | shape=param.shape.as_list(), 75 | dtype=tf.float32, 76 | trainable=False, 77 | initializer=tf.zeros_initializer()) 78 | 79 | # Standard Adam update. 80 | next_m = ( 81 | tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad)) 82 | next_v = ( 83 | tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2, 84 | tf.square(grad))) 85 | 86 | update = next_m / (tf.sqrt(next_v) + self.epsilon) 87 | 88 | # Just adding the square of the weights to the loss function is *not* 89 | # the correct way of using L2 regularization/weight decay with Adam, 90 | # since that will interact with the m and v parameters in strange ways. 91 | # 92 | # Instead we want ot decay the weights in a manner that doesn't interact 93 | # with the m/v parameters. This is equivalent to adding the square 94 | # of the weights to the loss with plain (non-momentum) SGD. 
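      # Informal sketch of the step performed below (just the math, no extra code):
      # with adam_update = next_m / (sqrt(next_v) + epsilon),
      #   update    = adam_update + weight_decay_rate * param   (decoupled decay)
      #   ratio     = ||param|| / ||update||   (1.0 when layer adaptation is
      #                                         skipped or a norm is zero)
      #   new_param = param - ratio * learning_rate * update
      # so the decay term never passes through the m/v moment estimates.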
95 | if self._do_use_weight_decay(param_name): 96 | update += self.weight_decay_rate * param 97 | 98 | ratio = 1.0 99 | if self._do_layer_adaptation(param_name): 100 | w_norm = linalg_ops.norm(param, ord=2) 101 | g_norm = linalg_ops.norm(update, ord=2) 102 | ratio = array_ops.where(math_ops.greater(w_norm, 0), array_ops.where( 103 | math_ops.greater(g_norm, 0), (w_norm / g_norm), 1.0), 1.0) 104 | 105 | update_with_lr = ratio * self.learning_rate * update 106 | 107 | next_param = param - update_with_lr 108 | 109 | assignments.extend( 110 | [param.assign(next_param), 111 | m.assign(next_m), 112 | v.assign(next_v)]) 113 | return tf.group(*assignments, name=name) 114 | 115 | def _do_use_weight_decay(self, param_name): 116 | """Whether to use L2 weight decay for `param_name`.""" 117 | if not self.weight_decay_rate: 118 | return False 119 | if self.exclude_from_weight_decay: 120 | for r in self.exclude_from_weight_decay: 121 | if re.search(r, param_name) is not None: 122 | return False 123 | return True 124 | 125 | def _do_layer_adaptation(self, param_name): 126 | """Whether to do layer-wise learning rate adaptation for `param_name`.""" 127 | if self.exclude_from_layer_adaptation: 128 | for r in self.exclude_from_layer_adaptation: 129 | if re.search(r, param_name) is not None: 130 | return False 131 | return True 132 | 133 | def _get_variable_name(self, param_name): 134 | """Get the variable name from the tensor name.""" 135 | m = re.match("^(.*):\\d+$", six.ensure_str(param_name)) 136 | if m is not None: 137 | param_name = m.group(1) 138 | return param_name 139 | -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/model/model_load/README.md: -------------------------------------------------------------------------------- 1 | # 推理所用的模型 2 | 3 | ## 模型路径 4 | - model_load/checkpoit 5 | - model_load/model_xx_0.ckpt.data-00000-of-00001 6 | - model_load/model_xx_0.ckpt.index 7 | - model_load/model_xx_0.ckpt.meta 8 | 9 | ## checkpoint内容 10 | ``` 11 | model_checkpoint_path: "model_xx_0.ckpt" 12 | ``` 13 | -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/model/model_save/README.md: -------------------------------------------------------------------------------- 1 | # 训练过程所得模型 2 | -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/modules.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Thu May 30 21:01:45 2020 4 | 5 | @author: cm 6 | """ 7 | 8 | 9 | import tensorflow as tf 10 | from tensorflow.contrib.rnn import DropoutWrapper 11 | 12 | from sentiment_analysis_albert_emoji.hyperparameters import Hyperparamters as hp 13 | 14 | 15 | 16 | def cell_embedding(inputs): 17 | with tf.device('/cpu:0'): 18 | W = tf.Variable(tf.random_uniform([hp.vocab_size_emoji, hp.embedding_size], -1.0, 1.0), name="W_embedding") 19 | embedded_input = tf.nn.embedding_lookup(W,inputs) 20 | return embedded_input 21 | 22 | 23 | def cell_textcnn_two(inputs,is_training): 24 | # Add a dimension in the end:-1 25 | inputs_expand = tf.expand_dims(inputs, -1) 26 | # Create a convolution + maxpool layer for each filter size 27 | pooled_outputs = [] 28 | with tf.name_scope("TextCNN"): 29 | for i, filter_size in enumerate(hp.filter_sizes): 30 | with tf.name_scope("conv-maxpool-%s" % filter_size): 31 | # Convolution Layer 32 | filter_shape = [filter_size, 
hp.embedding_size, 1, hp.num_filters] 33 | W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1),dtype=tf.float32, name="W") 34 | b = tf.Variable(tf.constant(0.1, shape=[hp.num_filters]),dtype=tf.float32, name="b") 35 | conv = tf.nn.conv2d( 36 | inputs_expand, 37 | W, 38 | strides=[1, 1, 1, 1], 39 | padding="VALID", 40 | name="conv") 41 | # Apply nonlinearity 42 | h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu") 43 | # Maxpooling over the outputs 44 | pooled = tf.nn.max_pool( 45 | h, 46 | ksize=[1, hp.sequence_length + hp.sequence_length_emoji - filter_size + 1, 1, 1], 47 | strides=[1, 1, 1, 1], 48 | padding='VALID', 49 | name="pool") 50 | pooled_outputs.append(pooled) 51 | # Combine all the pooled features 52 | with tf.name_scope("Concat"): 53 | num_filters_total = hp.num_filters * len(hp.filter_sizes) 54 | h_pool = tf.concat(pooled_outputs, 3) 55 | h_pool_flat = tf.reshape(h_pool, [-1, num_filters_total]) 56 | # Dropout 57 | h_pool_flat_dropout = tf.nn.dropout(h_pool_flat, keep_prob=hp.keep_prob if is_training else 1) 58 | return h_pool_flat_dropout 59 | 60 | -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/networks.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Thu May 30 20:44:42 2020 4 | 5 | @author: cm 6 | """ 7 | 8 | 9 | import os 10 | import tensorflow as tf 11 | 12 | from sentiment_analysis_albert_emoji import modeling,optimization 13 | from sentiment_analysis_albert_emoji.classifier_utils import ClassifyProcessor 14 | from sentiment_analysis_albert_emoji.hyperparameters import Hyperparamters as hp 15 | from sentiment_analysis_albert_emoji.utils import time_now_string 16 | from sentiment_analysis_albert_emoji.modules import cell_textcnn_two 17 | from sentiment_analysis_albert_emoji.modules import cell_embedding 18 | 19 | 20 | num_labels = hp.num_labels 21 | processor = ClassifyProcessor() 22 | bert_config_file = os.path.join(hp.bert_path,'albert_config.json') 23 | bert_config = modeling.AlbertConfig.from_json_file(bert_config_file) 24 | 25 | 26 | def count_model_params(): 27 | """ 28 | Compte the parameters 29 | """ 30 | total_parameters = 0 31 | for variable in tf.trainable_variables(): 32 | shape = variable.get_shape() 33 | variable_parameters = 1 34 | for dim in shape: 35 | variable_parameters *= dim.value 36 | total_parameters += variable_parameters 37 | print(' + Number of params: %.2fM' % (total_parameters / 1e6)) 38 | 39 | 40 | class NetworkAlbert(object): 41 | def __init__(self,is_training): 42 | self.is_training = is_training 43 | 44 | # Chinese placeholder 45 | self.input_ids = tf.placeholder(tf.int32, shape=[None, hp.sequence_length], name='input_ids') 46 | self.input_masks = tf.placeholder(tf.int32, shape=[None, hp.sequence_length], name='input_masks') 47 | self.segment_ids = tf.placeholder(tf.int32, shape=[None, hp.sequence_length], name='segment_ids') 48 | self.label_ids = tf.placeholder(tf.int32, shape=[None], name='label_ids') 49 | 50 | 51 | # Load BERT Pre-training LM 52 | self.model = modeling.AlbertModel( 53 | config=bert_config, 54 | is_training=self.is_training, 55 | input_ids=self.input_ids, 56 | input_mask=self.input_masks, 57 | token_type_ids=self.segment_ids, 58 | use_one_hot_embeddings=False) 59 | 60 | # Get the feature vector with size 3D:(batch_size,sequence_length,hidden_size) 61 | output_layer_init = self.model.get_sequence_output() 62 | 63 | # Emoji placeholder 64 | self.input_emojis = 
tf.placeholder(tf.int32, shape=[None, hp.sequence_length_emoji], name='input_id_emoji') 65 | 66 | # Cell emoji embedding 67 | self.emoji_embedding = cell_embedding(self.input_emojis) 68 | 69 | # Concat 70 | output_all = tf.concat([output_layer_init,self.emoji_embedding],1) 71 | 72 | # Cell textcnn-emoji 73 | output_layer = cell_textcnn_two(output_all,self.is_training) 74 | 75 | # Hidden size 76 | hidden_size = output_layer.shape[-1].value 77 | 78 | # Full-connection 79 | with tf.name_scope("Full-connection"): 80 | output_weights = tf.get_variable( 81 | "output_weights", [num_labels, hidden_size], 82 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 83 | output_bias = tf.get_variable( 84 | "output_bias", [num_labels], initializer=tf.zeros_initializer()) 85 | # Logit 86 | self.logits = tf.nn.bias_add(tf.matmul(output_layer, output_weights, transpose_b=True), output_bias) 87 | self.probabilities = tf.nn.softmax(self.logits, axis=-1) 88 | 89 | # Prediction 90 | with tf.variable_scope("Prediction"): 91 | self.preds = tf.argmax(self.logits, axis=-1, output_type=tf.int32) 92 | 93 | # Summary for tensorboard 94 | with tf.variable_scope("Loss"): 95 | if self.is_training: 96 | self.accuracy = tf.reduce_mean(tf.to_float(tf.equal(self.preds, self.label_ids))) 97 | tf.summary.scalar('Accuracy', self.accuracy) 98 | 99 | # Check whether has loaded model 100 | ckpt = tf.train.get_checkpoint_state(hp.saved_model_path) 101 | checkpoint_suffix = ".index" 102 | if ckpt and tf.gfile.Exists(ckpt.model_checkpoint_path + checkpoint_suffix): 103 | print('='*10,'Restoring model from checkpoint!','='*10) 104 | print("%s - Restoring model from checkpoint ~%s" % (time_now_string(), 105 | ckpt.model_checkpoint_path)) 106 | else: 107 | # Load BERT Pre-training LM 108 | print('First time load BERT model!') 109 | tvars = tf.trainable_variables() 110 | if hp.init_checkpoint: 111 | (assignment_map, initialized_variable_names) = \ 112 | modeling.get_assignment_map_from_checkpoint(tvars, 113 | hp.init_checkpoint) 114 | tf.train.init_from_checkpoint(hp.init_checkpoint, assignment_map) 115 | 116 | # Optimization 117 | if self.is_training: 118 | # Global_step 119 | self.global_step = tf.Variable(0, name='global_step', trainable=False) 120 | # Loss 121 | log_probs = tf.nn.log_softmax(self.logits, axis=-1) 122 | one_hot_labels = tf.one_hot(self.label_ids, depth=num_labels, dtype=tf.float32) 123 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 124 | self.loss = tf.reduce_mean(per_example_loss) 125 | # Optimizer 126 | train_examples = processor.get_train_examples(hp.data_dir) 127 | num_train_steps = int( 128 | len(train_examples) / hp.batch_size * hp.num_train_epochs) 129 | num_warmup_steps = int(num_train_steps * hp.warmup_proportion) 130 | self.optimizer = optimization.create_optimizer(self.loss, 131 | hp.learning_rate, 132 | num_train_steps, 133 | num_warmup_steps, 134 | hp.use_tpu, 135 | Global_step=self.global_step, 136 | ) 137 | # Summary for tensorboard 138 | tf.summary.scalar('Loss', self.loss) 139 | self.merged = tf.summary.merge_all() 140 | 141 | # Compte the parameters 142 | count_model_params() 143 | vs = tf.trainable_variables() 144 | for l in vs: 145 | print(l) 146 | 147 | 148 | if __name__ == '__main__': 149 | # Load model 150 | albert = NetworkAlbert(is_training=True) 151 | 152 | -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/optimization.py: 
-------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Nov 12 14:23:12 2020 4 | 5 | @author: cm 6 | """ 7 | 8 | 9 | 10 | """Functions and classes related to optimization (weight updates).""" 11 | 12 | from __future__ import absolute_import 13 | from __future__ import division 14 | from __future__ import print_function 15 | import re 16 | import six 17 | from six.moves import zip 18 | import tensorflow.compat.v1 as tf 19 | from tensorflow.contrib import tpu as contrib_tpu 20 | 21 | from sentiment_analysis_albert_emoji import lamb_optimizer 22 | 23 | 24 | 25 | def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu,Global_step, 26 | optimizer="adamw", poly_power=1.0, start_warmup_step=0): 27 | """Creates an optimizer training op.""" 28 | if Global_step: 29 | global_step = Global_step 30 | else: 31 | global_step = tf.train.get_or_create_global_step() 32 | 33 | learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32) 34 | 35 | # Implements linear decay of the learning rate. 36 | learning_rate = tf.train.polynomial_decay( 37 | learning_rate, 38 | global_step, 39 | num_train_steps, 40 | end_learning_rate=0.0, 41 | power=poly_power, 42 | cycle=False) 43 | 44 | # Implements linear warmup. I.e., if global_step - start_warmup_step < 45 | # num_warmup_steps, the learning rate will be 46 | # `(global_step - start_warmup_step)/num_warmup_steps * init_lr`. 47 | if num_warmup_steps: 48 | tf.logging.info("++++++ warmup starts at step " + str(start_warmup_step) 49 | + ", for " + str(num_warmup_steps) + " steps ++++++") 50 | global_steps_int = tf.cast(global_step, tf.int32) 51 | start_warm_int = tf.constant(start_warmup_step, dtype=tf.int32) 52 | global_steps_int = global_steps_int - start_warm_int 53 | warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32) 54 | 55 | global_steps_float = tf.cast(global_steps_int, tf.float32) 56 | warmup_steps_float = tf.cast(warmup_steps_int, tf.float32) 57 | 58 | warmup_percent_done = global_steps_float / warmup_steps_float 59 | warmup_learning_rate = init_lr * warmup_percent_done 60 | 61 | is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32) 62 | learning_rate = ( 63 | (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate) 64 | 65 | # It is OK that you use this optimizer for finetuning, since this 66 | # is how the model was trained (note that the Adam m/v variables are NOT 67 | # loaded from init_checkpoint.) 68 | # It is OK to use AdamW in the finetuning even the model is trained by LAMB. 69 | # As report in the Bert pulic github, the learning rate for SQuAD 1.1 finetune 70 | # is 3e-5, 4e-5 or 5e-5. For LAMB, the users can use 3e-4, 4e-4,or 5e-4 for a 71 | # batch size of 64 in the finetune. 
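  # For reference, networks.py in this project calls this function as
  #   optimization.create_optimizer(loss, hp.learning_rate, num_train_steps,
  #                                 num_warmup_steps, hp.use_tpu,
  #                                 Global_step=global_step)
  # leaving `optimizer` at its default "adamw", so the first branch below is the
  # one exercised during fine-tuning.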
72 | if optimizer == "adamw": 73 | tf.logging.info("using adamw") 74 | optimizer = AdamWeightDecayOptimizer( 75 | learning_rate=learning_rate, 76 | weight_decay_rate=0.01, 77 | beta_1=0.9, 78 | beta_2=0.999, 79 | epsilon=1e-6, 80 | exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"]) 81 | elif optimizer == "lamb": 82 | tf.logging.info("using lamb") 83 | optimizer = lamb_optimizer.LAMBOptimizer( 84 | learning_rate=learning_rate, 85 | weight_decay_rate=0.01, 86 | beta_1=0.9, 87 | beta_2=0.999, 88 | epsilon=1e-6, 89 | exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"]) 90 | else: 91 | raise ValueError("Not supported optimizer: ", optimizer) 92 | 93 | if use_tpu: 94 | optimizer = contrib_tpu.CrossShardOptimizer(optimizer) 95 | 96 | tvars = tf.trainable_variables() 97 | grads = tf.gradients(loss, tvars) 98 | 99 | # This is how the model was pre-trained. 100 | (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0) 101 | 102 | train_op = optimizer.apply_gradients( 103 | list(zip(grads, tvars)), global_step=global_step) 104 | 105 | # Normally the global step update is done inside of `apply_gradients`. 106 | # However, neither `AdamWeightDecayOptimizer` nor `LAMBOptimizer` do this. 107 | # But if you use a different optimizer, you should probably take this line 108 | # out. 109 | new_global_step = global_step + 1 110 | train_op = tf.group(train_op, [global_step.assign(new_global_step)]) 111 | return train_op 112 | 113 | 114 | class AdamWeightDecayOptimizer(tf.train.Optimizer): 115 | """A basic Adam optimizer that includes "correct" L2 weight decay.""" 116 | 117 | def __init__(self, 118 | learning_rate, 119 | weight_decay_rate=0.0, 120 | beta_1=0.9, 121 | beta_2=0.999, 122 | epsilon=1e-6, 123 | exclude_from_weight_decay=None, 124 | name="AdamWeightDecayOptimizer"): 125 | """Constructs a AdamWeightDecayOptimizer.""" 126 | super(AdamWeightDecayOptimizer, self).__init__(False, name) 127 | 128 | self.learning_rate = learning_rate 129 | self.weight_decay_rate = weight_decay_rate 130 | self.beta_1 = beta_1 131 | self.beta_2 = beta_2 132 | self.epsilon = epsilon 133 | self.exclude_from_weight_decay = exclude_from_weight_decay 134 | 135 | def apply_gradients(self, grads_and_vars, global_step=None, name=None): 136 | """See base class.""" 137 | assignments = [] 138 | for (grad, param) in grads_and_vars: 139 | if grad is None or param is None: 140 | continue 141 | 142 | param_name = self._get_variable_name(param.name) 143 | 144 | m = tf.get_variable( 145 | name=six.ensure_str(param_name) + "/adam_m", 146 | shape=param.shape.as_list(), 147 | dtype=tf.float32, 148 | trainable=False, 149 | initializer=tf.zeros_initializer()) 150 | v = tf.get_variable( 151 | name=six.ensure_str(param_name) + "/adam_v", 152 | shape=param.shape.as_list(), 153 | dtype=tf.float32, 154 | trainable=False, 155 | initializer=tf.zeros_initializer()) 156 | 157 | # Standard Adam update. 158 | next_m = ( 159 | tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad)) 160 | next_v = ( 161 | tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2, 162 | tf.square(grad))) 163 | 164 | update = next_m / (tf.sqrt(next_v) + self.epsilon) 165 | 166 | # Just adding the square of the weights to the loss function is *not* 167 | # the correct way of using L2 regularization/weight decay with Adam, 168 | # since that will interact with the m and v parameters in strange ways. 169 | # 170 | # Instead we want ot decay the weights in a manner that doesn't interact 171 | # with the m/v parameters. 
This is equivalent to adding the square 172 | # of the weights to the loss with plain (non-momentum) SGD. 173 | if self._do_use_weight_decay(param_name): 174 | update += self.weight_decay_rate * param 175 | 176 | update_with_lr = self.learning_rate * update 177 | 178 | next_param = param - update_with_lr 179 | 180 | assignments.extend( 181 | [param.assign(next_param), 182 | m.assign(next_m), 183 | v.assign(next_v)]) 184 | return tf.group(*assignments, name=name) 185 | 186 | def _do_use_weight_decay(self, param_name): 187 | """Whether to use L2 weight decay for `param_name`.""" 188 | if not self.weight_decay_rate: 189 | return False 190 | if self.exclude_from_weight_decay: 191 | for r in self.exclude_from_weight_decay: 192 | if re.search(r, param_name) is not None: 193 | return False 194 | return True 195 | 196 | def _get_variable_name(self, param_name): 197 | """Get the variable name from the tensor name.""" 198 | m = re.match("^(.*):\\d+$", six.ensure_str(param_name)) 199 | if m is not None: 200 | param_name = m.group(1) 201 | return param_name 202 | -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/predict.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Thu May 30 17:12:37 2020 4 | 5 | @author: cm 6 | """ 7 | 8 | 9 | import os 10 | import sys 11 | import tensorflow as tf 12 | pwd = os.path.dirname(os.path.abspath(__file__)) 13 | # os.environ["CUDA_VISIBLE_DEVICES"] = '-1' 14 | sys.path.append(os.path.dirname(os.path.dirname(__file__))) 15 | 16 | 17 | from sentiment_analysis_albert_emoji.networks import NetworkAlbert 18 | from sentiment_analysis_albert_emoji.hyperparameters import Hyperparamters as hp 19 | from sentiment_analysis_albert_emoji.classifier_utils import get_feature_test 20 | from sentiment_analysis_albert_emoji.classifier_utils import get_feature_emoji_test 21 | 22 | 23 | 24 | class ModelAlbertTextCNN(object): 25 | """ 26 | Load NetworkAlbert TextCNN model 27 | """ 28 | def __init__(self,): 29 | self.albert, self.sess = self.load_model() 30 | @staticmethod 31 | def load_model(): 32 | with tf.Graph().as_default(): 33 | sess = tf.Session() 34 | with sess.as_default(): 35 | albert = NetworkAlbert(is_training=False) 36 | saver = tf.train.Saver() 37 | sess.run(tf.global_variables_initializer()) 38 | checkpoint_dir = os.path.abspath(os.path.join(pwd,hp.file_load_model))#small-google-gelu 39 | print (checkpoint_dir) 40 | ckpt = tf.train.get_checkpoint_state(checkpoint_dir) 41 | saver.restore(sess, ckpt.model_checkpoint_path) 42 | return albert,sess 43 | 44 | MODEL = ModelAlbertTextCNN() 45 | print('Load model finished!') 46 | 47 | 48 | 49 | def get_sa(sentence): 50 | """ 51 | Prediction of the sentence's sentiment. 
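    Returns an integer in {-1, 0, 1}: the network predicts a class index in
    {0, 1, 2}, and `output[0] - 1` shifts it back to the original labels
    (-1 negative, 0 neutral, 1 positive), consistent with `hp.dict_label`.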
52 | """ 53 | feature = get_feature_test(sentence) 54 | feature_emoji = get_feature_emoji_test(sentence) 55 | fd = {MODEL.albert.input_ids: [feature[0]], 56 | MODEL.albert.input_masks: [feature[1]], 57 | MODEL.albert.segment_ids:[feature[2]], 58 | MODEL.albert.input_emojis:[feature_emoji] 59 | } 60 | output = MODEL.sess.run(MODEL.albert.preds, feed_dict=fd) 61 | return output[0]-1 62 | 63 | 64 | 65 | if __name__ == '__main__': 66 | ## 67 | sentence ='拍照效果:👍' 68 | print ("情感分析结果:",get_sa(sentence)) 69 | 70 | 71 | -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/requirements.txt: -------------------------------------------------------------------------------- 1 | tensorflow1.15 2 | sentencepiece 3 | -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/tokenization.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Nov 12 14:23:12 2020 4 | 5 | @author: cm 6 | """ 7 | 8 | 9 | 10 | """Tokenization classes.""" 11 | 12 | from __future__ import absolute_import 13 | from __future__ import division 14 | from __future__ import print_function 15 | 16 | import collections 17 | import re 18 | import unicodedata 19 | import six 20 | from six.moves import range 21 | import tensorflow.compat.v1 as tf 22 | import sentencepiece as spm 23 | 24 | 25 | SPIECE_UNDERLINE = u"▁".encode("utf-8") 26 | 27 | 28 | def validate_case_matches_checkpoint(do_lower_case, init_checkpoint): 29 | """Checks whether the casing config is consistent with the checkpoint name.""" 30 | 31 | # The casing has to be passed in by the user and there is no explicit check 32 | # as to whether it matches the checkpoint. The casing information probably 33 | # should have been stored in the bert_config.json file, but it's not, so 34 | # we have to heuristically detect it to validate. 35 | 36 | if not init_checkpoint: 37 | return 38 | 39 | m = re.match("^.*?([A-Za-z0-9_-]+)/bert_model.ckpt", 40 | six.ensure_str(init_checkpoint)) 41 | if m is None: 42 | return 43 | 44 | model_name = m.group(1) 45 | 46 | lower_models = [ 47 | "uncased_L-24_H-1024_A-16", "uncased_L-12_H-768_A-12", 48 | "multilingual_L-12_H-768_A-12", "chinese_L-12_H-768_A-12" 49 | ] 50 | 51 | cased_models = [ 52 | "cased_L-12_H-768_A-12", "cased_L-24_H-1024_A-16", 53 | "multi_cased_L-12_H-768_A-12" 54 | ] 55 | 56 | is_bad_config = False 57 | if model_name in lower_models and not do_lower_case: 58 | is_bad_config = True 59 | actual_flag = "False" 60 | case_name = "lowercased" 61 | opposite_flag = "True" 62 | 63 | if model_name in cased_models and do_lower_case: 64 | is_bad_config = True 65 | actual_flag = "True" 66 | case_name = "cased" 67 | opposite_flag = "False" 68 | 69 | if is_bad_config: 70 | raise ValueError( 71 | "You passed in `--do_lower_case=%s` with `--init_checkpoint=%s`. " 72 | "However, `%s` seems to be a %s model, so you " 73 | "should pass in `--do_lower_case=%s` so that the fine-tuning matches " 74 | "how the model was pre-training. If this error is wrong, please " 75 | "just comment out this check." 
% (actual_flag, init_checkpoint, 76 | model_name, case_name, opposite_flag)) 77 | 78 | 79 | def preprocess_text(inputs, remove_space=True, lower=False): 80 | """preprocess data by removing extra space and normalize data.""" 81 | outputs = inputs 82 | if remove_space: 83 | outputs = " ".join(inputs.strip().split()) 84 | 85 | if six.PY2 and isinstance(outputs, str): 86 | try: 87 | outputs = six.ensure_text(outputs, "utf-8") 88 | except UnicodeDecodeError: 89 | outputs = six.ensure_text(outputs, "latin-1") 90 | 91 | outputs = unicodedata.normalize("NFKD", outputs) 92 | outputs = "".join([c for c in outputs if not unicodedata.combining(c)]) 93 | if lower: 94 | outputs = outputs.lower() 95 | 96 | return outputs 97 | 98 | 99 | def encode_pieces(sp_model, text, return_unicode=True, sample=False): 100 | """turn sentences into word pieces.""" 101 | 102 | if six.PY2 and isinstance(text, six.text_type): 103 | text = six.ensure_binary(text, "utf-8") 104 | 105 | if not sample: 106 | pieces = sp_model.EncodeAsPieces(text) 107 | else: 108 | pieces = sp_model.SampleEncodeAsPieces(text, 64, 0.1) 109 | new_pieces = [] 110 | for piece in pieces: 111 | piece = printable_text(piece) 112 | if len(piece) > 1 and piece[-1] == "," and piece[-2].isdigit(): 113 | cur_pieces = sp_model.EncodeAsPieces( 114 | six.ensure_binary(piece[:-1]).replace(SPIECE_UNDERLINE, b"")) 115 | if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE: 116 | if len(cur_pieces[0]) == 1: 117 | cur_pieces = cur_pieces[1:] 118 | else: 119 | cur_pieces[0] = cur_pieces[0][1:] 120 | cur_pieces.append(piece[-1]) 121 | new_pieces.extend(cur_pieces) 122 | else: 123 | new_pieces.append(piece) 124 | 125 | # note(zhiliny): convert back to unicode for py2 126 | if six.PY2 and return_unicode: 127 | ret_pieces = [] 128 | for piece in new_pieces: 129 | if isinstance(piece, str): 130 | piece = six.ensure_text(piece, "utf-8") 131 | ret_pieces.append(piece) 132 | new_pieces = ret_pieces 133 | 134 | return new_pieces 135 | 136 | 137 | def encode_ids(sp_model, text, sample=False): 138 | pieces = encode_pieces(sp_model, text, return_unicode=False, sample=sample) 139 | ids = [sp_model.PieceToId(piece) for piece in pieces] 140 | return ids 141 | 142 | 143 | def convert_to_unicode(text): 144 | """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" 145 | if six.PY3: 146 | if isinstance(text, str): 147 | return text 148 | elif isinstance(text, bytes): 149 | return six.ensure_text(text, "utf-8", "ignore") 150 | else: 151 | raise ValueError("Unsupported string type: %s" % (type(text))) 152 | elif six.PY2: 153 | if isinstance(text, str): 154 | return six.ensure_text(text, "utf-8", "ignore") 155 | elif isinstance(text, six.text_type): 156 | return text 157 | else: 158 | raise ValueError("Unsupported string type: %s" % (type(text))) 159 | else: 160 | raise ValueError("Not running on Python2 or Python 3?") 161 | 162 | 163 | def printable_text(text): 164 | """Returns text encoded in a way suitable for print or `tf.logging`.""" 165 | 166 | # These functions want `str` for both Python2 and Python3, but in one case 167 | # it's a Unicode string and in the other it's a byte string. 
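  # Intended behaviour on Python 3 (illustrative): printable_text("天天向上")
  # returns the str unchanged, while printable_text(b"abc") decodes the bytes as
  # UTF-8 and returns "abc".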
168 | if six.PY3: 169 | if isinstance(text, str): 170 | return text 171 | elif isinstance(text, bytes): 172 | return six.ensure_text(text, "utf-8", "ignore") 173 | else: 174 | raise ValueError("Unsupported string type: %s" % (type(text))) 175 | elif six.PY2: 176 | if isinstance(text, str): 177 | return text 178 | elif isinstance(text, six.text_type): 179 | return six.ensure_binary(text, "utf-8") 180 | else: 181 | raise ValueError("Unsupported string type: %s" % (type(text))) 182 | else: 183 | raise ValueError("Not running on Python2 or Python 3?") 184 | 185 | 186 | def load_vocab(vocab_file): 187 | """Loads a vocabulary file into a dictionary.""" 188 | vocab = collections.OrderedDict() 189 | with tf.gfile.GFile(vocab_file, "r") as reader: 190 | while True: 191 | token = convert_to_unicode(reader.readline()) 192 | if not token: 193 | break 194 | token = token.strip()#.split()[0] 195 | if token not in vocab: 196 | vocab[token] = len(vocab) 197 | return vocab 198 | 199 | 200 | def convert_by_vocab(vocab, items): 201 | """Converts a sequence of [tokens|ids] using the vocab.""" 202 | output = [] 203 | for item in items: 204 | output.append(vocab[item]) 205 | return output 206 | 207 | 208 | def convert_tokens_to_ids(vocab, tokens): 209 | return convert_by_vocab(vocab, tokens) 210 | 211 | 212 | def convert_ids_to_tokens(inv_vocab, ids): 213 | return convert_by_vocab(inv_vocab, ids) 214 | 215 | 216 | def whitespace_tokenize(text): 217 | """Runs basic whitespace cleaning and splitting on a piece of text.""" 218 | text = text.strip() 219 | if not text: 220 | return [] 221 | tokens = text.split() 222 | return tokens 223 | 224 | 225 | class FullTokenizer(object): 226 | """Runs end-to-end tokenziation.""" 227 | 228 | def __init__(self, vocab_file, do_lower_case=True, spm_model_file=None): 229 | self.vocab = None 230 | self.sp_model = None 231 | if spm_model_file: 232 | self.sp_model = spm.SentencePieceProcessor() 233 | tf.logging.info("loading sentence piece model") 234 | self.sp_model.Load(spm_model_file) 235 | # Note(mingdachen): For the purpose of consisent API, we are 236 | # generating a vocabulary for the sentence piece tokenizer. 
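      # Note: in this project classifier_utils.py builds the tokenizer via
      # FullTokenizer.from_scratch(vocab_file, do_lower_case, spm_model_file=None),
      # so this SentencePiece branch is not taken and the WordPiece vocabulary
      # (vocab_chinese.txt) is loaded in the `else` branch below instead.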
237 | self.vocab = {self.sp_model.IdToPiece(i): i for i 238 | in range(self.sp_model.GetPieceSize())} 239 | else: 240 | self.vocab = load_vocab(vocab_file) 241 | self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case) 242 | self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) 243 | self.inv_vocab = {v: k for k, v in self.vocab.items()} 244 | 245 | @classmethod 246 | def from_scratch(cls, vocab_file, do_lower_case, spm_model_file): 247 | return FullTokenizer(vocab_file, do_lower_case, spm_model_file) 248 | 249 | # @classmethod 250 | # def from_hub_module(cls, hub_module, spm_model_file): 251 | # """Get the vocab file and casing info from the Hub module.""" 252 | # with tf.Graph().as_default(): 253 | # albert_module = hub.Module(hub_module) 254 | # tokenization_info = albert_module(signature="tokenization_info", 255 | # as_dict=True) 256 | # with tf.Session() as sess: 257 | # vocab_file, do_lower_case = sess.run( 258 | # [tokenization_info["vocab_file"], 259 | # tokenization_info["do_lower_case"]]) 260 | # return FullTokenizer( 261 | # vocab_file=vocab_file, do_lower_case=do_lower_case, 262 | # spm_model_file=spm_model_file) 263 | 264 | def tokenize(self, text): 265 | if self.sp_model: 266 | split_tokens = encode_pieces(self.sp_model, text, return_unicode=False) 267 | else: 268 | split_tokens = [] 269 | for token in self.basic_tokenizer.tokenize(text): 270 | for sub_token in self.wordpiece_tokenizer.tokenize(token): 271 | split_tokens.append(sub_token) 272 | 273 | return split_tokens 274 | 275 | def convert_tokens_to_ids(self, tokens): 276 | if self.sp_model: 277 | tf.logging.info("using sentence piece tokenzier.") 278 | return [self.sp_model.PieceToId( 279 | printable_text(token)) for token in tokens] 280 | else: 281 | return convert_by_vocab(self.vocab, tokens) 282 | 283 | def convert_ids_to_tokens(self, ids): 284 | if self.sp_model: 285 | tf.logging.info("using sentence piece tokenzier.") 286 | return [self.sp_model.IdToPiece(id_) for id_ in ids] 287 | else: 288 | return convert_by_vocab(self.inv_vocab, ids) 289 | 290 | 291 | class BasicTokenizer(object): 292 | """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" 293 | 294 | def __init__(self, do_lower_case=True): 295 | """Constructs a BasicTokenizer. 296 | 297 | Args: 298 | do_lower_case: Whether to lower case the input. 299 | """ 300 | self.do_lower_case = do_lower_case 301 | 302 | def tokenize(self, text): 303 | """Tokenizes a piece of text.""" 304 | text = convert_to_unicode(text) 305 | text = self._clean_text(text) 306 | 307 | # This was added on November 1st, 2018 for the multilingual and Chinese 308 | # models. This is also applied to the English models now, but it doesn't 309 | # matter since the English models were not trained on any Chinese data 310 | # and generally don't have any Chinese data in them (there are Chinese 311 | # characters in the vocabulary because Wikipedia does have some Chinese 312 | # words in the English Wikipedia.). 
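    # Illustrative effect: for an input such as "拍照效果:👍" this step pads every
    # CJK character with spaces, yielding the separate tokens 拍 / 照 / 效 / 果;
    # the emoji falls outside the CJK ranges checked in _is_chinese_char and is
    # left for the punctuation-splitting and WordPiece steps instead.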
313 | text = self._tokenize_chinese_chars(text) 314 | 315 | orig_tokens = whitespace_tokenize(text) 316 | split_tokens = [] 317 | for token in orig_tokens: 318 | if self.do_lower_case: 319 | token = token.lower() 320 | token = self._run_strip_accents(token) 321 | split_tokens.extend(self._run_split_on_punc(token)) 322 | 323 | output_tokens = whitespace_tokenize(" ".join(split_tokens)) 324 | return output_tokens 325 | 326 | def _run_strip_accents(self, text): 327 | """Strips accents from a piece of text.""" 328 | text = unicodedata.normalize("NFD", text) 329 | output = [] 330 | for char in text: 331 | cat = unicodedata.category(char) 332 | if cat == "Mn": 333 | continue 334 | output.append(char) 335 | return "".join(output) 336 | 337 | def _run_split_on_punc(self, text): 338 | """Splits punctuation on a piece of text.""" 339 | chars = list(text) 340 | i = 0 341 | start_new_word = True 342 | output = [] 343 | while i < len(chars): 344 | char = chars[i] 345 | if _is_punctuation(char): 346 | output.append([char]) 347 | start_new_word = True 348 | else: 349 | if start_new_word: 350 | output.append([]) 351 | start_new_word = False 352 | output[-1].append(char) 353 | i += 1 354 | 355 | return ["".join(x) for x in output] 356 | 357 | def _tokenize_chinese_chars(self, text): 358 | """Adds whitespace around any CJK character.""" 359 | output = [] 360 | for char in text: 361 | cp = ord(char) 362 | if self._is_chinese_char(cp): 363 | output.append(" ") 364 | output.append(char) 365 | output.append(" ") 366 | else: 367 | output.append(char) 368 | return "".join(output) 369 | 370 | def _is_chinese_char(self, cp): 371 | """Checks whether CP is the codepoint of a CJK character.""" 372 | # This defines a "chinese character" as anything in the CJK Unicode block: 373 | # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) 374 | # 375 | # Note that the CJK Unicode block is NOT all Japanese and Korean characters, 376 | # despite its name. The modern Korean Hangul alphabet is a different block, 377 | # as is Japanese Hiragana and Katakana. Those alphabets are used to write 378 | # space-separated words, so they are not treated specially and handled 379 | # like the all of the other languages. 380 | if ((cp >= 0x4E00 and cp <= 0x9FFF) or # 381 | (cp >= 0x3400 and cp <= 0x4DBF) or # 382 | (cp >= 0x20000 and cp <= 0x2A6DF) or # 383 | (cp >= 0x2A700 and cp <= 0x2B73F) or # 384 | (cp >= 0x2B740 and cp <= 0x2B81F) or # 385 | (cp >= 0x2B820 and cp <= 0x2CEAF) or 386 | (cp >= 0xF900 and cp <= 0xFAFF) or # 387 | (cp >= 0x2F800 and cp <= 0x2FA1F)): # 388 | return True 389 | 390 | return False 391 | 392 | def _clean_text(self, text): 393 | """Performs invalid character removal and whitespace cleanup on text.""" 394 | output = [] 395 | for char in text: 396 | cp = ord(char) 397 | if cp == 0 or cp == 0xfffd or _is_control(char): 398 | continue 399 | if _is_whitespace(char): 400 | output.append(" ") 401 | else: 402 | output.append(char) 403 | return "".join(output) 404 | 405 | 406 | class WordpieceTokenizer(object): 407 | """Runs WordPiece tokenziation.""" 408 | 409 | def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200): 410 | self.vocab = vocab 411 | self.unk_token = unk_token 412 | self.max_input_chars_per_word = max_input_chars_per_word 413 | 414 | def tokenize(self, text): 415 | """Tokenizes a piece of text into its word pieces. 416 | 417 | This uses a greedy longest-match-first algorithm to perform tokenization 418 | using the given vocabulary. 
419 | 420 | For example: 421 | input = "unaffable" 422 | output = ["un", "##aff", "##able"] 423 | 424 | Args: 425 | text: A single token or whitespace separated tokens. This should have 426 | already been passed through `BasicTokenizer. 427 | 428 | Returns: 429 | A list of wordpiece tokens. 430 | """ 431 | 432 | text = convert_to_unicode(text) 433 | 434 | output_tokens = [] 435 | for token in whitespace_tokenize(text): 436 | chars = list(token) 437 | if len(chars) > self.max_input_chars_per_word: 438 | output_tokens.append(self.unk_token) 439 | continue 440 | 441 | is_bad = False 442 | start = 0 443 | sub_tokens = [] 444 | while start < len(chars): 445 | end = len(chars) 446 | cur_substr = None 447 | while start < end: 448 | substr = "".join(chars[start:end]) 449 | if start > 0: 450 | substr = "##" + six.ensure_str(substr) 451 | if substr in self.vocab: 452 | cur_substr = substr 453 | break 454 | end -= 1 455 | if cur_substr is None: 456 | is_bad = True 457 | break 458 | sub_tokens.append(cur_substr) 459 | start = end 460 | 461 | if is_bad: 462 | output_tokens.append(self.unk_token) 463 | else: 464 | output_tokens.extend(sub_tokens) 465 | return output_tokens 466 | 467 | 468 | def _is_whitespace(char): 469 | """Checks whether `chars` is a whitespace character.""" 470 | # \t, \n, and \r are technically control characters but we treat them 471 | # as whitespace since they are generally considered as such. 472 | if char == " " or char == "\t" or char == "\n" or char == "\r": 473 | return True 474 | cat = unicodedata.category(char) 475 | if cat == "Zs": 476 | return True 477 | return False 478 | 479 | 480 | def _is_control(char): 481 | """Checks whether `chars` is a control character.""" 482 | # These are technically control characters but we count them as whitespace 483 | # characters. 484 | if char == "\t" or char == "\n" or char == "\r": 485 | return False 486 | cat = unicodedata.category(char) 487 | if cat in ("Cc", "Cf"): 488 | return True 489 | return False 490 | 491 | 492 | def _is_punctuation(char): 493 | """Checks whether `chars` is a punctuation character.""" 494 | cp = ord(char) 495 | # We treat all non-letter/number ASCII as punctuation. 496 | # Characters such as "^", "$", and "`" are not in the Unicode 497 | # Punctuation class but we treat them as punctuation anyways, for 498 | # consistency. 
499 | if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or 500 | (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)): 501 | return True 502 | cat = unicodedata.category(char) 503 | if cat.startswith("P"): 504 | return True 505 | return False 506 | -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/train.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Thu May 30 21:42:07 2020 4 | 5 | @author: cm 6 | """ 7 | 8 | 9 | 10 | import os 11 | #os.environ["CUDA_VISIBLE_DEVICES"] = '-1' 12 | import numpy as np 13 | import tensorflow as tf 14 | 15 | from sentiment_analysis_albert_emoji.classifier_utils import get_features 16 | from sentiment_analysis_albert_emoji.classifier_utils import get_features_test 17 | from sentiment_analysis_albert_emoji.networks import NetworkAlbert 18 | from sentiment_analysis_albert_emoji.hyperparameters import Hyperparamters as hp 19 | from sentiment_analysis_albert_emoji.utils import shuffle_one,select 20 | from sentiment_analysis_albert_emoji.utils import time_now_string 21 | from sentiment_analysis_albert_emoji.classifier_utils import get_features_emoji 22 | from sentiment_analysis_albert_emoji.classifier_utils import get_features_emoji_test 23 | 24 | 25 | 26 | # Load Model 27 | pwd = os.path.dirname(os.path.abspath(__file__)) 28 | MODEL = NetworkAlbert(is_training=True) 29 | 30 | # Get data features 31 | input_ids,input_masks,segment_ids,label_ids = get_features() 32 | input_ids_test,input_masks_test,segment_ids_test,label_ids_test = get_features_test() 33 | num_train_samples = len(input_ids) 34 | arr = np.arange(num_train_samples) 35 | num_batchs = int((num_train_samples - 1)/hp.batch_size) + 1 36 | print('number of batch:',num_batchs) 37 | ids_test = np.arange(len(input_ids_test)) 38 | 39 | 40 | # Get emoji features 41 | input_emojis = get_features_emoji() 42 | input_emojis_test = get_features_emoji_test() 43 | 44 | 45 | # Set up the graph 46 | saver = tf.train.Saver(max_to_keep=hp.max_to_keep) 47 | sess = tf.Session() 48 | sess.run(tf.global_variables_initializer()) 49 | 50 | # Load model saved before 51 | MODEL_SAVE_PATH = os.path.join(pwd, hp.file_save_model) 52 | ckpt = tf.train.get_checkpoint_state(MODEL_SAVE_PATH) 53 | if ckpt and ckpt.model_checkpoint_path: 54 | saver.restore(sess, ckpt.model_checkpoint_path) 55 | print('Restored model!') 56 | 57 | 58 | with sess.as_default(): 59 | # Tensorboard writer 60 | writer = tf.summary.FileWriter(hp.logdir, sess.graph) 61 | for i in range(hp.num_train_epochs): 62 | indexs = shuffle_one(arr) 63 | for j in range(num_batchs-1): 64 | i1 = indexs[j * hp.batch_size:min((j + 1) * hp.batch_size, num_train_samples)] 65 | 66 | # Get features 67 | input_id_ = select(input_ids,i1) 68 | input_mask_ = select(input_masks,i1) 69 | segment_id_ = select(segment_ids,i1) 70 | label_id_ = select(label_ids,i1) 71 | 72 | # Get features emoji 73 | input_emoji_ = select(input_emojis,i1) 74 | 75 | # Feed dict 76 | fd = {MODEL.input_ids: input_id_, 77 | MODEL.input_masks: input_mask_, 78 | MODEL.segment_ids:segment_id_, 79 | MODEL.label_ids:label_id_, 80 | MODEL.input_emojis:input_emoji_} 81 | 82 | # Optimizer 83 | sess.run(MODEL.optimizer, feed_dict = fd) 84 | 85 | # Tensorboard 86 | if j%hp.summary_step==0: 87 | summary,glolal_step = sess.run([MODEL.merged,MODEL.global_step], feed_dict = fd) 88 | writer.add_summary(summary, glolal_step) 89 | 90 | # Save Model 91 | if 
j%(num_batchs//hp.num_saved_per_epoch)==0: 92 | if not os.path.exists(os.path.join(pwd, hp.file_save_model)): 93 | os.makedirs(os.path.join(pwd, hp.file_save_model)) 94 | saver.save(sess, os.path.join(pwd, hp.file_save_model, 'model_%s_%s.ckpt'%(str(i),str(j)))) 95 | 96 | # Log 97 | if j % hp.print_step == 0: 98 | loss = sess.run(MODEL.loss, feed_dict = fd) 99 | print('Time:%s, Epoch:%s, Batch number:%s/%s, Loss:%s'%(time_now_string(),str(i),str(j),str(num_batchs),str(loss))) 100 | # Loss of Test data 101 | indexs_test = shuffle_one(ids_test)[:hp.batch_size_eval] 102 | input_id_test = select(input_ids_test,indexs_test) 103 | input_mask_test = select(input_masks_test,indexs_test) 104 | segment_id_test = select(segment_ids_test,indexs_test) 105 | label_id_test = select(label_ids_test,indexs_test) 106 | 107 | # Get features emoji 108 | input_emoji_test = select(input_emojis_test,indexs_test) 109 | 110 | fd_test = {MODEL.input_ids:input_id_test, 111 | MODEL.input_masks:input_mask_test , 112 | MODEL.segment_ids:segment_id_test, 113 | MODEL.label_ids:label_id_test, 114 | MODEL.input_emojis:input_emoji_test} 115 | loss = sess.run(MODEL.loss, feed_dict = fd_test) 116 | print('Time:%s, Epoch:%s, Batch number:%s/%s, Loss(test):%s'%(time_now_string(),str(i),str(j),str(num_batchs),str(loss))) 117 | print('Optimization finished') 118 | 119 | 120 | 121 | 122 | 123 | 124 | -------------------------------------------------------------------------------- /sentiment_analysis_albert_emoji/utils.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Thu May 29 20:40:40 2020 4 | 5 | @author: cm 6 | """ 7 | 8 | 9 | 10 | import time 11 | import emoji 12 | import pandas as pd 13 | import numpy as np 14 | 15 | 16 | 17 | def time_now_string(): 18 | return time.strftime("%Y-%m-%d %H:%M:%S",time.localtime( time.time() )) 19 | 20 | 21 | def get_emoji(sentence): 22 | emoji_list = emoji.emoji_lis(sentence) 23 | return ''.join([l['emoji'] for l in emoji_list]) 24 | 25 | 26 | def load_csv(file,header=0,encoding="utf-8"): 27 | return pd.read_csv(file, 28 | encoding=encoding, 29 | header=header, 30 | error_bad_lines=False) 31 | 32 | 33 | def save_csv(dataframe,file,header=True,index=None,encoding="utf-8"): 34 | return dataframe.to_csv(file, 35 | mode='w+', 36 | header=header, 37 | index=index, 38 | encoding=encoding) 39 | 40 | 41 | def save_excel(dataframe,file,header=True,sheetname='Sheet1'): 42 | return dataframe.to_excel(file, 43 | header=header, 44 | sheet_name=sheetname) 45 | 46 | 47 | 48 | def load_excel(file,header=0,sheetname=None): 49 | dfs = pd.read_excel(file, 50 | header=header, 51 | sheet_name=sheetname) 52 | sheet_names = list(dfs.keys()) 53 | print('Name of first sheet:',sheet_names[0]) 54 | df = dfs[sheet_names[0]] 55 | print('Load excel data finished!') 56 | return df.fillna("") 57 | 58 | 59 | def load_txt(file): 60 | with open(file,encoding='utf-8',errors='ignore') as fp: 61 | lines = fp.readlines() 62 | lines = [l.strip() for l in lines] 63 | return lines 64 | 65 | 66 | def save_txt(file,lines): 67 | lines = [l+'\n' for l in lines] 68 | with open(file,'w+',encoding='utf-8') as fp:#a+添加 69 | fp.writelines(lines) 70 | 71 | 72 | def shuffle_two(a1,a2): 73 | """ 74 | 随机打乱a1和a2两个 75 | """ 76 | ran = np.arange(len(a1)) 77 | np.random.shuffle(ran) 78 | a1_ = [a1[l] for l in ran] 79 | a2_ = [a2[l] for l in ran] 80 | return a1_, a2_ 81 | 82 | 83 | def load_vocabulary(file_vocabulary_label): 84 | """ 85 | Load vocabulary to dict 86 | 
""" 87 | vocabulary = load_txt(file_vocabulary_label) 88 | dict_id2char,dict_char2id = {},{} 89 | for i,l in enumerate(vocabulary): 90 | dict_id2char[i] = str(l) 91 | dict_char2id[str(l)] = i 92 | return dict_id2char,dict_char2id 93 | 94 | 95 | def get_word_sequence(words,vocabulary,Reverse=True,k=1000): 96 | """ 97 | words: a list of word or string 98 | """ 99 | words = [l.lower() for l in words] 100 | dic = {} 101 | for word in words: 102 | if word not in dic: 103 | dic[word] = 1 104 | else: 105 | dic[word] = dic[word] + 1 106 | return sorted(dic.items(),key = lambda x:x[0],reverse = Reverse)[:k] 107 | 108 | 109 | def select(data,ids): 110 | return [data[i] for i in ids] 111 | 112 | 113 | def shuffle_one(a1): 114 | ran = np.arange(len(a1)) 115 | np.random.shuffle(ran) 116 | return np.array(a1)[ran].tolist() 117 | 118 | 119 | def cut_list(data,size): 120 | """ 121 | data: a list 122 | size: the size of cut 123 | """ 124 | return [data[i * size:min((i + 1) * size, len(data))] for i in range(int(len(data)-1)//size + 1)] 125 | 126 | 127 | def cut_list_by_size(data,lengths): 128 | """ 129 | data: a list 130 | lengths: the different sizes of cut 131 | """ 132 | list_block = [] 133 | for l in lengths: 134 | list_block.append(data[:l]) 135 | data = data[l:] 136 | return list_block 137 | 138 | 139 | 140 | if __name__ == "__main__": 141 | print(time_now_string()) 142 | 143 | # 144 | string = '☕️🥂' 145 | print(get_emoji(string)) 146 | # 147 | file_vocab_emoji ='albert_base_zh/vocab_emoji.txt' 148 | vocab = load_txt(file_vocab_emoji) 149 | print(vocab[-3:]) 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | -------------------------------------------------------------------------------- /sentiment_analysis_bayes/README.md: -------------------------------------------------------------------------------- 1 | 2 | # Sentiment analysis bayes 3 | 4 | 1、vocabulary_pearson_40000.txt是通过pearson计算词汇和情感的相关性,得到的TOP40000个词汇,并做了排序。 5 | 2、训练中取了对情感影响最的前2000个词。 6 | 7 | -------------------------------------------------------------------------------- /sentiment_analysis_bayes/bayes.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Thu Jun 7 13:40:26 2018 4 | 5 | @author: cm 6 | """ 7 | 8 | import os 9 | import numpy as np 10 | from sentiment_analysis_bayes.utils import load_txt, save_txt 11 | from sentiment_analysis_bayes.hyperparameters import Hyperparamters as hp 12 | 13 | pwd = os.path.dirname(os.path.abspath(__file__)) 14 | 15 | 16 | def classify(vec2classify, p0, p1, class_): 17 | """ 18 | Classifier function of Bayes. 
19 | """ 20 | p1 = sum(vec2classify * p1) + np.log(class_) 21 | p0 = sum(vec2classify * p0) + np.log(1 - class_) 22 | if p1 > p0: 23 | return 1 24 | else: 25 | return 0 26 | 27 | 28 | def load_model(): 29 | """ 30 | Load bayes parameters 31 | """ 32 | p0 = np.array([float(l) for l in load_txt(hp.file_p0)]) 33 | p1 = np.array([float(l) for l in load_txt(hp.file_p1)]) 34 | class_ = float(load_txt(hp.file_class)[0]) 35 | return p0, p1, class_ 36 | 37 | 38 | def save_model(p0, p1, class_, file_p0, file_p1, file_class): 39 | """ 40 | Save bayes parameters 41 | """ 42 | save_txt(file_p0, [str(l) for l in p0]) 43 | save_txt(file_p1, [str(l) for l in p1]) 44 | save_txt(file_class, [str(class_)]) 45 | print('Save model finished!') 46 | 47 | 48 | if __name__ == '__main__': 49 | # Predict 50 | print('我爱武汉') 51 | -------------------------------------------------------------------------------- /sentiment_analysis_bayes/data/test.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/sentiment_analysis_bayes/data/test.zip -------------------------------------------------------------------------------- /sentiment_analysis_bayes/data/test_feature.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/sentiment_analysis_bayes/data/test_feature.zip -------------------------------------------------------------------------------- /sentiment_analysis_bayes/data/test_label.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/sentiment_analysis_bayes/data/test_label.zip -------------------------------------------------------------------------------- /sentiment_analysis_bayes/data/train.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/sentiment_analysis_bayes/data/train.zip -------------------------------------------------------------------------------- /sentiment_analysis_bayes/data/train_feature.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/sentiment_analysis_bayes/data/train_feature.zip -------------------------------------------------------------------------------- /sentiment_analysis_bayes/data/train_label.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hellonlp/sentiment-analysis/5813e3b5ec08cb0da0bfaf1ba2cc4d5752b2ff83/sentiment_analysis_bayes/data/train_label.zip -------------------------------------------------------------------------------- /sentiment_analysis_bayes/dict/stopwords.txt: -------------------------------------------------------------------------------- 1 | ' 2 | ' 3 | 4 | ' 5 | : 6 | ) 7 | , 8 | . 9 | : 10 | ; 11 | ] 12 | } 13 | ¢ 14 | ' 15 | " 16 | 、 17 | 。 18 | 〉 19 | 》 20 | 」 21 | 』 22 | 】 23 | 〕 24 | 〗 25 | 〞 26 | ︰ 27 | ︱ 28 | ︳ 29 | ﹐ 30 | 、 31 | ﹒ 32 | ﹔ 33 | ﹕ 34 | ﹚ 35 | ﹜ 36 | ﹞ 37 | ) 38 | , 39 | . 
40 | : 41 | ; 42 | | 43 | } 44 | ︴ 45 | ︶ 46 | ︸ 47 | ︺ 48 | ︼ 49 | ︾ 50 | ﹀ 51 | ﹂ 52 | ﹄ 53 | ﹏ 54 | 、 55 | ~ 56 | ¢ 57 | 々 58 | ‖ 59 | • 60 | · 61 | ˇ 62 | 63 | ′ 64 | ’ 65 | ” 66 | ( 67 | [ 68 | { 69 | 70 | £¥ 71 | ' 72 | " 73 | ‵ 74 | 〈 75 | 《 76 | 77 | 「 78 | 『 79 | 【 80 | 〔 81 | 〖 82 | ( 83 | [ 84 | { 85 | £ 86 | ¥ 87 | 〝 88 | ︵ 89 | ︷ 90 | ︹ 91 | ︻ 92 | ︽ 93 | ︿ 94 | ﹁ 95 | ﹃ 96 | ﹙ 97 | ﹛ 98 | ﹝ 99 | ( 100 | { 101 | “ 102 | ‘ 103 | … 104 | ' 105 | ' 106 | ' 107 | \ 108 | / 109 | / 110 | × 111 | Π 112 | Δ 113 | -------------------------------------------------------------------------------- /sentiment_analysis_bayes/hyperparameters.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Jun 22 10:42:41 2020 4 | 5 | @author: cm 6 | """ 7 | 8 | import os 9 | 10 | pwd = os.path.dirname(os.path.abspath(__file__)) 11 | 12 | 13 | class Hyperparamters: 14 | # Parameters 15 | feature_size = 2000 16 | 17 | # Stopwords 18 | file_stopwords = os.path.join(pwd, 'dict/stopwords.txt') 19 | file_vocabulary = os.path.join(pwd, 'dict/vocabulary_pearson_40000.txt') 20 | 21 | # Train data 22 | file_train_data = os.path.join(pwd, 'data/train.csv') 23 | file_test_data = os.path.join(pwd, 'data/test.csv') 24 | # 25 | file_train_feature = os.path.join(pwd, 'data/train_feature.txt') 26 | file_train_label = os.path.join(pwd, 'data/train_label.txt') 27 | # 28 | file_test_feature = os.path.join(pwd, 'data/test_feature.txt') 29 | file_test_label = os.path.join(pwd, 'data/test_label.txt') 30 | 31 | # model file 32 | file_p0 = os.path.join(pwd, 'model/p0.txt') 33 | file_p1 = os.path.join(pwd, 'model/p1.txt') 34 | file_class = os.path.join(pwd, 'model/class.txt') 35 | -------------------------------------------------------------------------------- /sentiment_analysis_bayes/load.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Jul 21 16:12:10 2020 4 | 5 | @author: cm 6 | """ 7 | 8 | import numpy as np 9 | 10 | from sentiment_analysis_bayes.utils import load_txt, drop_stopwords 11 | from sentiment_analysis_bayes.hyperparameters import Hyperparamters as hp 12 | 13 | # Load data 14 | vocabulary = [str(w.replace('\n', '')) for w in load_txt(hp.file_vocabulary)][:hp.feature_size] 15 | stopwords = set(load_txt(hp.file_stopwords)) 16 | 17 | 18 | def get_sentence_feature(sentence): 19 | """ 20 | Transform a sentence to one-hot vector. 21 | """ 22 | words = drop_stopwords(sentence, stopwords) 23 | return [int(words.count(w)) for w in vocabulary] 24 | 25 | 26 | def load_label(file_train_label): 27 | """ 28 | Load data label. 29 | """ 30 | return np.array([int(line) for line in load_txt(file_train_label)]) 31 | 32 | 33 | def load_feature(file_train_feature): 34 | """ 35 | Load data one-hot feature. 
36 | """ 37 | return np.array([eval(line) for line in load_txt(file_train_feature)]) 38 | 39 | 40 | if __name__ == '__main__': 41 | # 42 | train_label = load_label(hp.file_train_label) 43 | print(train_label[:5]) 44 | # 45 | train_feature = load_feature(hp.file_test_feature) 46 | print(train_feature[0][:20]) 47 | -------------------------------------------------------------------------------- /sentiment_analysis_bayes/model/class.txt: -------------------------------------------------------------------------------- 1 | 0.5214 2 | -------------------------------------------------------------------------------- /sentiment_analysis_bayes/predict.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon May 13 10:46:08 2019 4 | 5 | @author: cm 6 | """ 7 | 8 | import os 9 | import sys 10 | 11 | pwd = os.path.dirname(os.path.abspath(__file__)) 12 | sys.path.append(pwd) 13 | from sentiment_analysis_bayes.bayes import classify 14 | from sentiment_analysis_bayes.bayes import load_model 15 | from sentiment_analysis_bayes.load import get_sentence_feature 16 | 17 | p0Vec, p1Vec, pClass1 = load_model() 18 | 19 | 20 | def sa(sentence): 21 | """ 22 | Predict a sentence's sentiment. 23 | """ 24 | vector = get_sentence_feature(sentence) 25 | point = classify(vector, p0Vec, p1Vec, pClass1) 26 | if point == 1: 27 | return 'Positif' 28 | elif point == 0: 29 | return 'Negitif' 30 | 31 | 32 | if __name__ == '__main__': 33 | # Test 34 | content = '我喜欢武汉' 35 | content = '我讨厌武汉' 36 | print(sa(content)) 37 | -------------------------------------------------------------------------------- /sentiment_analysis_bayes/prepare.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Jul 21 16:38:31 2020 4 | 5 | @author: cm 6 | """ 7 | 8 | from sentiment_analysis_bayes.utils import drop_stopwords 9 | from sentiment_analysis_bayes.utils import load_txt, load_csv, save_txt 10 | from sentiment_analysis_bayes.hyperparameters import Hyperparamters as hp 11 | 12 | # Load data 13 | vocabulary = [str(w.replace('\n', '')) for w in load_txt(hp.file_vocabulary)][:hp.feature_size] 14 | stopwords = set(load_txt(hp.file_stopwords)) 15 | 16 | 17 | def sentence2feature(sentence): 18 | """ 19 | Transform a sentence to a One-hot vector. 
20 | """ 21 | words = drop_stopwords(sentence, stopwords) 22 | return [words.count(w) for w in vocabulary] 23 | 24 | 25 | if __name__ == '__main__': 26 | # Train data 27 | df = load_csv(hp.file_train_data) 28 | contents = df['content'].tolist() 29 | labels = df['label'].tolist() 30 | train_features = [str(sentence2feature(l)) for l in contents] 31 | save_txt('data/train_feature.txt', train_features) 32 | train_labels = [str(l) for l in labels] 33 | save_txt('data/train_label.txt', train_labels) 34 | 35 | # Test data 36 | df = load_csv(hp.file_test_data) 37 | contents = df['content'].tolist() 38 | labels = df['label'].tolist() 39 | test_features = [str(sentence2feature(l)) for l in contents] 40 | save_txt('data/test_feature.txt', test_features) 41 | test_labels = [str(l) for l in labels] 42 | save_txt('data/test_label.txt', test_labels) 43 | -------------------------------------------------------------------------------- /sentiment_analysis_bayes/train.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Jul 21 15:50:55 2020 4 | 5 | @author: cm 6 | """ 7 | 8 | import numpy as np 9 | from sentiment_analysis_bayes.bayes import save_model 10 | from sentiment_analysis_bayes.load import load_feature, load_label 11 | from sentiment_analysis_bayes.hyperparameters import Hyperparamters as hp 12 | 13 | 14 | def main(x, y): 15 | """ 16 | Training 17 | """ 18 | numTrainDocs = len(x) 19 | numWords = len(x[0]) 20 | pAbusive = sum(y) / float(numTrainDocs) 21 | p0Num, p1Num = np.ones(numWords), np.ones(numWords) 22 | p0Deom, p1Deom = 2, 2 23 | for i in range(numTrainDocs): 24 | if y[i] == 1: 25 | p1Num = p1Num + x[i] 26 | p1Deom = p1Deom + sum(x[i]) 27 | else: 28 | p0Num = p0Num + x[i] 29 | p0Deom = p0Deom + sum(x[i]) 30 | if i % 100 == 0: 31 | print(i) 32 | p1Vect = p1Num / p1Deom 33 | p0Vect = p0Num / p0Deom 34 | p1VectLog = np.zeros(len(p1Vect)) 35 | for i in range(len(p1Vect)): 36 | p1VectLog[i] = np.log(p1Vect[i]) 37 | p0VectLog = np.zeros(len(p0Vect)) 38 | for i in range(len(p0Vect)): 39 | p0VectLog[i] = np.log(p0Vect[i]) 40 | return p0VectLog, p1VectLog, pAbusive 41 | 42 | 43 | if __name__ == '__main__': 44 | # Train 45 | train_data, train_label = load_feature(hp.file_train_feature), load_label(hp.file_train_label) 46 | p0, p1, class_ = main(train_data, train_label) 47 | # Save model 48 | f1 = 'model/p0.txt' 49 | f2 = 'model/p1.txt' 50 | f3 = 'model/class.txt' 51 | save_model(p0, p1, class_, f1, f2, f3) 52 | -------------------------------------------------------------------------------- /sentiment_analysis_bayes/utils.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Jul 21 15:45:17 2020 4 | 5 | @author: cm 6 | """ 7 | 8 | import time 9 | import jieba 10 | import numpy as np 11 | import pandas as pd 12 | 13 | 14 | def time_now_string(): 15 | """ 16 | Time now. 17 | """ 18 | return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time())) 19 | 20 | 21 | def cut_list(data, size): 22 | """ 23 | data: a list 24 | size: the size of cut 25 | """ 26 | return [data[i * size:min((i + 1) * size, len(data))] for i in range(int(len(data) - 1) // size + 1)] 27 | 28 | 29 | def load_txt(file): 30 | """ 31 | load a txt. 
32 | """ 33 | with open(file, encoding='utf-8', errors='ignore') as fp: 34 | lines = fp.readlines() 35 | lines = [l.strip() for l in lines] 36 | return lines 37 | 38 | 39 | def save_txt(file, lines): 40 | """ 41 | Save a txt. 42 | """ 43 | lines = [l + '\n' for l in lines] 44 | with open(file, 'w+', encoding='utf-8') as fp: 45 | fp.writelines(lines) 46 | 47 | 48 | def drop_stopwords(sentence, stopwords): 49 | """ 50 | Delete stopwords that we don't need. 51 | """ 52 | return [l for l in jieba.lcut(str(sentence)) if l not in stopwords] 53 | 54 | 55 | def load_csv(file, header=0, encoding="utf-8-sig"): 56 | """ 57 | Load a Data-frame from a csv. 58 | """ 59 | return pd.read_csv(file, 60 | encoding=encoding, 61 | header=header, 62 | error_bad_lines=False) 63 | 64 | 65 | def save_csv(dataframe, file, header=True, index=None, encoding="utf-8-sig"): 66 | """ 67 | Save a Data-frame by a csv. 68 | """ 69 | return dataframe.to_csv(file, 70 | mode='w+', 71 | header=header, 72 | index=index, 73 | encoding=encoding) 74 | 75 | 76 | def shuffle_two(a1, a2): 77 | """ 78 | Shuffle two lists by the same index. 79 | """ 80 | ran = np.arange(len(a1)) 81 | np.random.shuffle(ran) 82 | return [a1[l] for l in ran], [a2[l] for l in ran] 83 | -------------------------------------------------------------------------------- /sentiment_analysis_dict/README.md: -------------------------------------------------------------------------------- 1 | # 简介 2 | 1、本项目是在python3.6基础上做的。 3 | 2、本项目为中文的文本情感分析,为多文本分类,一共3个标签:1、0、-1,分别表示正面、中面和负面的情感。 4 | 3、欢迎大家联系我 www.hellonlp.com 5 | 6 | # 使用方法 7 | 1、预测 8 | python preidict.py 9 | 10 | # 知乎代码解读 11 | https://zhuanlan.zhihu.com/p/142011031 12 | -------------------------------------------------------------------------------- /sentiment_analysis_dict/dict/insufficiently.txt: -------------------------------------------------------------------------------- 1 | 半点 2 | 不大 3 | 不丁点儿 4 | 不甚 5 | 不怎么 6 | 聊 7 | 没怎么 8 | 轻度 9 | 弱 10 | 丝毫 11 | 微 12 | 相对 13 | 不那么 14 | 不是那么 15 | -------------------------------------------------------------------------------- /sentiment_analysis_dict/dict/inverse.txt: -------------------------------------------------------------------------------- 1 | 不 2 | 不是 3 | 没 4 | 没有 5 | 无 6 | 非 7 | 莫 8 | 弗 9 | 毋 10 | 未 11 | 否 12 | 别 13 | 无 14 | 不够 15 | 不是 16 | 不曾 17 | 未必 18 | 不要 19 | 难以 20 | 未曾 -------------------------------------------------------------------------------- /sentiment_analysis_dict/dict/ish.txt: -------------------------------------------------------------------------------- 1 | 点点滴滴 2 | 多多少少 3 | 怪 4 | 好生 5 | 还 6 | 或多或少 7 | 略 8 | 略加 9 | 略略 10 | 略微 11 | 略为 12 | 蛮 13 | 稍 14 | 稍稍 15 | 稍微 16 | 稍为 17 | 稍许 18 | 挺 19 | 未免 20 | 相当 21 | 些 22 | 些微 23 | 些小 24 | 一点 25 | 一点儿 26 | 一些 27 | 有点 28 | 有点儿 29 | 有些 30 | -------------------------------------------------------------------------------- /sentiment_analysis_dict/dict/more.txt: -------------------------------------------------------------------------------- 1 | 大不了 2 | 多 3 | 更 4 | 更加 5 | 更进一步 6 | 更为 7 | 还 8 | 还要 9 | 较 10 | 较比 11 | 较为 12 | 进一步 13 | 那般 14 | 那么 15 | 那样 16 | 强 17 | 如斯 18 | 益 19 | 益发 20 | 尤甚 21 | 逾 22 | 愈 23 | 愈 ... 愈 24 | 愈发 25 | 愈加 26 | 愈来愈 27 | 愈益 28 | 远远 29 | 越 ... 
越 30 | 越发 31 | 越加 32 | 越来越 33 | 越是 34 | 这般 35 | 这样 36 | 足 37 | 足足 38 | -------------------------------------------------------------------------------- /sentiment_analysis_dict/dict/most.txt: -------------------------------------------------------------------------------- 1 | 百分之百 2 | 倍加 3 | 备至 4 | 不得了 5 | 不堪 6 | 不可开交 7 | 不亦乐乎 8 | 不折不扣 9 | 彻头彻尾 10 | 充分 11 | 到头 12 | 地地道道 13 | 极 14 | 极度 15 | 极其 16 | 极为 17 | 截然 18 | 尽 19 | 惊人地 20 | 绝 21 | 绝顶 22 | 绝对 23 | 绝对化 24 | 刻骨 25 | 酷 26 | 满 27 | 满贯 28 | 满心 29 | 莫大 30 | 奇 31 | 入骨 32 | 甚为 33 | 十二分 34 | 十分 35 | 十足 36 | 滔天 37 | 透 38 | 完全 39 | 完完全全 40 | 万 41 | 万般 42 | 万分 43 | 万万 44 | 无比 45 | 无度 46 | 无可估量 47 | 无以复加 48 | 无以伦比 49 | 要命 50 | 要死 51 | 已极 52 | 已甚 53 | 异常 54 | 逾常 55 | 贼 56 | 之极 57 | 之至 58 | 至极 59 | 卓绝 60 | 最为 61 | 佼佼 62 | 最 63 | 相当 64 | 非常 65 | 超级 66 | -------------------------------------------------------------------------------- /sentiment_analysis_dict/dict/not.txt: -------------------------------------------------------------------------------- 1 | 不 2 | 没 3 | 无 4 | 非 5 | 莫 6 | 弗 7 | 勿 8 | 毋 9 | 未 10 | 否 11 | 别 12 | 無 13 | 休 14 | 难道 15 | -------------------------------------------------------------------------------- /sentiment_analysis_dict/dict/over.txt: -------------------------------------------------------------------------------- 1 | 不为过 2 | 超 3 | 超额 4 | 超外差 5 | 超微结构 6 | 超物质 7 | 出头 8 | 多 9 | 浮 10 | 过 11 | 过度 12 | 过分 13 | 过火 14 | 过劲 15 | 过了头 16 | 过猛 17 | 过热 18 | 过甚 19 | 过头 20 | 过于 21 | 过逾 22 | 何止 23 | 何啻 24 | 开外 25 | 苦 26 | 老 27 | 偏 28 | 强 29 | 溢 30 | 忒 31 | 极端 -------------------------------------------------------------------------------- /sentiment_analysis_dict/dict/ponctuation_sentiment.txt: -------------------------------------------------------------------------------- 1 | ' 2 | ' 3 | 4 | ' 5 | : 6 | ) 7 | , 8 | . 9 | : 10 | ; 11 | ] 12 | } 13 | ¢ 14 | ' 15 | " 16 | 、 17 | 。 18 | 〉 19 | 》 20 | 」 21 | 』 22 | 】 23 | 〕 24 | 〗 25 | 〞 26 | ︰ 27 | ︱ 28 | ︳ 29 | ﹐ 30 | 、 31 | ﹒ 32 | ﹔ 33 | ﹕ 34 | ﹚ 35 | ﹜ 36 | ﹞ 37 | ) 38 | , 39 | . 
40 | : 41 | ; 42 | | 43 | } 44 | ︴ 45 | ︶ 46 | ︸ 47 | ︺ 48 | ︼ 49 | ︾ 50 | ﹀ 51 | ﹂ 52 | ﹄ 53 | ﹏ 54 | 、 55 | ~ 56 | ¢ 57 | 々 58 | ‖ 59 | • 60 | · 61 | ˇ 62 | 63 | ′ 64 | ’ 65 | ” 66 | ( 67 | [ 68 | { 69 | 70 | £¥ 71 | ' 72 | " 73 | ‵ 74 | 〈 75 | 《 76 | 77 | 「 78 | 『 79 | 【 80 | 〔 81 | 〖 82 | ( 83 | [ 84 | { 85 | £ 86 | ¥ 87 | 〝 88 | ︵ 89 | ︷ 90 | ︹ 91 | ︻ 92 | ︽ 93 | ︿ 94 | ﹁ 95 | ﹃ 96 | ﹙ 97 | ﹛ 98 | ﹝ 99 | ( 100 | { 101 | “ 102 | ‘ 103 | … 104 | ' 105 | ' 106 | ' 107 | \ 108 | / 109 | / 110 | × 111 | Π 112 | Δ 113 | -------------------------------------------------------------------------------- /sentiment_analysis_dict/dict/very.txt: -------------------------------------------------------------------------------- 1 | 不过 2 | 不少 3 | 不胜 4 | 惨 5 | 沉 6 | 沉沉 7 | 出奇 8 | 大为 9 | 多 10 | 多多 11 | 多加 12 | 多么 13 | 分外 14 | 格外 15 | 够瞧的 16 | 够戗 17 | 好 18 | 好不 19 | 何等 20 | 很多 21 | 很是 22 | 很 23 | 坏 24 | 可 25 | 老 26 | 老大 27 | 良 28 | 颇 29 | 颇为 30 | 甚 31 | 实在 32 | 太甚 33 | 特 34 | 特别 35 | 尤 36 | 尤其 37 | 尤为 38 | 尤以 39 | 远 40 | 着实 41 | 曷 42 | 碜 -------------------------------------------------------------------------------- /sentiment_analysis_dict/hyperparameters.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Jan 6 20:44:08 2020 4 | 5 | @author: cm 6 | """ 7 | 8 | import os 9 | from sentiment_analysis_dict.utils import ToolGeneral 10 | 11 | 12 | pwd = os.path.dirname(os.path.abspath(__file__)) 13 | tool = ToolGeneral() 14 | 15 | 16 | class Hyperparams: 17 | '''Hyper parameters''' 18 | # Load sentiment dictionary 19 | deny_word = tool.load_dict(os.path.join(pwd,'dict','not.txt')) 20 | posdict = tool.load_dict(os.path.join(pwd,'dict','positive.txt')) 21 | negdict = tool.load_dict(os.path.join(pwd,'dict', 'negative.txt')) 22 | pos_neg_dict = posdict|negdict 23 | # Load adverb dictionary 24 | mostdict = tool.load_dict(os.path.join(pwd,'dict','most.txt')) 25 | verydict = tool.load_dict(os.path.join(pwd,'dict','very.txt')) 26 | moredict = tool.load_dict(os.path.join(pwd,'dict','more.txt')) 27 | ishdict = tool.load_dict(os.path.join(pwd,'dict','ish.txt')) 28 | insufficientlydict = tool.load_dict(os.path.join(pwd,'dict','insufficiently.txt')) 29 | overdict = tool.load_dict(os.path.join(pwd,'dict','over.txt')) 30 | inversedict = tool.load_dict(os.path.join(pwd,'dict','inverse.txt')) 31 | 32 | 33 | 34 | 35 | -------------------------------------------------------------------------------- /sentiment_analysis_dict/networks.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Wed Oct 25 11:48:24 2017 4 | 5 | @author: cm 6 | """ 7 | 8 | import os 9 | import sys 10 | import jieba 11 | import numpy as np 12 | sys.path.append(os.path.dirname(os.path.dirname(__file__))) 13 | 14 | from sentiment_analysis_dict.utils import ToolGeneral 15 | from sentiment_analysis_dict.hyperparameters import Hyperparams as hp 16 | 17 | 18 | tool = ToolGeneral() 19 | jieba.load_userdict(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'dict','jieba_sentiment.txt')) 20 | 21 | 22 | class SentimentAnalysis(): 23 | """ 24 | Sentiment Analysis with some dictionarys 25 | """ 26 | def sentiment_score_list(self,dataset): 27 | seg_sentence = tool.sentence_split_regex(dataset) 28 | count1,count2 = [],[] 29 | for sentence in seg_sentence: 30 | words = jieba.lcut(sentence, cut_all=False) 31 | i = 0 32 | a = 0 33 | for word in words: 34 | """ 35 | poscount 积极词的第一次分值; 36 | 
poscount2 积极反转后的分值; 37 | poscount3 积极词的最后分值(包括叹号的分值) 38 | """ 39 | poscount,negcount,poscount2,negcount2,poscount3,negcount3 = 0,0,0,0,0,0 40 | if word in hp.posdict : 41 | if word in ['好','真','实在'] and words[min(i+1,len(words)-1)] in hp.pos_neg_dict and words[min(i+1,len(words)-1)] != word: 42 | continue 43 | else: 44 | poscount +=1 45 | c = 0 46 | for w in words[a:i]: # 扫描情感词前的程度词 47 | if w in hp.mostdict: 48 | poscount *= 4 49 | elif w in hp.verydict: 50 | poscount *= 3 51 | elif w in hp.moredict: 52 | poscount *= 2 53 | elif w in hp.ishdict: 54 | poscount *= 0.5 55 | elif w in hp.insufficientlydict: 56 | poscount *= -0.3 57 | elif w in hp.overdict: 58 | poscount *= -0.5 59 | elif w in hp.inversedict: 60 | c+= 1 61 | else: 62 | poscount *= 1 63 | if tool.is_odd(c) == 'odd': # 扫描情感词前的否定词数 64 | poscount *= -1.0 65 | poscount2 += poscount 66 | poscount = 0 67 | poscount3 = poscount + poscount2 + poscount3 68 | poscount2 = 0 69 | else: 70 | poscount3 = poscount + poscount2 + poscount3 71 | poscount = 0 72 | a = i+1 73 | elif word in hp.negdict: # 消极情感的分析,与上面一致 74 | if word in ['好','真','实在'] and words[min(i+1,len(words)-1)] in hp.pos_neg_dict and words[min(i+1,len(words)-1)] != word: 75 | continue 76 | else: 77 | negcount += 1 78 | d = 0 79 | for w in words[a:i]: 80 | if w in hp.mostdict: 81 | negcount *= 4 82 | elif w in hp.verydict: 83 | negcount *= 3 84 | elif w in hp.moredict: 85 | negcount *= 2 86 | elif w in hp.ishdict: 87 | negcount *= 0.5 88 | elif w in hp.insufficientlydict: 89 | negcount *= -0.3 90 | elif w in hp.overdict: 91 | negcount *= -0.5 92 | elif w in hp.inversedict: 93 | d += 1 94 | else: 95 | negcount *= 1 96 | if tool.is_odd(d) == 'odd': 97 | negcount *= -1.0 98 | negcount2 += negcount 99 | negcount = 0 100 | negcount3 = negcount + negcount2 + negcount3 101 | negcount2 = 0 102 | else: 103 | negcount3 = negcount + negcount2 + negcount3 104 | negcount = 0 105 | a = i + 1 106 | i += 1 107 | pos_count = poscount3 108 | neg_count = negcount3 109 | count1.append([pos_count,neg_count]) 110 | if words[-1] in ['!','!']:# 扫描感叹号前的情感词,发现后权值*2 111 | count1 = [[j*2 for j in c] for c in count1] 112 | 113 | for w_im in ['但是','但']: 114 | if w_im in words : # 扫描但是后面的情感词,发现后权值*5 115 | ind = words.index(w_im) 116 | count1_head = count1[:ind] 117 | count1_tail = count1[ind:] 118 | count1_tail_new = [[j*5 for j in c] for c in count1_tail] 119 | count1 = [] 120 | count1.extend(count1_head) 121 | count1.extend(count1_tail_new) 122 | break 123 | if words[-1] in ['?','?']:# 扫描是否有问好,发现后为负面 124 | count1 = [[0,2]] 125 | 126 | count2.append(count1) 127 | count1=[] 128 | return count2 129 | 130 | def sentiment_score(self,s): 131 | senti_score_list = self.sentiment_score_list(s) 132 | if senti_score_list != []: 133 | negatives=[] 134 | positives=[] 135 | for review in senti_score_list: 136 | score_array = np.array(review) 137 | AvgPos = np.sum(score_array[:,0]) 138 | AvgNeg = np.sum(score_array[:,1]) 139 | negatives.append(AvgNeg) 140 | positives.append(AvgPos) 141 | pos_score = np.mean(positives) 142 | neg_score = np.mean(negatives) 143 | if pos_score >=0 and neg_score<=0: 144 | pos_score = pos_score 145 | neg_score = abs(neg_score) 146 | elif pos_score >=0 and neg_score>=0: 147 | pos_score = pos_score 148 | neg_score = neg_score 149 | else: 150 | pos_score,neg_score=0,0 151 | return pos_score,neg_score 152 | 153 | def normalization_score(self,sent): 154 | score1,score0 = self.sentiment_score(sent) 155 | if score1 > 4 and score0 > 4: 156 | if score1 >= score0: 157 | _score1 = 1 158 | _score0 = 
score0/score1
159 |             elif score1 < score0:
160 |                 _score0 = 1
161 |                 _score1 = score1/score0
162 |         else:
163 |             if score1 >= 4:
164 |                 _score1 = 1
165 |             elif score1 < 4:
166 |                 _score1 = score1/4
167 |             if score0 >= 4:
168 |                 _score0 = 1
169 |             elif score0 < 4:
170 |                 _score0 = score0/4
171 |         return _score1,_score0
172 | 
173 | 
174 | if __name__ == '__main__':
175 |     sa = SentimentAnalysis()
176 |     text = '我妈说明儿不让出去玩'
177 |     print(sa.normalization_score(text))
178 | 
--------------------------------------------------------------------------------
/sentiment_analysis_dict/preidict.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Jan 7 10:28:41 2020
4 | 
5 | @author: cm
6 | """
7 | 
8 | 
9 | from sentiment_analysis_dict.networks import SentimentAnalysis
10 | 
11 | SA = SentimentAnalysis()
12 | 
13 | 
14 | def predict(sent):
15 |     """
16 |     1: positive
17 |     0: neutral
18 |     -1: negative
19 |     """
20 |     score1,score0 = SA.normalization_score(sent)
21 |     if score1 == score0:
22 |         result = 0
23 |     elif score1 > score0:
24 |         result = 1
25 |     elif score1 < score0:
26 |         result = -1
27 |     return result
28 | 
29 | 
30 | if __name__ == '__main__':
31 |     text = '对你不满意'
32 |     text = '大美女'
33 |     text = '帅哥'
34 |     text = '我妈说明儿不让出去玩'
35 |     print(predict(text))
36 | 
--------------------------------------------------------------------------------
/sentiment_analysis_dict/utils.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon Jan 6 20:47:37 2020
4 | 
5 | @author: cm
6 | """
7 | 
8 | import re
9 | 
10 | 
11 | class ToolGeneral():
12 |     """
13 |     Tool functions
14 |     """
15 |     def is_odd(self,num):
16 |         if num % 2 == 0:
17 |             return 'even'
18 |         else:
19 |             return 'odd'
20 | 
21 |     def load_dict(self,file):
22 |         """
23 |         Load a dictionary file into a set
24 |         """
25 |         with open(file,encoding='utf-8',errors='ignore') as fp:
26 |             lines = fp.readlines()
27 |             lines = [l.strip() for l in lines]
28 |             print("Load data from file (%s) finished!"%file)
29 |         dictionary = [word.strip() for word in lines]
30 |         return set(dictionary)
31 | 
32 | 
33 |     def sentence_split_regex(self,sentence):
34 |         """
35 |         Split a sentence into clauses
36 |         """
37 |         if sentence is not None:
38 |             sentence = re.sub(r"–+|—+", "-", sentence)
39 |             sub_sentence = re.split(r"[。,,!!??;;\s…~~]+|\.{2,}|…+| +|_n|_t", sentence)
40 |             sub_sentence = [s for s in sub_sentence if s != '']
41 |             if sub_sentence != []:
42 |                 return sub_sentence
43 |             else:
44 |                 return [sentence]
45 |         return []
46 | 
47 | 
48 | if __name__ == "__main__":
49 |     #
50 |     tool = ToolGeneral()
51 |     #
52 |     s = '我今天。昨天上午,还有现在'
53 |     ls = tool.sentence_split_regex(s)
54 |     print(ls)
55 | 
56 | 
57 | 
58 | 
--------------------------------------------------------------------------------
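The Bayes README above describes a pipeline that the repository only partially ships as scripts: words are ranked by their Pearson correlation with the sentiment label (the script that built dict/vocabulary_pearson_40000.txt is not included), the top-ranked words become count features (prepare.py / load.py), and train.py / bayes.py fit and apply a naive Bayes classifier with Laplace smoothing in log space. The sketch below is not project code; it is a minimal, self-contained illustration of that pipeline on made-up toy data, and the tiny corpus, the helper names (pearson, classify_tokens) and top_k = 8 are assumptions chosen for the example rather than values from the repository (the real project keeps feature_size = 2000).

```python
# -*- coding: utf-8 -*-
# Toy sketch of the sentiment_analysis_bayes pipeline; all data is made up.
import numpy as np

# 1. Toy pre-tokenized corpus and binary labels (1 = positive, 0 = negative).
docs = [['我', '喜欢', '武汉'], ['服务', '很', '好'], ['我', '讨厌', '武汉'], ['太', '差', '了']]
labels = np.array([1, 1, 0, 0])

# 2. Rank words by the absolute Pearson correlation between their counts and the label,
#    then keep the top-k words as the feature vocabulary.
all_words = sorted({w for d in docs for w in d})
counts = np.array([[d.count(w) for w in all_words] for d in docs], dtype=float)

def pearson(x, y):
    x, y = x - x.mean(), y - y.mean()
    denom = np.sqrt((x ** 2).sum() * (y ** 2).sum())
    return 0.0 if denom == 0 else float((x * y).sum() / denom)

scores = [abs(pearson(counts[:, j], labels)) for j in range(len(all_words))]
top_k = 8  # stand-in for hp.feature_size = 2000 in the real project
vocabulary = [all_words[j] for j in np.argsort(scores)[::-1][:top_k]]

# 3. Bag-of-words count features over the selected vocabulary (as in prepare.py / load.py).
x = np.array([[d.count(w) for w in vocabulary] for d in docs], dtype=float)

# 4. Naive Bayes with add-one (Laplace) smoothing, kept in log space (as in train.py).
prior_pos = labels.mean()
p1_vec = np.log((1 + x[labels == 1].sum(axis=0)) / (2 + x[labels == 1].sum()))
p0_vec = np.log((1 + x[labels == 0].sum(axis=0)) / (2 + x[labels == 0].sum()))

# 5. Classify a new tokenized sentence by comparing the two log-posteriors (as in bayes.classify).
def classify_tokens(sentence_tokens):
    vec = np.array([sentence_tokens.count(w) for w in vocabulary], dtype=float)
    score_pos = (vec * p1_vec).sum() + np.log(prior_pos)
    score_neg = (vec * p0_vec).sum() + np.log(1 - prior_pos)
    return 'Positive' if score_pos > score_neg else 'Negative'

print(classify_tokens(['我', '喜欢', '好']))   # -> Positive on this toy data
print(classify_tokens(['讨厌', '差']))         # -> Negative on this toy data
```

The smoothing (counts initialised at one over a denominator initialised at two) and the decision rule (sum of count-weighted log probabilities plus the log prior) deliberately mirror train.py and bayes.classify; only the Pearson-based vocabulary selection is reconstructed from the README's description.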
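For the dictionary-based approach, the scoring core in sentiment_analysis_dict/networks.py weights each sentiment word by the degree adverbs and negations that precede it: words from most.txt multiply the score by 4, very.txt by 3, more.txt by 2, ish.txt by 0.5, insufficiently.txt by -0.3, over.txt by -0.5, and an odd number of negation words flips the sign. The toy sketch below reproduces only that weighting step; the inline mini word sets, the pre-tokenised input (the real code segments with jieba and loads the dict/*.txt files) and the helper name weighted_scores are illustrative assumptions, and the clause splitting, the exclamation / "但是" / question-mark rules and the score normalisation of the real class are deliberately left out.

```python
# -*- coding: utf-8 -*-
# Toy sketch of the degree-adverb / negation weighting in sentiment_analysis_dict/networks.py.
# The multipliers come from the project; these mini word sets are illustrative only.

POS = {'喜欢', '好'}            # stand-in for dict/positive.txt
NEG = {'讨厌', '差'}            # stand-in for dict/negative.txt
MOST = {'最'}                   # dict/most.txt            -> x4
VERY = {'很', '非常'}           # dict/very.txt            -> x3
MORE = {'更'}                   # dict/more.txt            -> x2
ISH = {'有点'}                  # dict/ish.txt             -> x0.5
INSUFFICIENT = {'不怎么'}       # dict/insufficiently.txt  -> x-0.3
OVER = {'过于'}                 # dict/over.txt            -> x-0.5
NOT = {'不', '没'}              # dict/inverse.txt: an odd count flips the sign


def weighted_scores(words):
    """Return (positive_score, negative_score) for one pre-tokenized clause."""
    pos_score, neg_score = 0.0, 0.0
    start = 0                                  # first token after the previous sentiment word
    for i, w in enumerate(words):
        if w in POS or w in NEG:
            score = 1.0
            negations = 0
            for prev in words[start:i]:        # scan the words in front of the sentiment word
                if prev in MOST:
                    score *= 4
                elif prev in VERY:
                    score *= 3
                elif prev in MORE:
                    score *= 2
                elif prev in ISH:
                    score *= 0.5
                elif prev in INSUFFICIENT:
                    score *= -0.3
                elif prev in OVER:
                    score *= -0.5
                elif prev in NOT:
                    negations += 1
            if negations % 2 == 1:             # odd number of negations flips the polarity
                score *= -1.0
            if w in POS:
                pos_score += score
            else:
                neg_score += score
            start = i + 1
    return pos_score, neg_score


print(weighted_scores(['我', '很', '喜欢', '武汉']))    # (3.0, 0.0)
print(weighted_scores(['我', '不', '喜欢', '武汉']))    # (-1.0, 0.0)
print(weighted_scores(['服务', '有点', '差']))          # (0.0, 0.5)
```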