├── README.md ├── data.tar.gz ├── image ├── 1.jpg └── 2.jpg └── source ├── main.py ├── models └── ace.py ├── train.sh └── utils ├── __init__.py ├── basic.py ├── char.txt └── data_loader.py /README.md: -------------------------------------------------------------------------------- 1 | # Aggregation Cross-Entropy for Sequence Recognition 2 | This repository contains the code for the paper **Aggregation Cross-Entropy for Sequence Recognition**. Zecheng Xie, Yaoxiong Huang, Yuanzhi Zhu, Lianwen Jin, Yuliang Liu and Lele Xie. CVPR. 2019. [\[Paper\]](https://arxiv.org/abs/1904.08364) 3 | 4 | Connectionist temporal classification (CTC) and attention mechanism are the most popular methods for sequence-learning problem. However, CTC relies on a sophisticated forward-backward algorithm for transcription, which prevents it from addressing two-dimensional (2D) prediction problem, whereas the attention mechanism leans on a complex attention module to fulfill its functionality, resulting in additional network parameters and runtime. 5 | 6 | In this paper, we propose a novel method, aggregation cross-entropy (ACE), for sequence recognition from a brand new perspective. The ACE loss function exhibits competitive performance to CTC and the attention mechanism, with much quicker implementation (as it involves only four fundamental formulas), faster inference\back-propagation (approximately *O(1)* in parallel), less storage requirement (no parameter and negligible runtime memory), and convenient employment (by replacing CTC with ACE). Furthermore, the proposed ACE loss function exhibits two noteworthy properties: (1) it can be directly applied for 2D prediction by flattening the 2D prediction into 1D prediction as the input and (2) it requires only characters and their numbers in the sequence annotation for supervision, which allows it to advance beyond sequence recognition, e.g., counting problem. 7 | 8 | ![](./image/1.jpg) 9 | Figure 1: Illustration of proposed ACE loss function. Generally, the 1D and 2D predictions are generated by integrated CNN-LSTM and FCN model, respectively. For the ACE loss function, the 2D prediction is further flattened to 1D prediction. During aggregation, the 1D predictions at all time-steps are accumulated for each class independently. After normalization, the prediction, together with the ground-truth, is utilized for loss estimation based on cross-entropy. 10 | 11 | ![](./image/2.jpg) 12 | Figure 2: Toy example to show the advantage of ACE loss function. Resnet-50 trained with ACE loss function is able to recognize shuffled characters in the images. For each sub-image, the right column shows the 2D prediction of the recognition model for the text images. It is noteworthy that they have similar character distributions in the 2D space. 13 | 14 | ## Requirements 15 | - [Python 3.6](https://www.python.org/) 16 | - [tensorflow 1.13+](https://pytorch.org/) 17 | - [OpenCV](https://opencv.org/) 18 | 19 | ## Data Preparation 20 | tar -xzvf data.tar.gz 21 | 22 | ## Training and Testing 23 | Start training: (in 'source/' folder) 24 | ```bash 25 | sh train.sh 26 | ``` 27 | - The training process should take **about 10s** for 100 iterations on a 1080Ti. 28 | 29 | ## Citation 30 | ``` 31 | @inproceedings{xie2019ace, 32 | title = {Aggregation Cross-Entropy for Sequence Recognition}, 33 | author = {Zecheng Xie, Yaoxiong Huang, Yuanzhi Zhu, Lianwen Jin, Yuliang Liu and Lele Xie}, 34 | booktitle = {CVPR}, 35 | year = {2019}, 36 | } 37 | ``` 38 | 39 | ## Attention 40 | The project is only free for academic research purposes. 41 | 42 | ## Reference 43 | https://github.com/summerlvsong/Aggregation-Cross-Entropy -------------------------------------------------------------------------------- /data.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsing-cv/Aggregation-Cross-Entropy-for-Sequence-Recognition/b3fdcdbcf02eea5b01959dffe1bc6380c7dbbff8/data.tar.gz -------------------------------------------------------------------------------- /image/1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsing-cv/Aggregation-Cross-Entropy-for-Sequence-Recognition/b3fdcdbcf02eea5b01959dffe1bc6380c7dbbff8/image/1.jpg -------------------------------------------------------------------------------- /image/2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsing-cv/Aggregation-Cross-Entropy-for-Sequence-Recognition/b3fdcdbcf02eea5b01959dffe1bc6380c7dbbff8/image/2.jpg -------------------------------------------------------------------------------- /source/main.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | from __future__ import print_function, division 3 | import torch 4 | import argparse 5 | import numpy as np 6 | import tensorflow as tf 7 | import tensorflow.keras.layers as KL 8 | from models.ace import ACE 9 | from utils.data_loader import ImageDataset 10 | tf.enable_eager_execution() 11 | # import torch.nn as nn 12 | # from torch import optim 13 | # import torch.nn.functional as F 14 | # from models.seq_module import ACE 15 | # from torch.autograd import Variable 16 | # from models.solver import seq_solver 17 | from utils.basic import timeSince 18 | # from torch.utils.data import DataLoader 19 | # from utils.data_loader import ImageDataset 20 | 21 | parser = argparse.ArgumentParser() 22 | parser.add_argument('--model_path', type=str, default='../log/snapshot/model-{:0>2d}.pkl') 23 | parser.add_argument('--total_epoch', type=int, default=50, help='total epoch number') 24 | parser.add_argument('--train_path', type=str, default='../data/train.txt') 25 | parser.add_argument('--test_path', type=str, default='../data/test.txt') 26 | parser.add_argument('--train_batch_size', type=int, default=50, help='training batch size') 27 | parser.add_argument('--test_batch_size', type=int, default=50, help='testing batch size') 28 | parser.add_argument('--last_epoch', type=int, default=0, help='last epoch') 29 | parser.add_argument('--class_num', type=int, default=26, help='class number') 30 | parser.add_argument('--dict', type=str, default='_abcdefghijklmnopqrstuvwxyz') 31 | opt = parser.parse_args() 32 | 33 | class _Bottleneck(tf.keras.Model): 34 | def __init__(self, filters, block, 35 | downsampling=False, stride=1, **kwargs): 36 | super(_Bottleneck, self).__init__(**kwargs) 37 | 38 | filters1, filters2, filters3 = filters 39 | conv_name_base = 'res' + block + '_branch' 40 | bn_name_base = 'bn' + block + '_branch' 41 | 42 | self.downsampling = downsampling 43 | self.stride = stride 44 | self.out_channel = filters3 45 | 46 | self.conv2a = KL.Conv2D(filters1, (1, 1), strides=(stride, stride), 47 | kernel_initializer='he_normal', 48 | name=conv_name_base + '2a') 49 | self.bn2a = KL.BatchNormalization(name=bn_name_base + '2a') 50 | 51 | self.conv2b = KL.Conv2D(filters2, (3, 3), padding='same', 52 | kernel_initializer='he_normal', 53 | name=conv_name_base + '2b') 54 | self.bn2b = KL.BatchNormalization(name=bn_name_base + '2b') 55 | 56 | self.conv2c = KL.Conv2D(filters3, (1, 1), 57 | kernel_initializer='he_normal', 58 | name=conv_name_base + '2c') 59 | self.bn2c = KL.BatchNormalization(name=bn_name_base + '2c') 60 | 61 | if self.downsampling: 62 | self.conv_shortcut = KL.Conv2D(filters3, (1, 1), strides=(stride, stride), 63 | kernel_initializer='he_normal', 64 | name=conv_name_base + '1') 65 | self.bn_shortcut = KL.BatchNormalization(name=bn_name_base + '1') 66 | 67 | def __call__(self, inputs, training=False): 68 | x = self.conv2a(inputs) 69 | x = self.bn2a(x, training=training) 70 | x = tf.nn.relu(x) 71 | 72 | x = self.conv2b(x) 73 | x = self.bn2b(x, training=training) 74 | x = tf.nn.relu(x) 75 | 76 | x = self.conv2c(x) 77 | x = self.bn2c(x, training=training) 78 | 79 | if self.downsampling: 80 | shortcut = self.conv_shortcut(inputs) 81 | shortcut = self.bn_shortcut(shortcut, training=training) 82 | else: 83 | shortcut = inputs 84 | 85 | x += shortcut 86 | x = tf.nn.relu(x) 87 | 88 | return x 89 | 90 | 91 | class ResNet(tf.keras.Model): 92 | def __init__(self, depth, **kwargs): 93 | super(ResNet, self).__init__(**kwargs) 94 | if depth not in [50, 101]: 95 | raise AssertionError('depth must be 50 or 101.') 96 | self.depth = depth 97 | self.padding = KL.ZeroPadding2D((3, 3)) 98 | self.conv1 = KL.Conv2D(64, (7, 7), strides=(2, 2), kernel_initializer='he_normal', name='conv1') 99 | self.bn_conv1 = KL.BatchNormalization(name='bn_conv1') 100 | self.max_pool = KL.MaxPooling2D((3, 3), strides=(2, 2), padding='same') 101 | 102 | self.res2a = _Bottleneck([64, 64, 256], block='2a', downsampling=True, stride=1) 103 | self.res2b = _Bottleneck([64, 64, 256], block='2b') 104 | self.res2c = _Bottleneck([64, 64, 256], block='2c') 105 | 106 | self.res3a = _Bottleneck([128, 128, 512], block='3a', downsampling=True, stride=2) 107 | self.res3b = _Bottleneck([128, 128, 512], block='3b') 108 | self.res3c = _Bottleneck([128, 128, 512], block='3c') 109 | self.res3d = _Bottleneck([128, 128, 512], block='3d') 110 | 111 | self.res4a = _Bottleneck([256, 256, 1024], block='4a', downsampling=True, stride=2) 112 | self.res4b = _Bottleneck([256, 256, 1024], block='4b') 113 | self.res4c = _Bottleneck([256, 256, 1024], block='4c') 114 | self.res4d = _Bottleneck([256, 256, 1024], block='4d') 115 | self.res4e = _Bottleneck([256, 256, 1024], block='4e') 116 | self.res4f = _Bottleneck([256, 256, 1024], block='4f') 117 | if self.depth == 101: 118 | self.res4g = _Bottleneck([256, 256, 1024], block='4g') 119 | self.res4h = _Bottleneck([256, 256, 1024], block='4h') 120 | self.res4i = _Bottleneck([256, 256, 1024], block='4i') 121 | self.res4j = _Bottleneck([256, 256, 1024], block='4j') 122 | self.res4k = _Bottleneck([256, 256, 1024], block='4k') 123 | self.res4l = _Bottleneck([256, 256, 1024], block='4l') 124 | self.res4m = _Bottleneck([256, 256, 1024], block='4m') 125 | self.res4n = _Bottleneck([256, 256, 1024], block='4n') 126 | self.res4o = _Bottleneck([256, 256, 1024], block='4o') 127 | self.res4p = _Bottleneck([256, 256, 1024], block='4p') 128 | self.res4q = _Bottleneck([256, 256, 1024], block='4q') 129 | self.res4r = _Bottleneck([256, 256, 1024], block='4r') 130 | self.res4s = _Bottleneck([256, 256, 1024], block='4s') 131 | self.res4t = _Bottleneck([256, 256, 1024], block='4t') 132 | self.res4u = _Bottleneck([256, 256, 1024], block='4u') 133 | self.res4v = _Bottleneck([256, 256, 1024], block='4v') 134 | self.res4w = _Bottleneck([256, 256, 1024], block='4w') 135 | 136 | self.res5a = _Bottleneck([512, 512, 2048], block='5a', downsampling=True, stride=2) 137 | self.res5b = _Bottleneck([512, 512, 2048], block='5b') 138 | self.res5c = _Bottleneck([512, 512, 2048], block='5c') 139 | 140 | self.out_channel = (256, 512, 1024, 2048) 141 | 142 | def __call__(self, inputs, training=True): 143 | x = self.padding(inputs) 144 | x = self.conv1(x) 145 | x = self.bn_conv1(x, training=training) 146 | x = tf.nn.relu(x) 147 | x = self.max_pool(x) 148 | 149 | x = self.res2a(x, training=training) 150 | x = self.res2b(x, training=training) 151 | C2 = x = self.res2c(x, training=training) 152 | 153 | x = self.res3a(x, training=training) 154 | x = self.res3b(x, training=training) 155 | x = self.res3c(x, training=training) 156 | C3 = x = self.res3d(x, training=training) 157 | 158 | x = self.res4a(x, training=training) 159 | x = self.res4b(x, training=training) 160 | x = self.res4c(x, training=training) 161 | x = self.res4d(x, training=training) 162 | x = self.res4e(x, training=training) 163 | x = self.res4f(x, training=training) 164 | if self.depth == 101: 165 | x = self.res4g(x, training=training) 166 | x = self.res4h(x, training=training) 167 | x = self.res4i(x, training=training) 168 | x = self.res4j(x, training=training) 169 | x = self.res4k(x, training=training) 170 | x = self.res4l(x, training=training) 171 | x = self.res4m(x, training=training) 172 | x = self.res4n(x, training=training) 173 | x = self.res4o(x, training=training) 174 | x = self.res4p(x, training=training) 175 | x = self.res4q(x, training=training) 176 | x = self.res4r(x, training=training) 177 | x = self.res4s(x, training=training) 178 | x = self.res4t(x, training=training) 179 | x = self.res4u(x, training=training) 180 | x = self.res4v(x, training=training) 181 | x = self.res4w(x, training=training) 182 | C4 = x 183 | 184 | # x = self.res5a(x, training=training) 185 | # x = self.res5b(x, training=training) 186 | # C5 = x = self.res5c(x, training=training) 187 | 188 | # return C2, C3, C4, C5 189 | return C4 190 | 191 | class ResnetEncoderDecoder(tf.keras.Model): 192 | def __init__(self): 193 | super(ResnetEncoderDecoder, self).__init__() 194 | self.resnet = ResNet(50) 195 | self.out = tf.keras.layers.Dense(opt.class_num+1) 196 | self.loss_layer = ACE(opt.dict) 197 | 198 | def call(self, inputs, training=True): 199 | input, label = inputs[0], inputs[1] 200 | input = self.resnet(input) 201 | # print ("input shape", input.shape) 202 | input = tf.nn.softmax(self.out(input),dim=-1) 203 | 204 | return self.loss_layer([input,label]) 205 | 206 | 207 | if __name__ == "__main__": 208 | 209 | model = ResnetEncoderDecoder() 210 | 211 | 212 | optimizer = tf.train.RMSPropOptimizer(learning_rate=0.0001) 213 | checkpoint_path = "./checkpoints/train" 214 | ckpt = tf.train.Checkpoint(model=model, 215 | optimizer=optimizer, 216 | step=tf.train.get_or_create_global_step()) 217 | ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5) 218 | 219 | train_set = ImageDataset(data_path = opt.train_path, char_path="utils/char.txt", batch_size=128, training=True).data_generation() 220 | test_set = ImageDataset(data_path = opt.test_path, char_path="utils/char.txt", batch_size=128, training=False).data_generation() 221 | 222 | start_epoch = 0 223 | if ckpt_manager.latest_checkpoint: 224 | ckpt.restore(ckpt_manager.latest_checkpoint) 225 | start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1]) 226 | print (f'Latest checkpoint restored!!\n\tModel path is {ckpt_manager.latest_checkpoint}') 227 | 228 | epochs = 100000 229 | for epoch in range(start_epoch, epochs): 230 | loss_history = [] 231 | for step, (inputs, labels) in enumerate(train_set): 232 | # print (inputs) 233 | with tf.GradientTape() as tape: 234 | loss = model(inputs["images"], training=True) 235 | correct_count, len_total, pre_total = model.loss_layer.result_analysis(step) 236 | recall = float(correct_count) / len_total 237 | precision = correct_count / (pre_total+0.000001) 238 | print(f'Epoch: {epoch:3d} it: {step:6d}, loss: {loss:.4f}, recall: {recall:.4f}, precision: {precision:.4f}') 239 | 240 | grads = tape.gradient(loss, model.variables) 241 | optimizer.apply_gradients(zip(grads, model.variables), global_step=tf.train.get_or_create_global_step()) 242 | 243 | loss_history.append(loss.numpy()) 244 | # if step == 0: 245 | # loss_aver = loss 246 | # loss_aver = 0.9*loss_aver+0.1*loss 247 | # if step == len(self.lmdb_train)-1: 248 | ckpt_manager.save() 249 | # the_solver = seq_solver(model = model, 250 | # lmdb = [lmdb_train, lmdb_test], 251 | # optimizer = optimizer, 252 | # scheduler = scheduler, 253 | # total_epoch = opt.total_epoch, 254 | # model_path = opt.model_path, 255 | # last_epoch = opt.last_epoch) 256 | 257 | # the_solver.forward() 258 | 259 | -------------------------------------------------------------------------------- /source/models/ace.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import math 3 | import torch 4 | import random 5 | import itertools 6 | import numpy as np 7 | import tensorflow as tf 8 | 9 | class ACE(tf.keras.Model): 10 | 11 | def __init__(self, dictionary, name='aggregrate_cross_entropy', **kwargs): 12 | super(ACE, self).__init__(name=name, **kwargs) 13 | self.softmax = None 14 | self.label = None 15 | self.dict = dictionary 16 | 17 | def call(self, inputs): 18 | input, label = inputs[0], inputs[1] 19 | self.bs,self.h,self.w,_ = input.shape.as_list() 20 | T_ = self.h*self.w 21 | 22 | input = tf.reshape(input, (self.bs,T_,-1)) 23 | input = input + 1e-10 24 | 25 | self.softmax = input 26 | nums,dist = label[:,0],label[:,1:] 27 | nums = T_ - nums 28 | 29 | self.label = tf.concat([tf.expand_dims(nums, -1),dist], 1) 30 | 31 | # ACE Implementation (four fundamental formulas) 32 | input = tf.reduce_sum(input, axis=1) 33 | input = input/T_ 34 | label = label/T_ 35 | loss = (-tf.reduce_sum(tf.math.log(input)*label))/self.bs 36 | 37 | return loss 38 | 39 | 40 | def decode_batch(self): 41 | out_best = tf.argmax(self.softmax, 2).numpy() 42 | pre_result = [0]*self.bs 43 | for j in range(self.bs): 44 | pre_result[j] = out_best[j][out_best[j]!=0].astype(np.int32) 45 | return pre_result 46 | 47 | 48 | def vis(self,iteration): 49 | sn = np.random.randint(0,self.bs-1) 50 | print(f'Test image {iteration*50+sn:4d}') 51 | pred = tf.argmax(self.softmax, 2).numpy() 52 | pred = pred[sn].astype(np.int32).tolist() # sample #0 53 | pred_string = ''.join([f'{self.dict[pn]:2s}' for pn in pred]) 54 | pred_string_set = [pred_string[i:i+self.w*2] for i in range(0, len(pred_string), self.w*2)] 55 | print('Prediction: ') 56 | for pre_str in pred_string_set: 57 | print(pre_str) 58 | label = self.label.numpy().astype(np.int32) # (batch_size, num_classes) 59 | label = ''.join([f'{self.dict[idx]:2s}:{pn:2d} ' for idx, pn in enumerate(label[sn]) if idx != 0 and pn != 0]) 60 | label = 'Label: ' + label 61 | print(label) 62 | 63 | def result_analysis(self, iteration): 64 | prediction = self.decode_batch() 65 | correct_count = 0 66 | pre_total = 0 67 | len_total = self.label[:,1:].numpy().sum() 68 | label_data = self.label.numpy() 69 | for idx, pre_list in enumerate(prediction): 70 | for pw in pre_list: 71 | if label_data[idx][pw] > 0: 72 | correct_count = correct_count + 1 73 | label_data[idx][pw] -= 1 74 | 75 | pre_total += len(pre_list) 76 | 77 | if np.random.random() < 0.05: 78 | self.vis(iteration) 79 | 80 | return correct_count, len_total, pre_total -------------------------------------------------------------------------------- /source/train.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | filename="../log/log/log_`date +%y_%m_%d_%H_%M_%S`.txt" 4 | CUDA_VISIBLE_DEVICES=0 python -u main.py \ 5 | 2>&1 | tee $filename 6 | 7 | 8 | 9 | -------------------------------------------------------------------------------- /source/utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tsing-cv/Aggregation-Cross-Entropy-for-Sequence-Recognition/b3fdcdbcf02eea5b01959dffe1bc6380c7dbbff8/source/utils/__init__.py -------------------------------------------------------------------------------- /source/utils/basic.py: -------------------------------------------------------------------------------- 1 | import time 2 | import math 3 | 4 | 5 | def asMinutes(s): 6 | m = math.floor(s / 60) 7 | s -= m * 60 8 | return '%dm %ds' % (m, s) 9 | 10 | 11 | def timeSince(since): 12 | now = time.time() 13 | s = now - since 14 | return '%s' % (asMinutes(s)) 15 | -------------------------------------------------------------------------------- /source/utils/char.txt: -------------------------------------------------------------------------------- 1 | a 2 | b 3 | c 4 | d 5 | e 6 | f 7 | g 8 | h 9 | i 10 | j 11 | k 12 | l 13 | m 14 | n 15 | o 16 | p 17 | q 18 | r 19 | s 20 | t 21 | u 22 | v 23 | w 24 | x 25 | y 26 | z 27 | _ -------------------------------------------------------------------------------- /source/utils/data_loader.py: -------------------------------------------------------------------------------- 1 | import cv2 2 | import numpy as np 3 | import tensorflow as tf 4 | import unicodedata 5 | import multiprocessing 6 | 7 | class ImageDataset(): 8 | """Face Landmarks dataset.""" 9 | 10 | def __init__(self, data_path, char_path, batch_size=10, training=True, transform=None): 11 | """ 12 | Args: 13 | data_path (string): Path to the files with images and their annotations. 14 | length (string): image number. 15 | class_num (int): class number. 16 | """ 17 | with open(data_path) as fh: 18 | self.img_and_label = fh.readlines() 19 | with open(char_path) as f: 20 | self.char_id_map = {char.strip():idx for idx,char in enumerate(f)} 21 | print (self.char_id_map) 22 | self.length = len(self.img_and_label) 23 | self.class_num = len(self.char_id_map) 24 | self.indexes = np.arange(self.length) 25 | self.batch_size = batch_size 26 | self.transform = transform if training else None 27 | 28 | 29 | def __len__(self): 30 | return self.length 31 | 32 | def __getitem__(self, index): 33 | img_and_label = self.img_and_label[index].strip() 34 | pth, word = img_and_label.split(' ') # image path and its annotation 35 | 36 | image = cv2.imread(pth)#,0) 37 | image = cv2.pyrDown(image).astype('float32') # 100*100 38 | 39 | word = [ord(var)-97 for var in word] # a->0 40 | 41 | label = np.zeros((self.class_num)).astype('float32') 42 | 43 | for ln in word: 44 | label[int(ln+1)] += 1 # label construction for ACE 45 | 46 | label[0] = len(word) 47 | return pth,image,label 48 | 49 | 50 | def data_generation(self): 51 | steps_of_per_epoch = self.length//self.batch_size 52 | while True: 53 | with multiprocessing.Pool(processes=8) as pool: 54 | np.random.shuffle(self.img_and_label) 55 | for i, data in enumerate(range(steps_of_per_epoch)): 56 | paths = [] 57 | images = [] 58 | labels = [] 59 | index_batch = self.indexes[i * self.batch_size:(i + 1) * self.batch_size] 60 | for idx in index_batch: 61 | # print (idx) 62 | path,image,label = self.__getitem__(idx) 63 | # print ("label", label) 64 | paths.append(path) 65 | images.append(image) 66 | labels.append(label) 67 | 68 | yield {"path":np.array(paths), "images": np.array(images)}, np.array(labels) 69 | 70 | 71 | --------------------------------------------------------------------------------