├── 图像人脸OCR语音算法模型整理.docx
└── README.md
/图像人脸OCR语音算法模型整理.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/taylorlu/MachineLearningDOC/HEAD/图像人脸OCR语音算法模型整理.docx
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## A Survey of Image, Face, OCR, and Speech Algorithms
2 | ##### [Outline: Image, Audio, and Machine Learning](#0)
3 | ##### 1. [General Object Detection/Recognition](#1)
4 | ##### 2. [Specific Object Detection, Recognition, and Retrieval (Specific Object Detection/CBIR)](#2)
5 | ##### 3. [Object Tracking](#3)
6 | ##### 4. [Object Segmentation](#4)
7 | ##### 5. [Face Detection](#5)
8 | ##### 6. [Face Alignment (Facial Landmark Localization)](#6)
9 | ##### 7. [Face Recognition](#7)
10 | ##### 8. [Face Reconstruction](#8)
11 | ##### 9. [OCR (Wild Scene & Handwritten)](#9)
12 | ##### 10. [Automatic Speech Recognition (ASR/Speech to Text)](#10)
13 | ##### 11. [Speaker Recognition (Identification/Verification)](#11)
14 | ##### 12. [Speaker Diarization](#12)
15 | ##### 13. [Speech Synthesis (Text To Speech)](#13)
16 | ##### 14. [Voice Conversion](#14)
17 | ##### 15. [Face Biometrics (Age and Gender Estimation)](#15)
18 |
19 | **Outline: Image, Audio, and Machine Learning**
20 | + Images:
21 | ```
22 | 1. Transforms: rotation, scaling, translation, affine, projective
23 | ```
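24 | Rotation and scaling are linear maps; for 2-D images each is a 2x2 matrix, and an SVD factors any 2x2 linear map into a rotation, an axis-aligned scaling, and another rotation.
25 | Translation is not a linear map, so to express it as a matrix product each point gets an extra homogeneous coordinate and the matrix an extra column; rotation, scaling, and translation together then form a 2x3 (or 3x3) matrix.
26 | An affine transform additionally allows reflection and shear. It preserves collinearity of points and parallelism of lines, and has 6 degrees of freedom (DoF).
27 | A projective transform (homography) is not linear in Cartesian coordinates because of the perspective divide; it has 8 degrees of freedom.
28 | See [Transformations](https://courses.cs.washington.edu/courses/csep576/11sp/pdf/Transformations.pdf).
29 | Estimating and inverting such transforms can rectify distorted text for OCR, for example with the [TPS (thin-plate spline) algorithm](https://profs.etsmtl.ca/hlombaert/thinplates).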
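The transform hierarchy above can be illustrated with a minimal sketch in homogeneous coordinates. The function names `affine_matrix` and `apply_transform` are hypothetical, and plain Python lists stand in for a matrix library; a real pipeline would more likely use OpenCV or NumPy.

```python
import math

def affine_matrix(angle_deg, scale, tx, ty):
    """Build a 3x3 homogeneous matrix: rotate, scale uniformly, then translate."""
    a = math.radians(angle_deg)
    c, s = math.cos(a) * scale, math.sin(a) * scale
    return [[c, -s, tx],
            [s,  c, ty],
            [0.0, 0.0, 1.0]]

def apply_transform(m, point):
    """Apply a 3x3 homogeneous matrix to a 2-D point, including the perspective divide."""
    x, y = point
    xh = m[0][0] * x + m[0][1] * y + m[0][2]
    yh = m[1][0] * x + m[1][1] * y + m[1][2]
    w = m[2][0] * x + m[2][1] * y + m[2][2]
    return (xh / w, yh / w)

# Similarity transform: the last row stays (0, 0, 1), so w is always 1.
similarity = affine_matrix(90, 2.0, 10.0, 0.0)
print(apply_transform(similarity, (1.0, 0.0)))

# Projective transform: a non-trivial last row makes the divide by w matter.
homography = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.5, 0.0, 1.0]]
print(apply_transform(homography, (2.0, 4.0)))
```

The first six entries are the 6 affine degrees of freedom; the two extra entries in the last row (with the overall scale fixed) give the 8 degrees of freedom of a homography.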
30 | ```
31 | 2. Convolution: first-order and second-order operators
32 | ```
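33 | First-order operators include Roberts, Sobel, and Prewitt. Because each kernel takes a first derivative along one axis, a single pass detects edges in only one direction.
34 | Second-order operators include the Laplacian, LoG, and DoG. They are a common first step in keypoint detection; the plain Laplacian is sensitive to noise, which is why LoG and DoG smooth with a Gaussian first.
35 | Convolution is the multiply-and-sum operation from signal processing. In a CNN the kernels are trainable parameters, but because the weights are shared across positions the parameter count stays small and Dropout is usually unnecessary on convolutional layers. Learned kernels are generally not symmetric, so a CNN is not rotation invariant, and pooling layers provide only a small degree of scale and translation invariance. The original remedy was data augmentation. At NIPS 2015, [Spatial Transformer Networks](https://papers.nips.cc/paper/5854-spatial-transformer-networks.pdf) proposed a module, trained by backpropagation, that learns the transformation matrix automatically, which greatly reduces the reliance on data augmentation.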
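As a small illustration of the multiply-and-sum operation and of why a first-order kernel sees only one edge direction, here is a sketch in plain Python. The helper name `correlate2d_valid` is hypothetical; like CNN layers it does not flip the kernel, so strictly speaking it computes cross-correlation.

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]

def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(kernel[u][v] * image[i + u][j + v]
                           for u in range(kh) for v in range(kw)))
        out.append(row)
    return out

# A 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 9, 9] for _ in range(4)]
print(correlate2d_valid(image, SOBEL_X))  # strong response to the vertical edge
print(correlate2d_valid(image, SOBEL_Y))  # no response: there is no horizontal edge
```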
36 | ```
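37 | 3. Otsu thresholding (binarization), watershed segmentation
38 | Discrete Fourier transform (DFT), discrete cosine transform (DCT), wavelet transform
39 | First- and second-order image moments, shape descriptors
40 | Color spaces (RGB, YUV, HSV)
41 | These are used mostly in video coding and image analysis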
42 | ```
43 | ```
44 | 4. Image blending
45 | ```
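46 | Image blending is useful as a post-processing step after deep learning, for example compositing a segmented object onto another background, or face swapping. A common method is [Poisson Image Editing](https://www.cs.virginia.edu/~connelly/class/2014/comp_photo/proj2/poisson.pdf).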
47 |
48 | + Audio:
49 | ```
50 | 1. WAV and MFCC
51 | ```
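52 | Speech carries time-domain information, so frequency analysis is done with the windowed short-time Fourier transform (STFT): the signal is cut into short overlapping frames, each frame is multiplied by a window function, and its spectrum is computed; the choice of window affects the resulting spectral values. MFCCs are cepstral coefficients on the mel frequency scale, that is, a nonlinear log cepstrum. For ASR and speaker verification (SV), WAV files are usually converted to MFCCs first, although some models work on the raw waveform directly, such as WaveNet and SincNet. The advantage of MFCCs is that they keep both time and frequency information: the short windows compress the signal into a 2-D array that an ordinary CNN can process.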
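The framing and windowing step can be sketched as follows. This is a minimal illustration with a naive DFT and a Hann window, not an efficient implementation; `stft_magnitude` is a hypothetical name, and a full MFCC pipeline would go on to apply a mel filterbank, a logarithm, and a DCT to each frame.

```python
import cmath
import math

def stft_magnitude(signal, frame_len, hop):
    """Return per-frame magnitude spectra using a Hann window and a naive DFT."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):  # keep only non-negative frequencies
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames

# 64 samples of a sinusoid with exactly 4 cycles per 16-sample frame.
signal = [math.sin(2 * math.pi * 4 * n / 16) for n in range(64)]
spec = stft_magnitude(signal, frame_len=16, hop=8)
peak_bin = spec[0].index(max(spec[0]))
print(len(spec), len(spec[0]), peak_bin)
```

The result is a time-by-frequency grid (frames by bins), which is the 2-D layout that lets an ordinary CNN treat audio like an image.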
53 | ```
54 | 2. Music identification and query by humming
55 | ```
56 | Traditional methods studied previously, based on MFCC and an inverted index:
57 | 1. A Highly Robust Audio Fingerprinting System
58 | 2. ROBUST AUDIO FINGERPRINT EXTRACTION ALGORITHM
59 | 3. An Industrial-Strength Audio Search Algorithm
60 | Deep-learning-based retrieval:
61 | A Tutorial on Deep Learning for Music Information Retrieval
62 |
63 |
64 | + Statistical learning:
65 | ```
66 | 1. SVM (support vector machine)
67 | ```
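68 | The textbook below, by Nello Cristianini and John Shawe-Taylor, can be found online in both English and Chinese.
69 | An Introduction to Support Vector Machines and Other Kernel-based Learning Methods
70 | It covers kernel functions, VC dimension and maximum-margin generalization, the Lagrangian dual problem with KKT inequality constraints, and the SMO algorithm for solving for the Lagrange multipliers, which makes it a fairly complete text.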
71 |
72 | ```
73 | 2. AdaBoost
74 | ```
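75 | AdaBoost builds a strong learner from weak learners. It is an iterative algorithm: as long as each weak classifier is slightly better than random guessing, the iterations combine them into a strong classifier. It tends to resist overfitting, but it is sensitive to noise.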
76 | 1. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting
77 | 2. Multi-class AdaBoost
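The weak-to-strong iteration can be sketched with discrete AdaBoost over 1-D threshold stumps. The data, function names, and stump search below are illustrative assumptions, not taken from the cited papers.

```python
import math

def train_adaboost(xs, ys, rounds):
    """Discrete AdaBoost on 1-D inputs with threshold stumps; labels are +1/-1."""
    n = len(xs)
    weights = [1.0 / n] * n
    model = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        best = None
        for thr in sorted(set(xs)):
            for pol in (1, -1):
                preds = [pol if x >= thr else -pol for x in xs]
                err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, pol, preds)
        err, thr, pol, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, thr, pol))
        # Up-weight misclassified samples, down-weight correct ones, renormalise.
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * (pol if x >= thr else -pol) for alpha, thr, pol in model)
    return 1 if score >= 0 else -1

# No single threshold separates this labelling, but three boosted stumps do.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = train_adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

The reweighting line is also where the noise sensitivity comes from: a mislabelled sample keeps gaining weight in every round in which it is misclassified.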
78 |
79 | ```
80 | 3. Decision trees
81 | ```
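82 | Used mainly in data mining. Common tree-induction algorithms include ID3, C4.5, C5.0, and CART. The drawback is instability, especially when class sizes are imbalanced.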
83 |
84 | ```
85 | 4. Bayesian networks, random forests
86 | ```
87 |
88 | ```
89 | 5. EM / GMM
90 | ```
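91 | A clustering model with latent variables. The latent variable is which component generated each sample; its posterior, the probability that each sample belongs to each component, is called the responsibility.
92 | EM alternates two steps. The E-step fixes the model parameters and computes the posterior of the latent variables, which makes the Jensen lower bound on the log-likelihood tight. The M-step fixes that posterior and maximizes the expected log-likelihood to update the parameters.
93 | A GMM assumes each component is Gaussian; the parameters are the mean, variance, and mixing weight of each component, and the latent variable is the component assignment.
94 | EM is sensitive to initialization and does not guarantee a global optimum. It has many uses, such as clustering and the UBM (universal background model) in speaker recognition.
95 | Solving EM with a neural network: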
96 | 1. Neural Expectation Maximization
97 | https://github.com/sjoerdvansteenkiste/Neural-EM
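The two EM steps for a Gaussian mixture can be sketched in one dimension. This is a minimal illustration under simplifying assumptions (scalar data, a fixed number of iterations, a small variance floor); `em_gmm_1d` is a hypothetical name.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, iterations):
    """Fit a 1-D Gaussian mixture with EM, starting from the given initial parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: responsibilities P(component j | x_i) under the current parameters.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate weights, means, and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps a component from collapsing
    return means, variances, weights

data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
means, variances, weights = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5], 20)
print(means, weights)
```

Starting both means at the same value would leave the two components identical forever, which is a concrete instance of the sensitivity to initialization noted above.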
98 |
99 | ```
100 | 6. Unsupervised clustering: K-means, mean shift, and graph-based spectral clustering
101 | ```
102 |
103 | ```
104 | 7. Clustering without a preset number of clusters: DBSCAN, Chinese Whispers
105 | ```
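106 | + Deep learning:
107 | Deep learning here means models built entirely from neural networks, including CNNs for spatial data and RNNs for temporal data. The three main concerns are network design, loss function design, and the optimizer.
108 | **Network design**: representative examples are CNN, dilated (atrous) convolution, depthwise separable convolution, Dropout, RNN/LSTM/GRU, Attention/Self-Attention/Transformer, ResNet, the Inception family, and SqueezeNet/MobileNet/ShuffleNet.
109 | **Loss functions**: representative examples are Triplet loss, Center loss, SphereFace, ArcFace, and AM-Softmax.
110 | **Optimizers**: mainly SGD, Momentum, Adagrad, Adadelta, Adam, RMSprop, AdaBound, and ADMM, plus other techniques that speed up convergence or reduce overfitting, such as batch normalization and regularization.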
111 |
112 |
113 | 1. **General Object Detection/Recognition**
114 | + Traditional methods:
115 | ```
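116 | 1. Bag-of-words models: SIFT/SURF + KMeans + SVM
117 | 2. Sparse coding: LLC
118 | 3. Aggregated features: Fisher Vector/VLAD
119 | 4. Deformable part models: DPM, which uses HOG and Latent SVM
120 | 5. Keypoint detection and description: recent deep-learning methods such as LIFT, DELF, and LF-Net; their drawback is speed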
121 | ```
122 | - Related papers:
123 | ```
124 | 1. Visual Object Recognition, Kristen Grauman
125 | 2. Locality-constrained Linear Coding for Image Classification
126 | 3. Fisher Kernels on Visual Vocabularies for Image Categorization
127 | 4. Improving the Fisher Kernel for Large-Scale Image Classification
128 | 5. Aggregating local descriptors into a compact image representation
129 | 6. Object Detection with Discriminatively Trained Part Based Models
130 | 7. LIFT: Learned Invariant Feature Transform
131 | 8. Large-Scale Image Retrieval with Attentive Deep Local Feature
132 | 9. LF-Net: Learning Local Features from Images
133 | ```
134 | - Open-source code:
135 | * http://www.vlfeat.org
136 | * https://github.com/rbgirshick/voc-dpm
137 | * https://github.com/cbod/cs766-llc
138 | * https://github.com/nashory/DeLF-pytorch
139 | * https://github.com/vcg-uvic/lf-net-release
140 |
141 |
142 | + Deep learning:
143 | ```
144 | RCNN/SPPNet/Faster RCNN, the YOLO series, SSD, R-FCN, RetinaNet, CFENet
145 | ```
146 | - Related papers:
147 | ```
148 | 1. Rich feature hierarchies for accurate object detection and semantic segmentation
149 | 2. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
150 | 3. Fast R-CNN
151 | 4. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
152 | 5. You Only Look Once: Unified, Real-Time Object Detection
153 | 6. YOLO9000: Better, Faster, Stronger
154 | 7. YOLOv3: An Incremental Improvement
155 | 8. SSD: Single Shot MultiBox Detector
156 | 9. R-FCN: Object Detection via Region-based Fully Convolutional Networks
157 | 10. Focal Loss for Dense Object Detection
158 | 11. CFENet: An Accurate and Efficient Single-Shot Object Detector for Autonomous Driving
159 | ```
160 | - Open-source code:
161 | * https://github.com/rbgirshick/rcnn
162 | * https://github.com/rbgirshick/fast-rcnn
163 | * https://github.com/rbgirshick/py-faster-rcnn
164 | * https://github.com/balancap/SSD-Tensorflow
165 | * https://github.com/chuanqi305/MobileNet-SSD
166 | * https://github.com/gliese581gg/YOLO_tensorflow
167 | * https://github.com/choasup/caffe-yolo9000
168 | * https://github.com/qqwweee/keras-yolo3
169 | * https://github.com/daijifeng001/R-FCN
170 | * https://github.com/YuwenXiong/py-R-FCN
171 | * https://github.com/daijifeng001/caffe-rfcn
172 | * https://github.com/facebookresearch/Detectron
173 |
174 |
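175 | 2. **Specific Object Detection, Recognition, and Retrieval (Specific Object Detection/CBIR)**
176 | - Specific-object recognition targets one particular image, so there is no large training set; in effect no training is needed. Most methods rely on hand-crafted features such as keypoints. For rigid objects, keypoint matches can be fed to SVD and RANSAC to estimate an affine transformation matrix, from which the orientation of the object outline follows. Neural-network methods also exist, such as R-MAC and NetVLAD, but they all reuse a pretrained backbone.
177 | - Keypoint matching: descriptors compared by Euclidean distance include SIFT/SURF; those compared by Hamming distance include AKAZE/FREAK. Euclidean search can use a KD-tree or other algorithms such as HNSW and FALCONN; Hamming search uses LSH.
178 | - Fisher Vector/VLAD descriptors can be converted to binary codes with random hyperplanes and then searched by Hamming distance.
179 | - Retrieval: open-source libraries for Euclidean-distance search include hnswlib, FALCONN, and Faiss.
180 | + Related papers: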
181 | ```
182 | 1. Aggregating Deep Convolutional Features for Image Retrieval
183 | 2. PARTICULAR OBJECT RETRIEVAL WITH INTEGRAL MAX-POOLING OF CNN ACTIVATIONS
184 | 3. Deep Learning of Binary Hash Codes for Fast Image Retrieval
185 | 4. Learning Compact Binary Descriptors with Unsupervised Deep Neural Networks
186 | 5. Bags of Local Convolutional Features for Scalable Instance Search
187 | 6. Deep Image Retrieval: Learning global representations for image search
188 | 7. Region-Based Image Retrieval Revisited
189 | ```
190 | + Open-source code:
191 | * https://github.com/Relja/netvlad
192 | * https://github.com/uzh-rpg/netvlad_tf_open
193 | * https://github.com/nmslib/hnswlib
194 | * https://github.com/facebookresearch/faiss
195 | * https://github.com/FALCONN-LIB/FALCONN
196 | * https://github.com/imatge-upc/retrieval-2016-icmr
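The random-hyperplane conversion from real-valued descriptors to Hamming codes mentioned in this section can be sketched as follows. The function names are hypothetical and the descriptor is a toy 4-D vector; libraries such as Faiss or FALCONN provide production implementations.

```python
import random

def make_hyperplanes(dim, bits, seed=0):
    """Draw one random Gaussian normal vector per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def hash_vector(vec, planes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
            for plane in planes]

def hamming(a, b):
    """Number of positions where two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))

planes = make_hyperplanes(dim=4, bits=16)
v = [1.0, 2.0, 3.0, 4.0]
same_direction = [2.0 * x for x in v]
opposite = [-x for x in v]
print(hamming(hash_vector(v, planes), hash_vector(same_direction, planes)))
print(hamming(hash_vector(v, planes), hash_vector(opposite, planes)))
```

Each bit depends only on the side of a hyperplane through the origin, so the code preserves angular similarity rather than Euclidean distance; descriptors are therefore usually L2-normalized before hashing.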
197 |
198 |
199 | 3. **Object Tracking**
200 | - Optical flow
201 | - Kalman filter
202 | - Mean shift
203 | All of these are implemented in OpenCV. Most target rigid objects and are a poor fit for faces.
204 | Deep-learning methods:
205 | - CFNet
206 | + Related papers:
207 | ```
208 | End-to-end representation learning for Correlation Filter based tracking
209 | ```
210 | + Open-source code:
211 | * https://github.com/bertinetto/cfnet
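For the Kalman filter listed above, a scalar sketch shows the predict and update cycle. It assumes a constant-position (random walk) motion model and hypothetical noise values; practical trackers use a multi-dimensional state with position and velocity, for example through OpenCV's `cv2.KalmanFilter`.

```python
def kalman_1d(measurements, process_var, measurement_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter with a constant-position (random walk) motion model."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the state is assumed unchanged, uncertainty grows by the process noise.
        p = p + process_var
        # Update: blend prediction and measurement according to the Kalman gain.
        k = p / (p + measurement_var)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates

# A stationary target at position 5.0, starting from a wrong initial guess of 0.0.
estimates = kalman_1d([5.0] * 50, process_var=1e-4, measurement_var=1.0)
print(estimates[0], estimates[-1])
```

The gain starts near 0.5 and shrinks as the filter becomes confident, so early measurements move the estimate a lot and later ones only refine it.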
212 |
213 |
214 | 4. **Object Segmentation**
215 | - Current mainstream methods are all based on neural networks.
216 | - FCN, SegNet, PSPNet, Mask R-CNN, the DeepLab series, RefineNet, DeeperLab
217 | + Related papers:
218 | ```
219 | 1. Fully Convolutional Networks for Semantic Segmentation
220 | 2. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
221 | 3. Pyramid Scene Parsing Network
222 | 4. Mask R-CNN
223 | 5. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
224 | 6. Rethinking Atrous Convolution for Semantic Image Segmentation
225 | 7. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
226 | 8. RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation
227 | 9. DeeperLab: Single-Shot Image Parser
228 | 10. MobileNetV2: Inverted Residuals and Linear Bottlenecks
229 | ```
230 |
231 | + Open-source code:
232 | * https://github.com/shekkizh/FCN.tensorflow
233 | * https://github.com/alexgkendall/caffe-segnet
234 | * https://github.com/hszhao/PSPNet
235 | * https://github.com/Vladkryvoruchko/PSPNet-Keras-tensorflow
236 | * https://github.com/matterport/Mask_RCNN
237 | * https://github.com/sthalles/deeplab_v3
238 | * https://github.com/DrSleep/tensorflow-deeplab-resnet
239 | * https://github.com/guosheng/refinenet
240 | * https://github.com/DrSleep/light-weight-refinenet
241 |
242 |
243 | 5. **Face Detection**
244 | + Traditional methods: feature extraction plus a classifier
245 | ```
246 | Features include HOG and Haar; classifiers include AdaBoost, SVM, and cascades.
247 | Common open-source libraries: OpenCV, Dlib.
248 | ```
249 | + Deep learning:
250 | ```
251 | MTCNN, PyramidBox, HR, Face R-CNN, SSH, RSA, S3FD, FaceBoxes
252 | ```
253 | + Related papers:
254 | ```
255 | 1. Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks
256 | 2. PyramidBox: A Context-assisted Single Shot Face Detector.
257 | 3. Finding Tiny Faces
258 | 4. Face R-CNN
259 | 5. SSH: Single Stage Headless Face Detector
260 | 6. Recurrent Scale Approximation for Object Detection in CNN
261 | 7. S3FD: Single Shot Scale-invariant Face Detector
262 | 8. FaceBoxes: A CPU Real-time Face Detector with High Accuracy
263 | ```
264 | + Open-source code:
265 | * https://github.com/kpzhang93/MTCNN_face_detection_alignment
266 | * https://github.com/EricZgw/PyramidBox
267 | * https://github.com/cydonia999/Tiny_Faces_in_Tensorflow
268 | * https://github.com/mahyarnajibi/SSH
269 | * https://github.com/sciencefans/RSA-for-object-detection
270 | * https://github.com/louis-she/sfd.pytorch
271 | * https://github.com/sfzhang15/FaceBoxes
272 |
273 |
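274 | 6. **Face Alignment (Facial Landmark Localization)**
275 | + Some face detectors also perform landmark alignment; during training the losses of the two tasks are combined as a weighted sum. Alignment is either 2D or 3D: 2D uses only image-plane information, while 3D requires a 3D model and can also predict head pose.
276 | + 2D landmark alignment: DCNN, MTCNN, TCDCN, LAB
277 | + 3D landmark alignment: 3DDFA, DenseReg, FAN, PRNet, PIFA
278 | + Related papers: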
279 | ```
280 | 1. Facial Landmark Detection by Deep Multi-task Learning
281 | 2. Deep Convolutional Network Cascade for Facial Point Detection
282 | 3. Look at Boundary: A Boundary-Aware Face Alignment Algorithm
283 | 4. Face Alignment Across Large Poses: A 3D Solution
284 | 5. Pose-Invariant Face Alignment via CNN-Based Dense 3D Model Fitting
285 | 6. Dense Face Alignment
286 | 7. DenseReg: Fully Convolutional Dense Shape Regression In-the-Wild
287 | 8. How far are we from solving the 2D & 3D Face Alignment problem
288 | 9. Learning Dense Facial Correspondences in Unconstrained Images
289 | 10. Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network
291 | ```
292 | + Open-source code:
293 | * https://github.com/zhzhanp/TCDCN-face-alignment
294 | * https://github.com/wywu/LAB
295 | * https://github.com/cleardusk/3DDFA
296 | * https://github.com/ralpguler/DenseReg
297 | * https://github.com/YadiraF/PRNet
298 | * http://cvlab.cse.msu.edu/project-pifa.html
299 |
300 |
301 | 7. **Face Recognition**
302 | + Non-neural-network: GaussianFace
303 | + Deep learning: most of the progress comes from loss function design
304 | + DeepFace, the DeepID series, VGGFace, FaceNet, CenterLoss, MarginalLoss, SphereFace, ArcFace, AM-Softmax
305 | + Related papers:
306 | ```
307 | 1. Surpassing Human-Level Face Verification Performance on LFW with GaussianFace
308 | 2. DeepFace: Closing the Gap to Human-Level Performance in Face Verification
309 | 3. Deep Learning Face Representation from Predicting 10,000 Classes
310 | 4. Deep Learning Face Representation by Joint Identification-Verification
311 | 5. DeepID3: Face Recognition with Very Deep Neural Networks
312 | 6. Deep Face Recognition
313 | 7. FaceNet: A Unified Embedding for Face Recognition and Clustering
314 | 8. A Discriminative Feature Learning Approach for Deep Face Recognition
315 | 9. Marginal Loss for Deep Face Recognition
316 | 10. SphereFace: Deep Hypersphere Embedding for Face Recognition
317 | 11. ArcFace: Additive Angular Margin Loss for Deep Face Recognition
318 | 12. Additive Margin Softmax for Face Verification
319 | ```
320 | + Open-source code:
321 | * https://github.com/jangerritharms/GaussianFace
322 | * http://www.robots.ox.ac.uk/~vgg/software/vgg_face/
323 | * https://github.com/davidsandberg/facenet
324 | * https://github.com/wy1iu/sphereface
325 | * https://github.com/xialuxi/arcface-caffe
326 | * https://github.com/deepinsight/insightface
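Since most of these methods differ mainly in the loss, a sketch of the ArcFace-style additive angular margin may help. It assumes the inputs are already cosine similarities between a normalized embedding and normalized class weights; the defaults `margin=0.5` and `scale=64.0` are assumed typical values, and the handling of angles where theta + m exceeds pi, which real implementations include, is omitted here.

```python
import math

def arcface_logits(cosines, target, margin=0.5, scale=64.0):
    """Use s*cos(theta + m) for the target class and s*cos(theta) for the others."""
    logits = []
    for idx, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding error
        if idx == target:
            theta = math.acos(c)
            logits.append(scale * math.cos(theta + margin))
        else:
            logits.append(scale * c)
    return logits

def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

cosines = [0.8, 0.1, -0.2]  # similarity of one embedding to three class centres
plain = arcface_logits(cosines, target=0, margin=0.0)
with_margin = arcface_logits(cosines, target=0)
print(cross_entropy(plain, 0), cross_entropy(with_margin, 0))
```

The margin makes the target class harder to satisfy during training, which pushes embeddings of the same identity closer together on the hypersphere.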
327 |
328 |
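329 | 8. **Face Reconstruction**
330 | + Nearly all methods are 3D-based. Once a face is reconstructed it can be used for pose estimation and face swapping. Some face-swapping algorithms need multiple face images to train a GAN.
331 | + PRNet, VRN, Face2Face
332 | + Related papers: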
333 | ```
334 | 1. State of the Art on Monocular 3D Face Reconstruction, Tracking, and Applications
335 | 2. 3D Face Reconstruction with Geometry Details from a Single Image
336 | 3. Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network
337 | 4. CNN-based Real-time Dense Face Reconstruction with Inverse-rendered Photo-realistic Face Images
338 | 5. Large Pose 3D Face Reconstruction from a Single Image via Direct Volumetric CNN Regression
339 | 6. Deep Video Portraits
340 | 7. VDub: Modifying Face Video of Actors for Plausible Visual Alignment to a Dubbed Audio Track
341 | 8. paGAN: Real-time Avatars Using Dynamic Textures
342 | 9. On Face Segmentation, Face Swapping, and Face Perception
343 | 10. Extreme 3D Face Reconstruction: Looking Past Occlusions
344 | ```
345 | + Open-source code:
346 | * https://github.com/YadiraF/PRNet
347 | * https://github.com/AaronJackson/vrn
348 | * https://github.com/deepfakes/faceswap
349 | * https://github.com/datitran/face2face-demo
350 | * https://github.com/YuvalNirkin/face_swap
351 | * https://github.com/anhttran/extreme_3d_faces
352 |
353 |
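354 | 9. **OCR (Wild Scene & Handwritten)**
355 | + OCR covers locating and segmenting text in a scene and recognizing the characters. The traditional approach segments characters with a vertical projection histogram and sends them one by one to a classifier. With the arrival of CNNs, RNNs, CTC (a loss computed by dynamic programming), and attention, the mainstream model is now CNN+RNN+CTC, which segments the sequence automatically in the same way speech recognition segments phonemes. Detection boxes are usually horizontal; skewed text needs an extra step such as the Hough transform to correct its orientation. In recent years many detectors for text regions of arbitrary shape have appeared: segmentation-based ones such as PixelLink and TextSnake, regression-based ones such as TextBoxes, DMPNet, and RSDD, and hybrids such as SSTD. There are also end-to-end detection-plus-recognition models such as FOTS, EAA, Mask TextSpotter, and STN-OCR.
356 | + Text region detection:
357 | CTPN, EAST, TextBoxes++, AdvancedEAST, TextSnake, Mask TextSpotter, DMPNet, RSDD, LOMO, PSENet, Pixel-Anchor
358 | + Related papers: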
359 | ```
360 | 1. Detecting Text in Natural Image with Connectionist Text Proposal Network
361 | 2. Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes
362 | 3. Single Shot Scene Text Retrieval
363 | 4. EAST: An Efficient and Accurate Scene Text Detector
364 | 5. DeepTextSpotter: An End-to-End Trainable Scene Text Localization and Recognition Framework
365 | 6. Recursive Recurrent Nets with Attention Modeling for OCR in the Wild
366 | 7. Multi-Oriented Text Detection with Fully Convolutional Networks
367 | 8. Accurate Text Localization in Natural Image with Cascaded Convolutional Text Network
368 | 9. TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes
369 | 10. An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes
370 | 11. Rotation-Sensitive Regression for Oriented Scene Text Detection
371 | 12. Character Region Awareness for Text Detection
372 | 13. Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes
373 | 14. Shape Robust Text Detection with Progressive Scale Expansion Network
374 | 15. Pixel-Anchor: A Fast Oriented Scene Text Detector with Combined Networks
375 | 16. Overview: https://github.com/whitelok/image-text-localization-recognition
376 | 17. Challenge: http://rrc.cvc.uab.es
377 | 18. An end-to-end textspotter with explicit alignment and attention
378 | 19. STN-OCR: A single Neural Network for Text Detection and Text Recognition
379 | ```
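380 | + Text recognition:
381 | In wild scenes with distortion, detected regions range from coarse rectangles to precise polygons, and they are usually rectified before recognition. Rectification follows two broad directions: character-level methods such as TextSnake and Char-Net, and methods that use TPS with an STN to learn the parameters of a multi-point rectification automatically, as described in many papers.
382 | CRNN, GRCNN, CRAFT, ASTER, MORAN, ESIR, FAN, and AON, which supports vertical text
383 | + Related papers: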
384 | ```
385 | 1. Gated Recurrent Convolution Neural Network for OCR
386 | 2. An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition
387 | 3. What is wrong with scene text recognition model comparisons? dataset and model analysis
388 | 4. ASTER: An Attentional Scene Text Recognizer with Flexible Rectification
389 | 5. Char-Net: A Character-Aware Neural Network for Distorted Scene Text Recognition
390 | 6. MORAN: A Multi-Object Rectified Attention Network for Scene Text Recognition
391 | 7. SEE: Towards Semi-Supervised End-to-End Scene Text Recognition
392 | 8. ESIR: End-to-end Scene Text Recognition via Iterative Image Rectification
393 | 9. AON: Towards Arbitrarily-Oriented Text Recognition
394 | 10. Simultaneous Recognition of Horizontal and Vertical Text in Natural Images
395 | 11. Focusing Attention: Towards Accurate Text Recognition in Natural Images
396 | ```
397 | + Open-source code:
398 | * https://github.com/eragonruan/text-detection-ctpn
399 | * https://github.com/MhLiao/TextBoxes_plusplus
400 | * https://github.com/lluisgomez/single-shot-str
401 | * https://github.com/huoyijie/AdvancedEAST
402 | * https://github.com/MichalBusta/DeepTextSpotter
403 | * https://github.com/Jianfeng1991/GRCNN-for-OCR
404 | * https://github.com/princewang1994/TextSnake.pytorch
405 | * https://github.com/clovaai/deep-text-recognition-benchmark
406 | * https://github.com/bgshih/aster
407 | * https://github.com/liuheng92/tensorflow_PSENet
408 | * https://github.com/whai362/PSENet
409 | * https://github.com/Canjie-Luo/MORAN_v2
410 | * https://github.com/Bartzi/see
411 | * https://github.com/huizhang0110/AON
412 | * https://github.com/Bartzi/stn-ocr
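The automatic segmentation that CTC provides in a CNN+RNN+CTC recognizer is easiest to see at decoding time. Below is a sketch of best-path (greedy) decoding; the alphabet and per-frame probabilities are made up for illustration, and beam search, which real systems often use, is not shown.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Best-path CTC decoding: argmax per frame, merge repeats, then drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded = []
    prev = None
    for idx in best:
        if idx != prev and idx != blank:
            decoded.append(alphabet[idx])
        prev = idx
    return "".join(decoded)

alphabet = ["-", "a", "b"]  # index 0 is the CTC blank
frame_probs = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank separates the two a's
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.2, 0.7],    # b
    [0.1, 0.1, 0.8],    # b (repeat, merged)
    [0.8, 0.1, 0.1],    # blank
]
print(ctc_greedy_decode(frame_probs, alphabet))
```

The blank symbol is what allows a genuinely doubled character to survive the merge step, so no per-character segmentation of the image is needed.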
413 |
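414 | + Handwriting recognition:
415 | Because of the variety of writing styles, handwriting is much harder than printed text. Several NIPS papers use 2-D LSTM-RNNs (MDLSTM), followed by a faster attention-based variant; these methods split a handwritten paragraph into lines and align them automatically. A later ECCV paper proposed a multi-step method.
416 | + Related papers: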
417 | ```
418 | 1. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks
419 | 2. Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention
420 | 3. Joint Line Segmentation and Transcription for End-to-End Handwritten Paragraph Recognition
421 | 4. Start, Follow, Read: End-to-End Full-Page Handwriting Recognition
422 | 5. Handwriting Recognition with Large Multidimensional Long Short-Term Memory Recurrent Neural Networks
423 | 6. Handwriting Recognition of Historical Documents with few labeled data
424 | 7. Measuring Human Perception to Improve Handwritten Document Transcription
425 | 8. Learning Spatial-Semantic Context with Fully Convolutional Recurrent Network for Online Handwritten Chinese Text Recognition
426 | 9. Gated Convolutional Recurrent Neural Networks for Multilingual Handwriting Recognition
427 | 10. Joint Recognition of Handwritten Text and Named Entities with a Neural End-to-end Model
428 | ```
429 | + Open-source code:
430 | * https://github.com/cwig/start_follow_read
431 | * https://github.com/0x454447415244/HandwritingRecognitionSystem
432 | * http://www.tbluche.com/scan_attend_read.html
433 |
434 |
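435 | 10. **Automatic Speech Recognition (ASR/Speech to Text)**
436 | + Traditional systems are based on the GMM-HMM model and the Viterbi algorithm
437 | + Deep learning: extract MFCC short-time spectral features from the WAV, then model them with a CNN followed by an LSTM, trained with the CTC loss.
438 | GRU-CTC, DFCNN, DFSMN, DeepSpeech, CLDNN
439 | + Related papers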
440 | ```
441 | 1. DEEP-FSMN FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION
442 | 2. Deep Speech: Scaling up end-to-end speech recognition
443 | 3. CONVOLUTIONAL, LONG SHORT-TERM MEMORY, FULLY CONNECTED DEEP NEURAL NETWORKS
444 | ```
445 | + Open-source code:
446 | * https://github.com/buriburisuri/speech-to-text-wavenet
447 | * https://github.com/Kyubyong/tacotron
448 | * https://github.com/PaddlePaddle/DeepSpeech
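The Viterbi algorithm used in GMM-HMM decoding can be sketched on a toy HMM. The states, observations, and probabilities below are a textbook-style example chosen for illustration, not an acoustic model; in a real recognizer the emission scores come from the GMMs and the states are sub-phone units.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for an observation sequence (log domain)."""
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor of state s at this time step.
            prev = max(states, key=lambda r: score[r] + math.log(trans_p[r][s]))
            new_score[s] = (score[prev] + math.log(trans_p[prev][s])
                            + math.log(emit_p[s][obs]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=score.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}
print(viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p))
```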
449 |
450 |
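451 | 11. **Speaker Recognition (Identification/Verification)**
452 | + The main difficulties in speaker recognition are utterance duration, text independence, open-set comparison, and background noise. Deep-learning models based on d-vectors and x-vectors, combined with loss functions such as TE2E/GE2E, currently have the edge on short utterances. The traditional state of the art is the i-vector with PLDA channel compensation; deep-learning papers generally report the i-vector EER (equal error rate) as their baseline. Earlier methods were GMM-UBM and JFA channel compensation, which need large amounts of speech from different channels. Open-source frameworks for the traditional methods include Kaldi, ALIZE, SIDEKIT, and pyannote-audio. Deep-learning methods include d-vector, x-vector, j-vector (text-dependent), and models combined with an E2E loss, as well as GhostVLAD-based models and SincNet, which works directly on the raw waveform.
453 | + Open-source code: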
454 | * http://www-lium.univ-lemans.fr/sidekit/
455 | * https://alize.univ-avignon.fr/
456 | * http://www.kaldi-asr.org/
457 | * https://github.com/rajathkmp/speaker-verification
458 | * https://github.com/wangleiai/dVectorSpeakerRecognition
459 | * https://github.com/Janghyun1230/Speaker_Verification
460 | * https://github.com/pyannote/pyannote-audio
461 | * https://github.com/WeidiXie/VGG-Speaker-Recognition
462 | * https://github.com/mravanelli/SincNet
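Because EER is the figure every system here is compared on, a sketch of how it can be computed from verification scores may be useful. The scores are made-up values; this simple threshold sweep returns the point where false acceptance and false rejection rates are closest, whereas evaluation toolkits usually interpolate the ROC curve.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Sweep thresholds and return (EER estimate, threshold) where FAR and FRR are closest."""
    best = None
    for thr in sorted(set(genuine_scores) | set(impostor_scores)):
        # FAR: impostor trials accepted; FRR: genuine trials rejected.
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < thr for s in genuine_scores) / len(genuine_scores)
        if best is None or abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2.0, thr)
    return best[1], best[2]

genuine = [0.9, 0.8, 0.7, 0.4]   # same-speaker trial scores
impostor = [0.6, 0.3, 0.2, 0.1]  # different-speaker trial scores
print(equal_error_rate(genuine, impostor))
```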
463 |
464 |
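465 | 12. **Speaker Diarization**
466 | - Diarization builds on speaker recognition, so the quality of the speaker model determines the quality of the segmentation; detecting speaker change points matters as well. First, voice activity detection (VAD) splits the audio at silences. The simplest VAD thresholds the amplitude; with background sound a more robust VAD is needed. Change points can be detected with the Bayesian Information Criterion (BIC). The final step is clustering, which decides which segments belong to the same speaker; for unsupervised algorithms, prior knowledge of the number of speakers is especially important. Several deep-learning frameworks exist, such as Google's UIS-RNN (in effect an alternative clustering method) and S4D from the LIUM team in France.
467 | + Related papers: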
468 | ```
469 | 1. FULLY SUPERVISED SPEAKER DIARIZATION
470 | 2. SPEAKER DIARIZATION WITH LSTM
471 | 3. S4D: Speaker Diarization Toolkit in Python
472 | ```
473 | + Open-source code:
474 | * https://github.com/google/uis-rnn
475 | * https://github.com/wq2012/SpectralCluster
476 | * https://projets-lium.univ-lemans.fr/s4d
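The simplest VAD described above, thresholding the amplitude, can be sketched as a frame-energy detector. The frame length and threshold are arbitrary illustrative values, and the function names are hypothetical; noisy recordings need a more robust detector.

```python
def energy_vad(samples, frame_len, threshold):
    """Mark each frame as speech (True) when its mean energy exceeds the threshold."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        flags.append(energy > threshold)
    return flags

def speech_segments(flags, frame_len):
    """Merge consecutive speech frames into (start_sample, end_sample) segments."""
    segments, start = [], None
    for i, is_speech in enumerate(flags):
        if is_speech and start is None:
            start = i * frame_len
        elif not is_speech and start is not None:
            segments.append((start, i * frame_len))
            start = None
    if start is not None:
        segments.append((start, len(flags) * frame_len))
    return segments

# 40 silent samples, 40 loud samples, 40 silent samples.
samples = [0.0] * 40 + [0.5, -0.5] * 20 + [0.0] * 40
flags = energy_vad(samples, frame_len=20, threshold=0.01)
print(flags, speech_segments(flags, frame_len=20))
```

The resulting segments are what the later stages consume: each one is embedded with a speaker model and then clustered.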
477 |
478 |
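479 | 13. **Speech Synthesis (Text To Speech)**
480 | - Traditional text-to-speech concatenates recorded speech units, which sounds stiff and lacks intonation. Baidu, Google, Facebook, and others have published many deep-learning methods. The usual pipeline is an encoder followed by a decoder that predicts acoustic features such as mel spectrograms, which the Griffin-Lim algorithm or an autoregressive WaveNet vocoder then converts to a waveform.
481 | WaveNet series (acoustic features --> waveform), DeepVoice series, Tacotron series, VoiceLoop, ClariNet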
482 |
483 | + Related papers:
484 | ```
485 | 1. VOICELOOP: VOICE FITTING AND SYNTHESIS VIA A PHONOLOGICAL LOOP
486 | 2. TACOTRON: TOWARDS END-TO-END SPEECH SYNTHESIS
487 | 3. NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS
488 | 4. Deep Voice: Real-time Neural Text-to-Speech
489 | 5. Deep Voice 2: Multi-Speaker Neural Text-to-Speech
490 | 6. DEEP VOICE 3: 2000-SPEAKER NEURAL TEXT-TO-SPEECH
491 | 7. WAVENET: A GENERATIVE MODEL FOR RAW AUDIO
492 | 8. Parallel WaveNet: Fast High-Fidelity Speech Synthesis
493 | 9. ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech
494 | 10. SAMPLE EFFICIENT ADAPTIVE TEXT-TO-SPEECH
495 | 11. FastSpeech: Fast, Robust and Controllable Text to Speech
496 | ```
497 | + Open-source code:
498 | * https://github.com/ibab/tensorflow-wavenet
499 | * https://github.com/keithito/tacotron
500 | * https://github.com/Kyubyong/tacotron
501 | * https://github.com/c1niv/Voiceloop_TensorFlow
502 | * https://github.com/israelg99/deepvoice
503 | * https://github.com/andabi/parallel-wavenet-vocoder
504 | * https://github.com/xcmyz/FastSpeech
505 |
506 |
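507 | 14. **Voice Conversion**
508 | - Voice conversion is treated here as multi-speaker TTS: the same text is rendered as different waveforms depending on the speaker. Most systems add a speaker embedding vector to the network, as in DeepVoice2/DeepVoice3 and Tacotron2; some also condition the vocoder on it, for example WaveNet.
509 | + Open-source code: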
510 | * https://github.com/r9y9/deepvoice3_pytorch
511 | * https://github.com/Kyubyong/deepvoice3
512 | * https://github.com/Rayhane-mamah/Tacotron-2
513 | * https://github.com/GSByeon/multi-speaker-tacotron-tensorflow
514 |
515 |
516 | 15. **Face Biometrics (Age and Gender Estimation)**
517 | - The classic DEX model and the compact SSR-Net model
518 | + Related papers:
519 | ```
520 | 1. DEX: Deep EXpectation of apparent age from a single image
521 | 2. Age Progression/Regression by Conditional Adversarial Autoencoder
522 | 3. SSR-Net: A Compact Soft Stagewise Regression Network for Age Estimation
523 | 4. Deep Regression Forests for Age Estimation
524 | ```
525 | + Open-source code:
526 | * https://github.com/truongnmt/multi-task-learning
527 | * https://github.com/ZZUTK/Face-Aging-CAAE
528 | * https://github.com/yu4u/age-gender-estimation
529 | * https://github.com/shamangary/SSR-Net
530 | * https://github.com/shenwei1231/caffe-DeepRegressionForests
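The idea behind DEX (Deep EXpectation) is to treat age estimation as classification over age bins and report the expected value of the softmax output instead of the argmax. A minimal sketch, assuming 101 bins for ages 0 to 100 and made-up logits in place of a network output:

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_age(logits, ages):
    """DEX-style estimate: expectation of the age bins under the softmax distribution."""
    return sum(p * a for p, a in zip(softmax(logits), ages))

ages = list(range(101))  # assumed bins: one per year from 0 to 100
# Made-up logits peaked around 30; a real model would produce these from a face crop.
logits = [-(a - 30) ** 2 / 50.0 for a in ages]
print(expected_age(logits, ages))
```

Taking the expectation gives a continuous estimate and is less sensitive to the bin granularity than picking the single most likely bin.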
531 |
532 |
--------------------------------------------------------------------------------